Graphcore benchmarks show IPUs outperforming Nvidia

Author: EIS | Release Date: Jan 7, 2021


Graphcore has released the first set of performance benchmarks for its latest AI compute systems – the IPU-M2000 and the scale-out IPU-POD64.


Across a range of popular models, Graphcore technologies significantly outperformed Nvidia's A100 (DGX-based) in both training and inference.
Highlights include:
Training

EfficientNet-B4:      18x higher throughput
ResNeXt-101:           3.7x higher throughput
BERT-Large:            5.3x faster time to train on IPU-POD64 vs DGX A100 (>2.6x faster than dual-DGX systems)
Inference
LSTM:                        >600x throughput at lower latency
EfficientNet-B0:      60x throughput / >16x lower latency
ResNeXt-101:           40x throughput / 10x lower latency
BERT-Large:            3.4x higher throughput at lower latency
[Full Graphcore benchmarks. Nvidia benchmarks.]
Included in the benchmarks are results for BERT-Large, a Transformer-based natural language processing model, running across all 64 processors of an IPU-POD64.
With a time to train 5.3x faster than the latest Nvidia DGX A100 (equating to >2.6x faster than a dual-DGX setup), the BERT-Large result illustrates the strength of Graphcore's IPU-POD scale-out solution for datacentres. It also demonstrates the power of the Poplar software stack to manage complex workloads that take advantage of multiple processors working in parallel.
Commenting on the results, Matt Fyles, SVP Software at Graphcore, said: “This comprehensive suite of benchmarks demonstrates that Graphcore’s IPU-M2000 and the IPU-POD64 are outperforming GPUs across many popular models.
“The benchmarks for newer models, such as EfficientNet, are particularly illuminating, as they demonstrate how AI’s direction of travel increasingly favours the IPU’s specialist architecture over the legacy design of graphics processing units.
“That gap is only going to widen as customers demand compute systems that can handle sparsity to run massive models efficiently – things that the Graphcore IPU was built to excel at.”
MLCommons
In addition to publishing comprehensive benchmarks for its AI compute systems, Graphcore has announced its membership of MLCommons, the newly established body overseeing MLPerf.
Graphcore will be participating in MLCommons’ comparative benchmarking process from 2021. For more information, see MLCommons’ launch announcement.
Shipping now
The release of Graphcore’s new benchmarks coincides with the availability of IPU-M2000 and IPU-POD64 systems to customers worldwide. A number of early shipments are already installed and running in datacentres.
Sales are being supported by Graphcore’s global partner network, and by the company’s own sales and field engineering teams in Europe, Asia and the Americas.
PyTorch and Poplar 1.4
Graphcore users can now take advantage of v1.4 of the Poplar SDK, including full PyTorch support. PyTorch has become the framework of choice for developers working on cutting-edge AI research, as well as garnering a fast-growing following in the wider AI community.
The latest data from Papers with Code shows that 47% of published papers with associated code used the PyTorch framework (September 2020).
The addition of PyTorch support, combined with Poplar’s existing support for TensorFlow, means that the vast majority of AI applications can now be easily deployed on Graphcore systems.
As with other elements of the Poplar stack, Graphcore is open-sourcing its PyTorch for IPU interface library, allowing the community to contribute to, and accelerate, its development.
About the IPU-M2000 and IPU-POD
The IPU-Machine M2000 is a plug-and-play Machine Intelligence compute blade that has been designed for easy deployment and supports systems that can grow to massive scale.
The 1U blade delivers one PetaFlop of Machine Intelligence compute and includes integrated networking technology, optimized for AI scale-out, inside the box.
Each IPU-Machine M2000 is powered by four of Graphcore’s new 7nm Colossus™ Mk2 GC200 IPU processors, and is fully supported by our Poplar software stack.
The IPU-POD64 is Graphcore’s scale-out solution comprising 16 IPU-M2000 machines, pre-configured and connected using Graphcore’s ultra-high bandwidth IPU-Fabric technology.
IPU-POD64 is designed for customers requiring large-scale AI compute capability, either to run single workloads across multiple IPUs for parallel computation, or for shared use by multiple users via Graphcore’s Virtual-IPU software.
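The scale figures above follow from simple composition: four GC200 IPUs per IPU-M2000 blade, and 16 blades per IPU-POD64. A minimal sketch of that arithmetic (the 16-PetaFlop aggregate is derived here from the quoted one-PetaFlop-per-blade figure, not stated directly in the announcement):

```python
# Composition of an IPU-POD64 from the figures quoted above.
IPUS_PER_M2000 = 4        # four Colossus Mk2 GC200 IPUs per IPU-M2000 blade
M2000S_PER_POD64 = 16     # sixteen IPU-M2000 machines per IPU-POD64
PFLOPS_PER_M2000 = 1.0    # one PetaFlop of AI compute per 1U blade

total_ipus = IPUS_PER_M2000 * M2000S_PER_POD64
total_pflops = PFLOPS_PER_M2000 * M2000S_PER_POD64

# 64 IPUs matches the "all 64 processors" BERT-Large training run;
# the aggregate compute figure is an inference from the per-blade number.
print(total_ipus)    # 64
print(total_pflops)  # 16.0
```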