Part of one of Google’s Cloud TPU v4 pods
Digital transformation is creating artificial intelligence workloads on an unprecedented scale. These workloads require companies to collect and store mountains of data. While business intelligence is extracted from current machine learning models, new data streams are used to create new models and update existing ones.
Building AI models is complex and expensive. It is also very different from traditional software development. AI models require specialized hardware for accelerated computing and high-performance storage, as well as purpose-built infrastructure to handle the technical nuances of AI.
In today’s world, many critical business decisions and customer services depend on accurate machine learning insights. To train, run, and scale models as quickly and accurately as possible, a company must have the knowledge to choose the best hardware and software for its machine learning applications.
Performance Benchmarking
MLCommons
MLCommons is an open engineering consortium whose standardized benchmarking has made it easier for companies to make machine learning decisions. Its mission is to make machine learning better for everyone. Its tests and unbiased comparisons help companies determine which vendor best meets their AI application requirements. MLCommons ran the first MLPerf benchmarks in 2018.
MLCommons recently completed a benchmarking round called MLPerf Training v2.0 to measure the performance of the hardware and software used to train machine learning models. 250 performance results were reported by 21 different submitters, including Azure, Baidu, Google, and NVIDIA.
This series of tests aimed to determine how long it takes to train different neural networks. Faster model training speeds up model deployment, improving the model's total cost of ownership and ROI.
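To make the metric concrete, here is a minimal, hypothetical sketch of a time-to-train measurement: the wall-clock time until a model first reaches a target quality score. The `train_one_epoch` and `evaluate` callables are toy stand-ins, not MLPerf's actual reference implementations or run rules, which are far more detailed.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Wall-clock time until the model first reaches the target quality --
    the quantity MLPerf Training reports (lower is better)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_quality:
            return epoch, time.perf_counter() - start
    raise RuntimeError("target quality not reached within max_epochs")

# Toy stand-ins for a real training loop and evaluation metric.
state = {"accuracy": 0.0}

def fake_epoch():
    state["accuracy"] += 0.09  # each "epoch" nudges accuracy upward

def fake_eval():
    return state["accuracy"]

epochs, seconds = time_to_train(fake_epoch, fake_eval, target_quality=0.7)
print(f"reached target quality in {epochs} epochs, {seconds:.6f} s")
```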
A new object detection benchmark has been added to MLPerf Training 2.0, which trains the new RetinaNet reference model on a larger and more diverse dataset called Open Images. This new test reflects state-of-the-art ML training for applications such as collision avoidance for vehicles and robotics, retail analytics, and much more.
Results
Machine learning has seen a lot of innovation since 2021, in both hardware and software. For the first time since MLPerf’s debut, Google’s cloud-based TPU v4 ML supercomputer outperformed NVIDIA’s A100 in four of the eight training tests, which cover language (two), computer vision (four), reinforcement learning (one), and recommendation systems (one).
Chart: The higher, the better. TPUs showed significant acceleration in all five benchmarks conducted during the …
According to the chart comparing Google’s and NVIDIA’s performance, Google had the fastest training times for BERT (language), ResNet (image recognition), RetinaNet (object detection), and Mask R-CNN (image segmentation). On DLRM (recommendation), Google narrowly edged out NVIDIA, but that submission was a research project and not available for public use.
Overall, Google submitted scores for five of the eight benchmarks; its best training times are shown below:
Data: MLCommons
Speaking with Vikram Kasivajhula, Google’s director of product management for ML infrastructure, I asked what approach Google took to achieve such dramatic improvements with TPU v4.
“We focused on the problems of heavy model users innovating at the frontiers of machine learning,” he said. “Our cloud product is actually a realization of that goal. We also focused on performance per dollar. As you can imagine, these models grow incredibly large and expensive to train. One of our priorities is to make sure it’s affordable.”
A Unique Entry
A unique entry was submitted to MLPerf Training 2.0 by Stanford student Tri Dao, who trained BERT on an eight-A100 system.
NVIDIA also had an entry with the same configuration as Dao’s. I suspect this was a courtesy from NVIDIA to give Dao a documented point of comparison.
NVIDIA completed the BERT training with its eight A100s in 18.442 minutes, while Dao’s entry took only 17.402 minutes. He achieved the faster training time using a method called FlashAttention. Attention is a technique that mimics cognitive attention: it amplifies the important parts of the input data and de-emphasizes the rest, the motivation being that the network should focus more on the small but important parts of the data.
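For context, here is a minimal NumPy sketch of the standard scaled dot-product attention operation that FlashAttention accelerates. FlashAttention produces the same mathematical result but tiles the computation in fast on-chip GPU memory so the full sequence-by-sequence score matrix is never materialized in slow memory; the names and sizes below are illustrative, not Dao's actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(Q K^T / sqrt(d)) V.
    FlashAttention computes the same output, but in tiles, so the
    (seq_len x seq_len) score matrix never hits slow GPU memory."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d)    # token-pair similarities
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                              # weighted sum of values

# Toy usage: 4 tokens with 8-dimensional embeddings (illustrative sizes).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```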
Wrap-up
Over the past three years, Google has made a lot of progress with its TPU. Likewise, NVIDIA has been using its A100 successfully since 2020. Many software improvements have accrued to the A100, as evidenced by its long record of strong results.
We’ll likely see NVIDIA entries in 2023 with both the A100 and the new H100, a beast by any current standard. Everyone was hoping to see the performance of the H100 this year, but NVIDIA didn’t submit it because it wasn’t publicly available.
Software improvements in general were evident in the latest results. Kasivajhula said hardware is only half the story of Google’s improved benchmarks. The other half was software optimization.
“Many of the optimizations have been learned from our own leading YouTube and search benchmark use cases,” he said. “We are now making them available to users.”
Google has also made several performance improvements to the virtualization stack to fully utilize the computing power of CPU hosts and TPU chips. The results of Google’s software improvements were demonstrated by its top performance on the image and recommendation models.
Overall, Google’s Cloud TPUs deliver significant performance and cost savings at scale. It will take some time to see if the benefits are enough to entice more customers to switch to Google Cloud TPUs.
In the longer term, Google’s better results in the major categories could signal that NVIDIA will take fewer MLPerf wins in the future. It is in the ecosystem’s interest to see strong competition among multiple vendors for the best MLPerf performance results.
One thing is for sure: MLPerf Training 2.0 was much more interesting than previous rounds, in which NVIDIA took the top results in almost every category.
The full results of MLPerf Training 2.0 are available here.
Paul Smith-Goodson is Vice President and Principal Analyst for Quantum Computing, Artificial Intelligence and Space at Moor Insights and Strategy. You can follow him on Twitter for up-to-date information on quantum, AI, and space.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.
Moor Insights & Strategy, like all analyst firms in the research and technology sector, offers or has provided paid services to technology companies. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and speaking sponsorships. The company has had or currently has paid relationships with 8×8, Accenture, A10 Networks, AMD, Amazon, AT&T, Broadcom, Calix, Cisco, Cloudera, Dell, VMware, IBM, Jabil, Marvell, Micron, Microsoft, NortonLifeLock, Oracle, Palo Alto Networks, Poly, Qualcomm, Rambus, Red Hat, Synaptics, Verizon, Xilinx, Zendesk, and many other technology companies.