MLPerf Benchmarks

The NVIDIA AI platform achieves world-class performance and versatility in MLPerf Training, Inference, and HPC benchmarks for the most demanding, real-world AI workloads.

What Is MLPerf?

MLPerf™ benchmarks—developed by MLCommons, a consortium of AI leaders from academia, research labs, and industry—are designed to provide unbiased evaluations of training and inference performance for hardware, software, and services. They’re all conducted under prescribed conditions. To stay on the cutting edge of industry trends, MLPerf continues to evolve, holding new tests at regular intervals and adding new workloads that represent the state of the art in AI.

Inside the MLPerf Benchmarks

MLPerf Inference v4.1 measures inference performance on nine different benchmarks, including several large language models (LLMs), text-to-image, natural language processing, recommenders, computer vision, and medical image segmentation.

MLPerf Training v4.1 measures the time to train on seven different benchmarks, including LLM pre-training, LLM fine-tuning, text-to-image, graph neural network (GNN), computer vision, recommendation, and natural language processing.

MLPerf HPC v3.0 measures training performance across four different scientific computing use cases, including climate atmospheric river identification, cosmology parameter prediction, quantum molecular modeling, and protein structure prediction. 

Large Language Models

Deep learning algorithms trained on large-scale datasets that can recognize, summarize, translate, predict, and generate content for a breadth of use cases.

Text-to-Image

Generates images from text prompts.

Recommendation

Delivers personalized results in user-facing services such as social media or ecommerce websites by understanding interactions between users and service items, like products or ads.

Object Detection (Lightweight)

Finds instances of real-world objects such as faces, bicycles, and buildings in images or videos and specifies a bounding box around each.

Graph Neural Network

Uses neural networks designed to work with data structured as graphs.

Image Classification

Assigns a label from a fixed set of categories to an input image, a core computer vision task.

Natural Language Processing (NLP)

Understands text by using the relationship between different words in a block of text. Allows for question answering, sentence paraphrasing, and many other language-related use cases.

Biomedical Image Segmentation

Performs volumetric segmentation of dense 3D images for medical use cases.

Climate Atmospheric River Identification

Identifies hurricanes and atmospheric rivers in climate simulation data.

Cosmology Parameter Prediction

Solves a 3D image regression problem on cosmological data.

Quantum Molecular Modeling

Predicts energies or molecular configurations.

Protein Structure Prediction

Predicts three-dimensional protein structure based on one-dimensional amino acid connectivity.

NVIDIA MLPerf Benchmark Results

The NVIDIA HGX™ B200 platform, powered by NVIDIA Blackwell GPUs, fifth-generation NVLink™, and the latest NVLink Switch, delivered yet another giant leap for LLM training in MLPerf Training v4.1. Through relentless full-stack engineering at data center scale, NVIDIA continues to push the boundaries of generative AI training performance, accelerating the creation and customization of increasingly capable AI models.

NVIDIA Blackwell Supercharges LLM Training

MLPerf™ Training v4.1 results retrieved from https://mlcommons.org on November 13, 2024, from the following entries: 4.1-0060 (HGX H100, 2024) in the available category, 4.1-0082 (HGX B200, 2024) in the preview category. MLPerf™ Training v3.0 results, used for HGX H100 (2023), retrieved from entry 3.0-2069. HGX A100 result not verified by MLCommons Association. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See https://mlcommons.org for more information.

NVIDIA Continues to Deliver the Highest Performance at Scale

The NVIDIA platform, powered by NVIDIA Hopper™ GPUs, fourth-generation NVLink with third-generation NVSwitch™, and Quantum-2 InfiniBand, continued to demonstrate unmatched performance and versatility in MLPerf Training v4.1. NVIDIA delivered the highest performance at scale on all seven benchmarks.

Max-Scale Performance

Benchmark | Time to Train | Number of GPUs
LLM (GPT-3 175B) | 3.4 minutes | 11,616
LLM Fine-Tuning (Llama 2 70B-LoRA) | 1.2 minutes | 1,024
Text-to-Image (Stable Diffusion v2) | 1.4 minutes | 1,024
Graph Neural Network (R-GAT) | 0.9 minutes | 512
Recommender (DLRM-DCNv2) | 1.0 minutes | 128
Natural Language Processing (BERT) | 0.1 minutes | 3,472
Object Detection (RetinaNet) | 0.8 minutes | 2,528

MLPerf™ Training v4.1 results retrieved from https://mlcommons.org on November 13, 2024, from the following entries: NVIDIA 4.0-0058, NVIDIA 4.0-0053, NVIDIA 4.0-0007, NVIDIA 4.0-0054, NVIDIA 4.0-0053, NVIDIA + CoreWeave 4.0-0008, NVIDIA 4.0-0057, NVIDIA 4.0-0056, NVIDIA 4.0-0067.
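
The max-scale table pairs each time to train with a GPU count, so one way to compare the runs is to multiply the two into aggregate GPU-minutes. The sketch below does that with the published values. GPU-minutes is not an MLPerf metric and the function name is hypothetical. The published times are rounded to 0.1 minute, so the products are rough estimates.

```python
# Max-scale results copied from the table above: (benchmark, minutes, GPUs).
MAX_SCALE_RESULTS = [
    ("LLM (GPT-3 175B)", 3.4, 11616),
    ("LLM Fine-Tuning (Llama 2 70B-LoRA)", 1.2, 1024),
    ("Text-to-Image (Stable Diffusion v2)", 1.4, 1024),
    ("Graph Neural Network (R-GAT)", 0.9, 512),
    ("Recommender (DLRM-DCNv2)", 1.0, 128),
    ("Natural Language Processing (BERT)", 0.1, 3472),
    ("Object Detection (RetinaNet)", 0.8, 2528),
]


def gpu_minutes(minutes: float, gpus: int) -> float:
    """Aggregate GPU time: wall-clock minutes multiplied by GPU count.

    Illustrative helper only; this is not a metric defined by MLPerf.
    """
    return minutes * gpus


if __name__ == "__main__":
    for name, minutes, gpus in MAX_SCALE_RESULTS:
        print(f"{name}: {gpu_minutes(minutes, gpus):,.1f} GPU-minutes")
```

Because each benchmark trains a different model to a different quality target, these figures describe the scale of each run and are not a ranking across benchmarks.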

In its MLPerf Inference debut, the NVIDIA Blackwell platform with the NVIDIA Quasar Quantization System delivered up to 4X higher LLM performance compared to the prior-generation H100 Tensor Core GPU. Among available solutions, the NVIDIA H200 Tensor Core GPU, based on the NVIDIA Hopper architecture, delivered the highest performance per GPU for generative AI. This included all three LLM benchmarks (Llama 2 70B, GPT-J, and the newly added mixture-of-experts LLM, Mixtral 8x7B) as well as the Stable Diffusion XL text-to-image benchmark. Through relentless software optimization, H200’s performance increased by up to 27 percent in less than six months. For generative AI at the edge, NVIDIA Jetson Orin™ delivered outstanding results, boosting GPT-J throughput by more than 6X and reducing latency by 2.4X in just one round.

NVIDIA Blackwell Delivers Giant Leap for LLM Inference

Server: 4X
Offline: 3.7X

AI Superchip: 208B Transistors
2nd Gen Transformer Engine: FP4/FP6 Tensor Core
5th Generation NVLink: Scales to 576 GPUs
RAS Engine: 100% In-System Self-Test
Secure AI: Full Performance Encryption and TEE
Decompression Engine: 800 GB/sec


MLPerf Inference v4.1 Closed, Data Center. Results retrieved from https://mlcommons.org on August 28, 2024. Blackwell results measured on single GPU and retrieved from entry 4.1-0074 in the Closed, Preview category. H100 results from entry 4.1-0043 in the Closed, Available category on an 8x H100 system and divided by GPU count for per-GPU comparison. Per-GPU throughput is not a primary metric of MLPerf Inference. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See https://mlcommons.org for more information.

H200 Delivers Exceptional Multi-GPU Inference Throughput on Every Benchmark

Benchmark | Offline | Server
Llama 2 70B | 34,864 tokens/second | 32,790 tokens/second
Mixtral 8x7B | 59,022 tokens/second | 57,177 tokens/second
GPT-J | 20,086 tokens/second | 19,243 tokens/second
Stable Diffusion XL | 17.42 samples/second | 16.78 queries/second
DLRMv2 99% | 637,342 samples/second | 585,202 queries/second
DLRMv2 99.9% | 390,953 samples/second | 370,083 queries/second
BERT 99% | 73,310 samples/second | 57,609 queries/second
BERT 99.9% | 63,950 samples/second | 51,212 queries/second
RetinaNet | 14,439 samples/second | 13,604 queries/second
ResNet-50 v1.5 | 756,960 samples/second | 632,229 queries/second
3D U-Net | 54.71 samples/second | Not part of benchmark

MLPerf Inference v4.1 Closed, Data Center. Results retrieved from https://mlcommons.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0046, 4.1-0048, 4.1-0050.
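
The footnote above states that every H200 result was measured on eight GPUs. The Blackwell footnote earlier describes dividing system throughput by GPU count for per-GPU comparisons, and cautions that per-GPU throughput is not a primary MLPerf Inference metric. As a rough sketch of that arithmetic, the snippet below applies the same division to the H200 table and reports how close Server throughput comes to Offline throughput. The function names are hypothetical, and the result is a simple average over the system, not a separately measured value.

```python
from typing import Optional

GPUS_PER_SYSTEM = 8  # all H200 results above were measured on eight GPUs

# (benchmark, offline throughput, server throughput or None), from the table.
# Units differ by row (tokens, samples, or queries per second).
H200_RESULTS = [
    ("Llama 2 70B", 34864.0, 32790.0),
    ("Mixtral 8x7B", 59022.0, 57177.0),
    ("GPT-J", 20086.0, 19243.0),
    ("Stable Diffusion XL", 17.42, 16.78),
    ("DLRMv2 99%", 637342.0, 585202.0),
    ("DLRMv2 99.9%", 390953.0, 370083.0),
    ("BERT 99%", 73310.0, 57609.0),
    ("BERT 99.9%", 63950.0, 51212.0),
    ("RetinaNet", 14439.0, 13604.0),
    ("ResNet-50 v1.5", 756960.0, 632229.0),
    ("3D U-Net", 54.71, None),  # Server scenario is not part of this benchmark
]


def per_gpu(system_throughput: float, gpu_count: int = GPUS_PER_SYSTEM) -> float:
    """Average throughput per GPU: system throughput divided by GPU count."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return system_throughput / gpu_count


def server_to_offline_ratio(offline: float, server: Optional[float]) -> Optional[float]:
    """Fraction of Offline throughput retained in the Server scenario."""
    if server is None:
        return None
    return server / offline


if __name__ == "__main__":
    for name, offline, server in H200_RESULTS:
        ratio = server_to_offline_ratio(offline, server)
        ratio_text = "n/a" if ratio is None else f"{ratio:.1%}"
        print(f"{name}: {per_gpu(offline):,.2f} per GPU offline, "
              f"server/offline = {ratio_text}")
```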

The NVIDIA H100 Tensor Core GPU supercharged the NVIDIA platform for HPC and AI in its MLPerf HPC v3.0 debut, enabling up to 16X faster time to train in just three years and delivering the highest performance on all workloads across both time-to-train and throughput metrics. The NVIDIA platform was also the only one to submit results for every MLPerf HPC workload, which span climate segmentation, cosmology parameter prediction, quantum molecular modeling, and the latest addition, protein structure prediction. The unmatched performance and versatility of the NVIDIA platform make it the instrument of choice to power the next wave of AI-powered scientific discovery.

Up to 16X More Performance in Three Years

NVIDIA Full-Stack Innovation Fuels Performance Gains

MLPerf™ HPC v3.0 Results retrieved from https://mlcommons.org on November 8, 2023. Results retrieved from entries 0.7-406, 0.7-407, 1.0-1115, 1.0-1120, 1.0-1122, 2.0-8005, 2.0-8006, 3.0-8006, 3.0-8007, 3.0-8008. CosmoFlow score in v1.0 is normalized to new RCPs introduced in MLPerf HPC v2.0. Scores for v0.7, v1.0, and v2.0 are adjusted to remove data staging time from the benchmark, consistent with new rules adopted for v3.0 to enable fair comparisons between the submission rounds.

MLPerf™ HPC v3.0 Results retrieved from https://mlcommons.org on November 8, 2023. Results retrieved from entries 3.0-8004, 3.0-8009, and 3.0-8010.

The Technology Behind the Results

The complexity of AI demands a tight integration between all aspects of the platform. As demonstrated in MLPerf’s benchmarks, the NVIDIA AI platform delivers leadership performance with the world’s most advanced GPU, powerful and scalable interconnect technologies, and cutting-edge software—an end-to-end solution that can be deployed in the data center, in the cloud, or at the edge with amazing results.

Optimized Software that Accelerates AI Workflows

An essential component of NVIDIA’s platform and MLPerf training and inference results, the NGC™ catalog is a hub for GPU-optimized AI, HPC, and data analytics software that simplifies and accelerates end-to-end workflows. With over 150 enterprise-grade containers—including workloads for generative AI, conversational AI, and recommender systems; hundreds of AI models; and industry-specific SDKs that can be deployed on premises, in the cloud, or at the edge—NGC enables data scientists, researchers, and developers to build best-in-class solutions, gather insights, and deliver business value faster than ever.

Leadership-Class AI Infrastructure

Achieving world-leading results across training and inference requires infrastructure that’s purpose-built for the world’s most complex AI challenges. The NVIDIA AI platform delivered leading performance powered by the NVIDIA Blackwell platform, the Hopper platform, NVLink™, NVSwitch™, and Quantum InfiniBand. These are at the heart of the NVIDIA data center platform, the engine behind our benchmark performance.

In addition, NVIDIA DGX™ systems offer the scalability, rapid deployment, and incredible compute power that enable every enterprise to build leadership-class AI infrastructure. 

Unlocking Generative AI at the Edge With Transformative Performance

NVIDIA Jetson Orin offers unparalleled AI compute, large unified memory, and comprehensive software stacks, delivering superior energy efficiency to drive the latest generative AI applications. It’s capable of fast inference for any generative AI model powered by the transformer architecture, providing superior edge performance on MLPerf.


Large Language Models

MLPerf Training uses the GPT-3 generative language model with 175 billion parameters and a sequence length of 2,048 on the C4 dataset for the LLM pre-training workload. For the LLM fine-tuning test, it uses the Llama 2 70B model with the GovReport dataset and a sequence length of 8,192.

MLPerf Inference uses the Llama 2 70B model with the OpenORCA dataset; the Mixtral 8x7B model with the OpenORCA, GSM8K, and MBXP datasets; and the GPT-J model with the CNN-DailyMail dataset.

Text-to-Image

MLPerf Training uses the Stable Diffusion v2 text-to-image model trained on the LAION-400M-filtered dataset.

MLPerf Inference uses the Stable Diffusion XL (SDXL) text-to-image model with a subset of 5,000 prompts from the coco-val-2014 dataset. 

Recommendation

MLPerf Training and Inference use the Deep Learning Recommendation Model v2 (DLRMv2) that employs DCNv2 cross-layer and a multi-hot dataset synthesized from the Criteo dataset.

Object Detection (Lightweight)

MLPerf Training uses a Single-Shot Detector (SSD) with a ResNeXt50 backbone on a subset of the Google OpenImages dataset.

Graph Neural Network

MLPerf Training uses R-GAT with the Illinois Graph Benchmark (IGB) - Heterogeneous dataset.

Image Classification

MLPerf Inference uses ResNet-50 v1.5 with the ImageNet dataset.

Natural Language Processing (NLP)

MLPerf Training uses Bidirectional Encoder Representations from Transformers (BERT) on the Wikipedia 2020/01/01 dataset.

MLPerf Inference uses BERT with the SQuAD v1.1 dataset.

Biomedical Image Segmentation

MLPerf Inference uses 3D U-Net with the KiTS19 dataset.

Climate Atmospheric River Identification

Uses the DeepCAM model with the CAM5 + TECA simulation dataset.

Cosmology Parameter Prediction

Uses the CosmoFlow model with the CosmoFlow N-body simulation dataset.

Quantum Molecular Modeling

Uses the DimeNet++ model with the Open Catalyst 2020 (OC20) dataset.

Protein Structure Prediction

Uses the OpenFold model trained on the OpenProteinSet dataset.