Overview
AI inference—how we experience AI through chatbots, copilots, and creative tools—is scaling at a double-exponential pace. User adoption is accelerating, while the number of tokens generated per interaction, driven by agentic workflows, long-thinking reasoning, and mixture-of-experts (MoE) models, soars in parallel.
To enable inference at this massive scale, NVIDIA delivers data-center-scale architecture on an annual rhythm. Our extreme hardware and software codesign delivers order-of-magnitude leaps in performance and drives down the cost per token, making advanced AI experiences economically viable at scale.
NVIDIA GB300 NVL72 delivers 50x more tokens per watt and 35x lower cost per token than Hopper™, maximizing revenue within the same power budget and driving higher profit margins.
Benefits
With extreme hardware and software codesign, NVIDIA GB300 NVL72 delivers 50x more tokens per watt than Hopper, maximizing AI factory revenue within the same power budget. Continuous software optimizations extract maximum performance at chip, rack, and data center scale, further improving return on investment over time.
The NVIDIA GB300 NVL72 system delivers 35x lower cost per token than the NVIDIA Hopper platform, driving higher profit margins for AI factories. With each generation, performance improvements far outpace increases in infrastructure cost, creating better economics that enable advanced AI experiences at massive scale.
NVIDIA supports every model across generative AI, traditional ML, scientific computing, biology, and physical AI. From latency-sensitive real-time applications to high-throughput batch processing, NVIDIA delivers the best performance for every use case. The platform provides maximum flexibility and programmability to choose the optimal configuration for evolving workload and business requirements.
NVIDIA’s production-ready software, including Dynamo and TensorRT™-LLM, and native integration with leading frameworks such as PyTorch, vLLM, SGLang, and llm-d, deliver the most robust AI inference stack. As model architectures and inference techniques rapidly evolve, NVIDIA’s stack ensures the fastest path from innovation to production.
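As a concrete illustration of that path, here is a minimal offline-inference sketch using vLLM's Python API. This is not an official NVIDIA sample, and the model identifier is an assumption; any Hugging Face checkpoint your GPUs can hold works the same way.

```python
# Minimal vLLM offline-inference sketch (illustrative; not an official NVIDIA sample).
# The model identifier below is an assumption; substitute any checkpoint
# that fits on your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")  # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."], params
)
for out in outputs:
    print(out.outputs[0].text)
```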
Platform
Powerful hardware without smart orchestration wastes potential; great software without fast hardware means sluggish inference performance. NVIDIA’s inference platform delivers a continuously optimized full-stack solution with codesigned compute, networking, storage, and software to enable the highest performance across diverse workloads.
Explore some of the key NVIDIA hardware and software innovations.
Customer Stories
Resources
NVIDIA Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than NVIDIA Hopper™ for low-latency agentic workloads, through hardware–software codesign, according to SemiAnalysis InferenceX benchmarks (Q1 2026). The GB300 NVL72 combines 72 Blackwell Ultra GPUs with 288 GB HBM3e per GPU in a single rack-scale system, all interconnected through NVIDIA NVLink™ Switch into a unified NVLink fabric delivering 130 TB/s of bandwidth. This architecture minimizes all-to-all communication latency, enabling large-scale Mixture-of-Experts (MoE) models like DeepSeek-R1 to scale expert parallelism efficiently across up to 72 GPUs simultaneously.
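To make those rack-scale figures concrete, here is a quick back-of-the-envelope calculation using only the numbers quoted above:

```python
# Back-of-the-envelope aggregates for one GB300 NVL72 rack, using only the
# figures quoted above (decimal TB).
gpus_per_rack = 72        # Blackwell Ultra GPUs per rack
hbm_per_gpu_gb = 288      # HBM3e per GPU, in GB
nvlink_fabric_tb_s = 130  # unified NVLink fabric bandwidth, in TB/s

total_hbm_tb = gpus_per_rack * hbm_per_gpu_gb / 1000
print(f"Pooled HBM3e: ~{total_hbm_tb:.1f} TB across {gpus_per_rack} GPUs")
print(f"NVLink fabric: {nvlink_fabric_tb_s} TB/s aggregate bandwidth")
# -> Pooled HBM3e: ~20.7 TB across 72 GPUs
# -> NVLink fabric: 130 TB/s aggregate bandwidth
```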
Looking only at compute pricing or FLOPs per dollar gives an incomplete view of inference TCO. The most important metric for AI inference TCO is cost per token, or the price-performance actually delivered. GB300 NVL72 delivers AI inference at $0.123 per million tokens at 116 TPS/user interactivity using NVIDIA Dynamo and TensorRT-LLM—the lowest cost per token among major platforms, according to SemiAnalysis InferenceX benchmarks as of April 2026.
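For readers who want to sanity-check the metric itself, the sketch below shows how cost per million tokens falls out of hourly deployment cost and sustained throughput. Both inputs are hypothetical placeholders, not published prices or benchmark results; they are chosen only so the output lands at the quoted $0.123.

```python
# How cost per million tokens is derived from deployment cost and throughput.
# Both inputs are HYPOTHETICAL placeholders, not published figures; they are
# chosen only so the result matches the $0.123/M-token figure quoted above.
hourly_cost_usd = 350.0        # assumed all-in hourly cost of the deployment
tokens_per_second = 790_000.0  # assumed sustained aggregate token throughput

tokens_per_hour = tokens_per_second * 3600
cost_per_m_tokens = hourly_cost_usd / tokens_per_hour * 1_000_000
print(f"${cost_per_m_tokens:.3f} per million tokens")  # -> $0.123 per million tokens
```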
When evaluating inference TCO, it’s important to look at large-scale MoE and reasoning models such as DeepSeek-R1. Nearly all of the latest closed- and open-source LLMs have adopted MoE and reasoning architectures due to their superior intelligence and efficiency. Evaluating these models for inference TCO ensures your analysis is representative of what will likely be deployed in production.
Next Steps