AI Inference

NVIDIA Dynamo

Scale and Serve Generative AI, Fast.

Overview

Low-Latency Distributed Inference for Generative AI

NVIDIA Dynamo is an open-source modular inference framework for serving generative AI models in distributed environments. It enables seamless scaling of inference workloads across large GPU fleets with dynamic resource scheduling, intelligent request routing, optimized memory management, and accelerated data transfer.

When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, NVIDIA Dynamo increased the number of requests served by up to 30x, making it the ideal solution for AI factories looking to run at the lowest possible cost to maximize token revenue generation.

NVIDIA Dynamo supports all major AI inference backends and features large language model (LLM)-specific optimizations, such as disaggregated serving, accelerating and scaling AI reasoning models at the lowest cost and with the highest efficiency. It will be supported as a part of NVIDIA AI Enterprise in a future release.

What Is Distributed Inference?

Distributed inference is the process of running AI model inference across multiple computing devices or nodes to maximize throughput by parallelizing computations. 

This approach enables efficient scaling for large-scale AI applications, such as generative AI, by distributing workloads across GPUs or cloud infrastructure. Distributed inference improves overall performance and resource utilization by allowing users to optimize latency and throughput for the unique requirements of each workload.

Features

Explore the Features of NVIDIA Dynamo

Disaggregated Serving

Separates LLM context (prefill) and generation (decode) phases across distinct GPUs, enabling tailored model parallelism and independent GPU allocation to increase requests served per GPU.
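
The split described above can be illustrated with a toy sketch. This is not NVIDIA Dynamo's API: the class names (`PrefillWorker`, `DecodeWorker`, `KVCache`), the `serve` function, and the round-robin assignment are all hypothetical, and the "model" just echoes placeholder tokens. It only shows the shape of the idea: one pool handles the prompt, hands off cached state, and a separately sized pool generates tokens.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class KVCache:
    """Stand-in for the attention state produced by the prefill phase."""
    request_id: int
    prompt_tokens: list


@dataclass
class PrefillWorker:
    name: str
    handled: list = field(default_factory=list)

    def prefill(self, request_id, prompt):
        # Context phase: process the whole prompt in one pass.
        self.handled.append(request_id)
        return KVCache(request_id, prompt.split())


@dataclass
class DecodeWorker:
    name: str
    handled: list = field(default_factory=list)

    def decode(self, cache, max_new_tokens):
        # Generation phase: emit tokens one at a time from the cached state.
        self.handled.append(cache.request_id)
        return [f"tok{i}" for i in range(max_new_tokens)]


def serve(prompts, prefill_pool, decode_pool, max_new_tokens=3):
    """Send each request to a prefill worker, then to a decode worker."""
    prefill_rr, decode_rr = cycle(prefill_pool), cycle(decode_pool)
    outputs = {}
    for request_id, prompt in enumerate(prompts):
        cache = next(prefill_rr).prefill(request_id, prompt)
        # A real system would transfer the cache between GPUs at this point.
        outputs[request_id] = next(decode_rr).decode(cache, max_new_tokens)
    return outputs


# The two pools are sized independently: one prefill worker, two decode workers.
prefill_pool = [PrefillWorker("prefill-0")]
decode_pool = [DecodeWorker("decode-0"), DecodeWorker("decode-1")]
results = serve(["a b c", "d e", "f"], prefill_pool, decode_pool)
```

Because the pools are separate lists, each phase can be given a different number of workers, which is the property the feature description relies on.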

GPU Planner

Monitors GPU capacity in distributed inference environments and dynamically allocates GPU workers across context and generation phases to resolve bottlenecks and optimize performance.
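
As a rough illustration of dynamic allocation between phases, the sketch below moves one worker toward whichever phase is more backlogged. It is a simplified assumption, not the planner's actual policy: the `rebalance` function, the queue-depth-per-worker metric, and the `threshold` parameter are all hypothetical, and both pools are assumed to start with at least one worker.

```python
def rebalance(prefill_workers, decode_workers, prefill_queue, decode_queue,
              threshold=2.0):
    """Return new (prefill_workers, decode_workers) after at most one move.

    Load is modeled as queued requests per worker (a hypothetical metric).
    A worker moves toward the busier phase only when that phase's load is at
    least `threshold` times the other's and the donor pool keeps one worker.
    """
    prefill_load = prefill_queue / prefill_workers
    decode_load = decode_queue / decode_workers

    if (prefill_load > 0 and decode_workers > 1
            and prefill_load >= threshold * decode_load):
        return prefill_workers + 1, decode_workers - 1
    if (decode_load > 0 and prefill_workers > 1
            and decode_load >= threshold * prefill_load):
        return prefill_workers - 1, decode_workers + 1
    # Loads are close enough (or no worker can be spared): leave pools as-is.
    return prefill_workers, decode_workers


# Prefill is the bottleneck here, so one decode worker is reassigned.
new_split = rebalance(4, 4, prefill_queue=40, decode_queue=4)
```

Calling such a function on a timer with fresh queue depths would shift capacity gradually, one worker per step, rather than reshuffling the whole fleet at once.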

Smart Router

Routes inference traffic efficiently, minimizing costly recomputation of repeat or overlapping requests to preserve compute resources while ensuring balanced load distribution across large GPU fleets.
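
One way to picture the trade-off between avoiding recomputation and balancing load is a scoring function like the one below. This is a minimal sketch under stated assumptions, not the Smart Router's implementation: `pick_worker`, the `workers` dictionary layout, and the `load_weight` parameter are hypothetical, and cache reuse is approximated by the longest shared token prefix.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt_tokens, workers, load_weight=1.0):
    """Pick the worker with the best trade-off of cache reuse and load.

    `workers` maps a worker name to a dict with:
      - "cached": token sequences whose computed state the worker still holds
      - "active": number of requests currently in flight on that worker
    Score = longest reusable prefix - load_weight * active requests.
    """
    def score(name):
        info = workers[name]
        reuse = max(
            (shared_prefix_len(prompt_tokens, c) for c in info["cached"]),
            default=0,
        )
        return reuse - load_weight * info["active"]

    # sorted() makes ties deterministic: the alphabetically first name wins.
    return max(sorted(workers), key=score)


system_prompt = ["you", "are", "a", "helpful", "assistant"]
workers = {
    "gpu-a": {"cached": [system_prompt], "active": 2},
    "gpu-b": {"cached": [], "active": 0},
}
chosen = pick_worker(system_prompt + ["summarize", "this"], workers)
```

In this example `gpu-a` wins despite being busier, because reusing five cached tokens outweighs its two in-flight requests. Raising `load_weight` or the active count tips the choice toward the idle worker.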

NIXL Low-Latency Communication Library

Accelerates data movement in distributed inference settings while simplifying transfer complexities across diverse hardware, including GPUs, CPUs, networks, and storage.

Benefits

The Benefits of NVIDIA Dynamo

Seamlessly Scale From One GPU to Thousands of GPUs

Streamline and automate GPU cluster setup with prebuilt, easy-to-deploy tools and enable dynamic autoscaling with real-time LLM-specific metrics, avoiding over- or under-provisioning of GPU resources.

Increase Inference Serving Capacity While Reducing Costs

Leverage advanced LLM inference serving optimizations like disaggregated serving to increase the number of inference requests served without compromising user experience.

Future-Proof Your AI Infrastructure and Avoid Costly Migrations

Open and modular design allows you to easily pick and choose the inference-serving components that suit your unique needs, ensuring compatibility with your existing AI stack and avoiding costly migration projects.

Accelerate Time to Deploy New AI Models in Production

NVIDIA Dynamo’s support for all major frameworks—including TensorRT-LLM, vLLM, SGLang, PyTorch, and more—ensures your ability to quickly deploy new generative AI models, regardless of their backend.

Accelerate Distributed Inference

NVIDIA Dynamo is fully open source, giving you complete transparency and flexibility. Deploy NVIDIA Dynamo, contribute to its growth, and seamlessly integrate it into your existing stack.

Check it out on GitHub and join the community!

Develop

For individuals looking to get access to Triton Inference Server open-source code for development.

Develop

For individuals looking to access free Triton Inference Server containers for development.

Experience

Access NVIDIA-hosted infrastructure and guided hands-on labs that include step-by-step instructions and examples, available for free on NVIDIA LaunchPad.

Deploy

Get a free license to try NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.  

Use Cases

Deploying AI with NVIDIA Dynamo

Find out how you can drive innovation with NVIDIA Dynamo.

Serving Reasoning Models

Reasoning models generate more tokens to solve complex problems, increasing inference costs. NVIDIA Dynamo optimizes these models with features like disaggregated serving. This approach separates the prefill and decode computational phases onto distinct GPUs, allowing AI inference teams to optimize each phase independently. The result is better resource utilization, more queries served per GPU, and lower inference costs.

Customer Testimonials

See What Industry Leaders Have to Say About NVIDIA Dynamo

Cohere

“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage. We expect Dynamo will help us deliver a premier user experience to our enterprise customers.” Saurabh Baji, Senior Vice President of Engineering at Cohere

Perplexity AI

“Handling hundreds of millions of requests monthly, we rely on NVIDIA’s GPUs and inference software to deliver the performance, reliability, and scale our business and users demand. We’ll look forward to leveraging Dynamo with its enhanced distributed serving capabilities to drive even more inference serving efficiencies and meet the compute demands of new AI reasoning models.” Denis Yarats, CTO of Perplexity AI

Together AI

“Scaling reasoning models cost-effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing. Together AI provides industry leading performance using our proprietary inference engine. The openness and modularity of Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimizing resource utilization—maximizing our accelerated computing investment.” Ce Zhang, CTO of Together AI

Adopters

Leading Adopters Across All Industries

Amazon
American Express
Azure AI Translator
Encord
GE Healthcare
Infosys
Intelligent Voice
Nio
Siemens Energy
Trax Retail
USPS
Yahoo Japan

Resources

The Latest in NVIDIA Inference

Get the Latest News

Read about the latest inference updates and announcements for NVIDIA Dynamo.

Explore Technical Blogs

Read technical walkthroughs on how to get started with inference.

Take a Deep Dive

Get tips and best practices for deploying, running, and scaling AI models for inference for generative AI, LLMs, recommender systems, computer vision, and more.

Next Steps

Ready to Get Started?

Download on GitHub and join the community!

For Developers

Explore everything you need to start developing with NVIDIA Dynamo, including the latest documentation, tutorials, technical blogs, and more.

Get in Touch

Talk to an NVIDIA product specialist about moving from pilot to production with the security, API stability, and support of NVIDIA AI Enterprise.

Read the Press Release | Read the Tech Blog
