AI Inference
Scale and Serve Generative AI, Fast.
NVIDIA Dynamo is an open-source modular inference framework for serving generative AI models in distributed environments. It enables seamless scaling of inference workloads across large GPU fleets with dynamic resource scheduling, intelligent request routing, optimized memory management, and accelerated data transfer.
When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, NVIDIA Dynamo increased the number of requests served by up to 30x, making it the ideal solution for AI factories looking to maximize token revenue at the lowest possible cost.
NVIDIA Dynamo supports all major AI inference backends and features large language model (LLM)-specific optimizations, such as disaggregated serving, that accelerate and scale AI reasoning models at the lowest cost and with the highest efficiency. It will be supported as part of NVIDIA AI Enterprise in a future release.
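To make "serving" concrete, here is a minimal client sketch. It assumes a Dynamo deployment that exposes an OpenAI-compatible HTTP frontend; the base URL, API key, and model name below are placeholders for illustration, not a prescribed configuration.

```python
from openai import OpenAI

# Placeholder endpoint and model name; point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[
        {"role": "user", "content": "Explain disaggregated serving in one sentence."}
    ],
)
print(response.choices[0].message.content)
```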
Separates LLM context (prefill) and generation (decode) phases across distinct GPUs, enabling tailored model parallelism and independent GPU allocation to increase requests served per GPU.
Monitors GPU capacity in distributed inference environments and dynamically allocates GPU workers across context and generation phases to resolve bottlenecks and optimize performance.
Routes inference traffic efficiently, minimizing costly recomputation of repeat or overlapping requests to preserve compute resources while ensuring balanced load distribution across large GPU fleets.
Accelerates data movement in distributed inference settings while simplifying transfer complexities across diverse hardware, including GPUs, CPUs, networks, and storage.
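As a rough illustration of the routing idea described above (this is a toy sketch, not Dynamo's actual router), the snippet below scores each worker by how much of the incoming prompt it already has cached, penalized by its current load, so repeated or overlapping requests land where their prefix can be reused instead of being recomputed.

```python
# Toy KV-cache-aware routing sketch (illustration only; not Dynamo's router).
# Each worker advertises cached token prefixes and its current load; the
# router favors cache reuse while keeping load roughly balanced.
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

    def prefix_overlap(self, tokens: tuple[int, ...]) -> int:
        """Longest shared prefix (in tokens) with any cached sequence."""
        best = 0
        for cached in self.cached_prefixes:
            n = 0
            for a, b in zip(cached, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best


def route(tokens: tuple[int, ...], workers: list[Worker], load_penalty: float = 1.0) -> Worker:
    """Pick the worker where cache reuse is highest after penalizing its load."""
    return max(
        workers,
        key=lambda w: w.prefix_overlap(tokens) - load_penalty * w.active_requests,
    )


if __name__ == "__main__":
    workers = [
        Worker("gpu-0", active_requests=2, cached_prefixes=[(1, 2, 3, 4, 5)]),
        Worker("gpu-1", active_requests=0, cached_prefixes=[(9, 9)]),
    ]
    # A request sharing a long prefix with gpu-0's cache goes there despite
    # its higher load; an unrelated request goes to the idle worker.
    print(route((1, 2, 3, 4, 6, 7), workers).name)  # gpu-0
    print(route((7, 7, 7), workers).name)           # gpu-1
```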
Streamline and automate GPU cluster setup with prebuilt, easy-to-deploy tools, and enable dynamic autoscaling based on real-time, LLM-specific metrics to avoid over- or under-provisioning GPU resources.
Leverage advanced LLM inference serving optimizations like disaggregated serving to increase the number of inference requests served without compromising user experience.
An open, modular design lets you pick and choose the inference-serving components that suit your unique needs, ensuring compatibility with your existing AI stack and avoiding costly migration projects.
NVIDIA Dynamo’s support for all major frameworks, including TensorRT-LLM, vLLM, SGLang, PyTorch, and more, ensures you can quickly deploy new generative AI models, regardless of their backend.
NVIDIA Dynamo is fully open source, giving you complete transparency and flexibility. Deploy NVIDIA Dynamo, contribute to its growth, and seamlessly integrate it into your existing stack.
Check it out on GitHub and join the community!
Find out how you can drive innovation with NVIDIA Dynamo.
Reasoning models generate more tokens to solve complex problems, increasing inference costs. NVIDIA Dynamo optimizes these models with features like disaggregated serving. This approach separates the prefill and decode computational phases onto distinct GPUs, allowing AI inference teams to optimize each phase independently. The result is better resource utilization, more queries served per GPU, and lower inference costs.
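The sketch below is a minimal, CPU-only illustration of that split. The PrefillWorker and DecodeWorker classes are hypothetical stand-ins for the two GPU pools and a toy "model" replaces real inference; this shows the shape of the idea, not Dynamo's API.

```python
# Conceptual sketch of disaggregated serving (illustration only, no real GPUs):
# a prefill worker processes the full prompt once and produces a KV cache,
# which is handed to a separate decode worker that generates tokens one at a
# time. In a real deployment the two phases run on distinct GPU pools that are
# sized and parallelized independently.
from dataclasses import dataclass


@dataclass
class KVCache:
    prompt_tokens: list[int]  # stand-in for per-layer key/value tensors


class PrefillWorker:
    """Compute-bound phase: ingest the whole prompt in one pass."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        return KVCache(prompt_tokens=list(prompt_tokens))


class DecodeWorker:
    """Memory-bandwidth-bound phase: extend the sequence token by token."""

    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        generated: list[int] = []
        for _ in range(max_new_tokens):
            # Toy "model": next token is just a function of sequence length.
            next_token = (len(kv.prompt_tokens) + len(generated)) % 1000
            generated.append(next_token)
        return generated


if __name__ == "__main__":
    prompt = [101, 2009, 2003, 1037, 3231, 102]
    kv = PrefillWorker().prefill(prompt)                   # prefill GPU pool
    tokens = DecodeWorker().decode(kv, max_new_tokens=5)   # decode GPU pool
    print(tokens)
```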
Download on GitHub and join the community!
Explore everything you need to start developing with NVIDIA Dynamo, including the latest documentation, tutorials, technical blogs, and more.
Talk to an NVIDIA product specialist about moving from pilot to production with the security, API stability, and support of NVIDIA AI Enterprise.