Physical AI

NVIDIA Cosmos

Accelerate physical AI development with world foundation models.

Overview

What is NVIDIA Cosmos?

NVIDIA Cosmos™ is a platform of state-of-the-art generative world foundation models (WFM), advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline built to accelerate the development of physical AI systems such as autonomous vehicles (AVs) and robots.

Cosmos World Foundation Models Openly Available to Physical AI Developer Community

State-of-the-art models trained on millions of hours of driving and robotics video data to democratize physical AI development, available under an open model license.

The World Foundation Model Platform to Accelerate Physical AI Development

The new NVIDIA Cosmos platform accelerates the development of embodied physical AI systems such as robots and autonomous vehicles.

Benefits

Accelerate Physical AI Development With World Foundation Models

Cosmos provides developers with open and easy access to highly performant world foundation models and data pipelines, making physical AI development accessible to all.

Physics Aware

A suite of first-generation video models trained on 9,000 trillion tokens, including 20 million hours of robotics and driving data, that generates high-quality videos from multimodal inputs such as images, text, or video.

Open

Cosmos WFMs and tokenizers are available under the NVIDIA Open Model License, enabling developers worldwide to build physical AI systems at scale without high entry costs.

Accelerate Data Processing and Curation

Speed up data curation by 20X with the NVIDIA NeMo Curator pipeline of CUDA-X™ and NVIDIA AI-accelerated tooling for processing over 100 PB of data. The pipeline provides out-of-the-box optimizations, minimizing total cost of ownership (TCO) and accelerating time to market.

Develop Custom Models

The Cosmos tokenizer converts visual data into high-fidelity tokens with 8X better compression and 12X faster processing.
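
To make the compression figure concrete, the sketch below counts how many latent positions a clip occupies after temporal and spatial downsampling. The clip size and both sets of compression factors are illustrative assumptions, not published Cosmos tokenizer settings. They only show one way an 8X shorter token sequence can arise.

```python
import math

def latent_token_count(frames, height, width, temporal_factor, spatial_factor):
    """Number of latent positions after downsampling a clip.

    temporal_factor and spatial_factor are hypothetical compression
    factors chosen for illustration; they are not Cosmos settings.
    """
    t = math.ceil(frames / temporal_factor)
    h = math.ceil(height / spatial_factor)
    w = math.ceil(width / spatial_factor)
    return t * h * w

# Illustrative clip: 32 frames at 512 x 512 pixels.
baseline = latent_token_count(32, 512, 512, temporal_factor=4, spatial_factor=8)
denser = latent_token_count(32, 512, 512, temporal_factor=8, spatial_factor=16)

# Ratio of the two token counts: how much shorter the second sequence is.
compression_gain = baseline / denser
```

Doubling the downsampling factor along time, height, and width shrinks the sequence by 2 x 2 x 2 = 8, which is why small changes in tokenizer factors have a large effect on training cost.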

NVIDIA NeMo™ delivers accelerated training and fine-tuning to build multimodal generative AI models for physical AI.

Models

NVIDIA Cosmos World Foundation Models

A family of pre-trained models purpose-built for generating physics-aware videos and world states for physical AI development.


Learn more about model architectures, development resources, and availability here.

Family of State-of-the-Art Models

  • Autoregressive and diffusion models for Text-to-World and Video-to-World generation, available in parameter sizes ranging from 4 to 14 billion to suit various needs.
  • 12-billion-parameter upsampling model for refining text prompts, delivering enhanced accuracy and detail in generated outputs.
  • 7-billion-parameter model designed for decoding video sequences, optimized for augmented reality applications.

Built-In Guardrails

  • Pre-guard to filter brands, NSFW content, and harmful prompts.
  • Post-guard to remove questionable scenarios.
  • Guardrail to blur human faces.
  • Digital watermarks on synthetic videos generated from Preview APIs on NVIDIA API catalog.
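
The pre-guard and post-guard stages listed above can be pictured as two checks wrapped around generation. The following is a minimal sketch of that control flow only, not the Cosmos guardrail implementation. The keyword blocklist, the `Frame` fields, and every function name are hypothetical; a production system would rely on trained classifiers rather than word matching.

```python
from dataclasses import dataclass

# Hypothetical blocklist; a real pre-guard would use trained classifiers.
BLOCKED_TERMS = {"gore", "nsfw"}

@dataclass
class Frame:
    label: str                  # hypothetical safety label: "safe" or "unsafe"
    has_face: bool
    face_blurred: bool = False

def pre_guard(prompt):
    """Return True if the prompt may proceed to generation."""
    words = set(prompt.lower().split())
    return words.isdisjoint(BLOCKED_TERMS)

def post_guard(frames):
    """Reject the clip if any frame is flagged; otherwise blur faces."""
    if any(frame.label == "unsafe" for frame in frames):
        return None
    for frame in frames:
        if frame.has_face:
            frame.face_blurred = True
    return frames

def guarded_generate(prompt, generate):
    """Run a generator callable between the two guard stages."""
    if not pre_guard(prompt):
        return None
    return post_guard(generate(prompt))
```

Here `generate` is any callable that maps a prompt to a list of frames, so the guard logic can be exercised with a stub before a real model is attached.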

Benchmarks

Journey to Physical AI Performance

NVIDIA is working with the robotics and autonomous vehicle ecosystem to develop a set of benchmarks that reflect the unique requirements physical AI applications place on world foundation models.

Cosmos benchmarks are designed to evaluate the next generation of world models with advanced criteria like 3D consistency and physics alignment, essential for robotics and autonomous systems.

Compared to VideoLDM (VLDM), a baseline generative model for video synthesis, Cosmos WFMs excel in geometric accuracy with lower Sampson error and better temporal stability. Benchmarks also evaluate WFMs based on physical behaviors like gravity and collision dynamics.
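
For readers unfamiliar with the metric, the Sampson error is a standard first-order approximation of how far a pair of matched image points lies from satisfying the epipolar constraint, so lower values indicate more geometrically consistent frames. The sketch below implements the textbook formula in plain Python. It is not the Cosmos benchmark code, and the fundamental matrix and points are illustrative values.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (nested lists) by a length-3 vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def sampson_error(F, x1, x2):
    """Sampson error of a correspondence under fundamental matrix F.

    x1 and x2 are homogeneous image points [x, y, 1] in the first and
    second view. The result is zero when x2^T F x1 = 0 holds exactly.
    """
    Fx1 = mat_vec(F, x1)
    Ftx2 = mat_vec(transpose(F), x2)
    numerator = sum(x2[i] * Fx1[i] for i in range(3)) ** 2
    denominator = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return numerator / denominator

# Illustrative fundamental matrix for a purely horizontal camera shift:
# matched points must then lie on the same image row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

consistent = sampson_error(F, [10.0, 20.0, 1.0], [15.0, 20.0, 1.0])    # same row
inconsistent = sampson_error(F, [10.0, 20.0, 1.0], [15.0, 22.0, 1.0])  # 2 px off
```

With this matrix the error reduces to the squared row difference divided by two, so a two-pixel vertical mismatch gives an error of 2.0 and a perfect match gives 0.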

Cosmos WFMs consistently outperform VLDM on visual consistency, achieving up to 14X higher pose estimation success rates. While diffusion models deliver higher fidelity out of the box, autoregressive models deliver excellent performance for custom models.

Use Cases

How Developers Use NVIDIA Cosmos

See how developers across robotics, autonomous vehicles, and vision AI can use Cosmos to advance their work.

Video Search

Cosmos helps developers build bespoke datasets for their AI model training. Whether it’s snowy road footage for self-driving cars or busy warehouse scenes for robotics, Cosmos simplifies video tagging and search by understanding spatial and temporal patterns, making training data preparation easier.

This saves time, reduces costs, and helps deliver AI models that are highly relevant and impactful for real-world use.

Controllable 3D-to-Real Synthetic Data

Developers can take advantage of their 3D simulation data to generate photoreal synthetic video. By using Omniverse, they can create 3D environments that represent their model training needs. Next, they can generate photorealistic videos that are precisely controlled by 3D scenes for highly tailored synthetic datasets.

Policy Model Training and Evaluation

Cosmos world foundation models fine-tuned for action-conditioned video prediction enable scalable and reproducible training and evaluation of policy models, which define strategies for physical AI systems by mapping states to actions. Developers use these models to reduce reliance on risky real-world tests or complex simulations for tasks such as obstacle navigation and object manipulation, optimizing performance and improving reliability in real-world applications like robotics and autonomous vehicles.
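
The idea of evaluating a policy inside a learned world model, rather than on real hardware, can be sketched with a toy example. Everything here is a hypothetical stand-in: the "world model" is a one-line function over a 1D position instead of an action-conditioned video predictor, and the function names are not part of any Cosmos API.

```python
def toy_world_model(state, action):
    """Hypothetical stand-in for an action-conditioned predictor.

    A real WFM would predict future video frames; here the state is a
    1D position and the action is a displacement.
    """
    return state + action

def goal_seeking_policy(state, goal=10.0, max_step=1.0):
    """Policy: maps a state to an action (a bounded step toward the goal)."""
    delta = goal - state
    return max(-max_step, min(max_step, delta))

def evaluate_policy(policy, world_model, start_states, goal=10.0, horizon=20, tol=1e-6):
    """Fraction of start states from which the goal is reached in time."""
    successes = 0
    for state in start_states:
        # Roll the policy forward entirely inside the world model.
        for _ in range(horizon):
            state = world_model(state, policy(state))
        if abs(state - goal) <= tol:
            successes += 1
    return successes / len(start_states)

success_rate = evaluate_policy(goal_seeking_policy, toy_world_model, [0.0, 5.0, -15.0])
```

Two of the three start positions reach the goal within 20 steps, so the success rate is two thirds. Because the rollouts are deterministic, the evaluation is reproducible, which is the property the paragraph above highlights.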

Foresight

Cosmos brings advanced predictive intelligence to physical AI, enabling systems to anticipate future scenarios and make smarter decisions. Through foresight generation (generating predictive videos based on past data and text prompts), Cosmos empowers physical AI to select optimal actions, enhancing efficiency, adaptability, and safety in dynamic environments.

Multiverse Simulation

Using NVIDIA Omniverse, developers can simulate multiple Cosmos outcomes to evaluate real-time scenarios, accelerating decision-making and optimizing AI-driven systems like robotics and autonomous vehicles. Together, Cosmos and Omniverse enable physical AI models to explore all possible future outcomes, selecting the best path for enhanced precision and reliability in complex environments.
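
One way to picture simulating multiple outcomes and selecting the best path is Monte Carlo action selection: sample several possible futures for each candidate action, score them, and keep the action with the lowest average cost. The sketch below assumes a hypothetical stochastic world model over a 1D position with Gaussian noise. It samples a finite number of futures rather than enumerating every outcome, and none of the names are Cosmos or Omniverse APIs.

```python
import random

def sample_outcome(position, action, rng):
    """Hypothetical stochastic world model: one possible next position."""
    return position + action + rng.gauss(0.0, 0.1)

def expected_cost(position, action, goal, rng, num_futures=200):
    """Average distance to the goal over many sampled futures."""
    total = 0.0
    for _ in range(num_futures):
        total += abs(sample_outcome(position, action, rng) - goal)
    return total / num_futures

def choose_action(position, candidate_actions, goal, seed=0):
    """Pick the candidate action with the lowest average sampled cost."""
    rng = random.Random(seed)  # fixed seed keeps the comparison repeatable
    return min(candidate_actions,
               key=lambda action: expected_cost(position, action, goal, rng))

best = choose_action(position=0.0, candidate_actions=[-1.0, 0.0, 1.0], goal=1.0)
```

The candidate that moves toward the goal wins by a wide margin here. In a realistic setting the cost function would score generated video or simulated state rather than a single number.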

Ecosystem

Adopted by Leading Physical AI Innovators

Model developers from robotics, autonomous vehicles, and vision AI industries are using Cosmos to accelerate physical AI development.

Next Steps

Ready to Get Started?

Test drive a world foundation model in the NVIDIA API catalog or start building your world models using NVIDIA Cosmos.

Build Your Custom Models

NVIDIA NeMo provides an end-to-end pipeline to curate, tokenize, and fine-tune world models on any platform.

Start Curating Video Data For World Models

Accelerated data processing and curation pipeline powered by NVIDIA NeMo Curator and optimized for NVIDIA data center GPUs.

Frequently Asked Questions

How can I get started with Cosmos?

Physical AI developers can start now with Cosmos world foundation models available on the NGC catalog and Hugging Face. Cosmos also provides an end-to-end pipeline to fine-tune the foundation models with NVIDIA NeMo. Developers can use the Cosmos tokenizer from /NVIDIA/cosmos-tokenizer on GitHub and Hugging Face.

How are Cosmos world foundation models licensed?

Cosmos world foundation models are available to everyone under the NVIDIA Open Model License.

Can I fine-tune Cosmos world foundation models?

Yes, Cosmos supports fine-tuning with NeMo. You can efficiently train and fine-tune models with popular techniques like LoRA and reinforcement learning from human feedback (RLHF). You can also use PyTorch to continue training the WFMs on your own datasets.

Can I build my own world model from scratch with Cosmos?

Yes, you can use Cosmos to build from scratch with your preferred foundation model or model architecture. Start by using NeMo Curator for video data preprocessing. Then compress and decode your data with the Cosmos tokenizer. Once you have processed the data, you can train or fine-tune your model using NVIDIA NeMo.

How can I deploy my physical AI models?

Using NIM microservices, you can easily integrate your physical AI models into your applications across cloud, data centers, and workstations.

You can also use NVIDIA DGX Cloud to train AI models and deploy them anywhere at scale.

What is the difference between Cosmos and Cosmos Nemotron?

Cosmos and Cosmos Nemotron are both families of NVIDIA models designed to process and interpret visuals from the physical world.

Cosmos models are world foundation models that focus on predicting and generating physics-aware videos, helping to simulate and understand future states of virtual environments. In contrast, Cosmos Nemotron models are vision-language models that specialize in querying and summarizing images and video, enabling AI to interpret and respond to both physical and virtual visual data.

Together, they complement each other in enabling advanced AI capabilities grounded in visual understanding.