World Foundation Models

World foundation models (WFMs) are neural networks that simulate real-world environments as videos and predict accurate outcomes based on text, image, or video input. Physical AI developers use world foundation models to generate custom synthetic data and to build downstream AI models for training robots and autonomous vehicles.

What Is a World Model?

World models are generative AI models that understand the dynamics of the real world, including physics and spatial properties. They take input such as text, images, video, and movement data and generate videos, learning to represent and predict dynamics like motion, force, and spatial relationships from sensory data.

Generative Foundation Models

Foundation models are AI neural networks trained on massive unlabeled datasets to generate new data based on input data. Because of their generalizability, they can greatly accelerate the development of a wide range of generative AI applications. Developers can fine-tune these pre-trained models on smaller, task-specific datasets for a custom domain-specific model.

Developers can tap into the power of foundation models to generate high-quality data for training AI models in industrial and robotics applications, such as factory robots, warehouse automation, and autonomous vehicles on highways or difficult terrain. Physical AI systems require large-scale, visually, spatially, and physically accurate data for learning through realistic simulations. World foundation models generate this data efficiently at scale.

There are several types of world foundation models:

  • Prediction models – These models predict future world states and synthesize continuous motion based on a text prompt, an input video, or by interpolating between two images. They enable realistic, temporally coherent scene generation, making them valuable for applications like video synthesis, animation, and robotic motion planning.
  • Style transfer models – These models guide outputs based on specific inputs using ControlNet, a model network that conditions a model’s generation based on structured guidance such as segmentation maps, LiDAR scans, depth maps, or edge detection. By mirroring input instructions visually, these models can control layout and motion while producing diverse, photorealistic results grounded in a text prompt. This makes them useful for applications requiring structured image or video synthesis, like digital twin simulations and environment reconstruction.
  • Reasoning models – These models analyze temporal and spatial details, reason through chains of thought, and arrive at optimal solutions for decision-making. By integrating multi-step reasoning and contextual understanding, they enhance AI’s ability to solve complex tasks, such as predicting the best robotic manipulation strategy or optimizing logistics for autonomous systems.
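As an illustration of the interpolation idea behind prediction models, here is a toy sketch in plain NumPy (not any vendor's API). A real prediction model would synthesize physically plausible in-between frames; linear pixel blending only illustrates the concept.

```python
import numpy as np

def interpolate_frames(frame_a, frame_b, num_steps):
    """Blend two key frames into a sequence of in-between frames."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1 - a) * frame_a + a * frame_b for a in alphas]

start = np.zeros((4, 4))  # toy "image": all black
end = np.ones((4, 4))     # toy "image": all white
frames = interpolate_frames(start, end, num_steps=5)
```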

What Are the Real-World Applications of World Foundation Models?

World models, when used with 3D simulators, serve as virtual environments to safely streamline and scale training for autonomous machines. With the ability to generate, curate, and encode video data, developers can better train autonomous machines to sense, perceive, and interact with dynamic surroundings.

Autonomous Vehicles

World foundation models bring significant benefits to every stage of the autonomous vehicle (AV) pipeline. With pre-labeled, encoded video data, developers can curate and train the AV stack to recognize the behavior of vehicles, pedestrians, and objects more accurately. These models can also generate new scenarios, such as different traffic patterns, road conditions, weather, and lighting, to fill training gaps and expand testing coverage. They can also create predictive video simulations based on text and visual inputs, accelerating virtual training and testing.

Robotics

World foundation models generate photorealistic synthetic data and predictive world states to help robots develop spatial intelligence. Using virtual simulations powered by physical simulators, these models let robots practice tasks safely and efficiently, accelerating learning through rapid testing and training. They help robots adapt to new situations by learning from diverse data and experiences.

Modified world models enhance planning by simulating object interactions, predicting human behavior, and guiding robots to reach goals accurately. They also improve decision-making by running multiple simulations and learning from feedback. With virtual simulations, developers can reduce real-world testing risks, cutting time, costs, and resources.

What Are the Benefits of World Foundation Models?

Building a world model for a physical AI system like a self-driving car is resource- and time-intensive. First, gathering real-world datasets means driving around the globe across varied terrain and conditions, yielding petabytes of data and millions of hours of footage. Next, filtering and preparing this data demands thousands of hours of human effort. Finally, training these large models requires many GPUs and costs millions of dollars in compute.

World foundation models aim to capture the underlying structure and dynamics of the world, enabling more sophisticated reasoning and planning capabilities. Trained on vast amounts of curated, high-quality, real-world data, these neural networks serve as visually, spatially, and physically aware synthetic data generators for physical AI systems.

World foundation models allow developers to extend generative AI beyond the confines of 2D software and bring its capabilities into the real world while reducing the need for real-world trials. While AI’s power has traditionally been harnessed in digital domains, world models will unlock AI for tangible, real-world experiences.

Realistic Video Generation

World models can create more realistic and physically accurate visual content by understanding the underlying principles of how objects move and interact. These models can generate realistic 3D worlds on demand for many uses, including video games and interactive experiences. In certain cases, outputs from highly accurate world models can take the form of synthetic data, which can be leveraged for training perception AI.

Current AI video generation can struggle with complex scenes and has limited understanding of cause and effect. However, world models paired with 3D simulation platforms and software are showing the potential to demonstrate a deeper understanding of cause and effect in visual scenarios, such as simulating a painter leaving brush strokes on a canvas.

Predictive Intelligence

World foundation models help physical AI systems learn, adapt, and make better decisions by simulating real-world actions and predicting outcomes. They allow systems to "imagine" different scenarios, test actions, and learn from virtual feedback—just like a self-driving car practicing in a simulator to handle sudden obstacles or bad weather. By predicting possible outcomes, an autonomous machine can plan smarter actions without needing real-world trials, saving time and reducing risk.
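The "imagine, test, learn" loop described above can be sketched as a minimal random-shooting planner. Here a toy one-dimensional dynamics function stands in for a learned world model; all names and numbers are illustrative assumptions.

```python
import random

def world_model(state, action):
    """Toy learned dynamics: the imagined next state after an action."""
    return state + action

def plan(state, goal, horizon=5, num_candidates=500, seed=0):
    """Random-shooting planner: imagine many rollouts inside the world
    model and keep the action sequence whose imagined end state lands
    closest to the goal -- no real-world trials needed."""
    rng = random.Random(seed)
    best_cost, best_seq = float("inf"), None
    for _ in range(num_candidates):
        seq = [rng.uniform(-1, 1) for _ in range(horizon)]
        imagined = state
        for a in seq:
            imagined = world_model(imagined, a)
        cost = abs(goal - imagined)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq
```

Real systems replace the toy dynamics with a learned neural simulator and the random search with more sample-efficient optimizers, but the pattern of evaluating candidate futures before acting is the same.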

When combined with large language models (LLMs), world models help AI understand instructions in natural language and interact more effectively. For example, a delivery robot could interpret a spoken request to "find the fastest route" and simulate different paths to determine the best one.

This predictive intelligence makes physical AI models more efficient, adaptable, and safer—helping robots, autonomous vehicles, and industrial machines operate smarter in complex, real-world environments.

Improved Policy Learning

Policy learning entails exploring strategies to find the best actions. A policy model helps a system, like a robot, decide the best action to take based on its current state and the broader state of the world. It links the system’s state (e.g., position) to an action (e.g., movement) to achieve a goal or improve performance. A policy model can be derived from fine-tuning a model. Policy models are commonly used in reinforcement learning, where they learn through interaction and feedback.
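A minimal, hypothetical policy model might look like the proportional controller below: it maps the system's state (position) to an action (movement) toward a goal. This is a hand-written stand-in for a policy that would normally be learned through interaction and feedback.

```python
def make_policy(gain):
    """A minimal policy model: map the system's state to an action."""
    def policy(state, goal):
        # Move proportionally toward the goal position.
        return gain * (goal - state)
    return policy

policy = make_policy(gain=0.5)
state, goal = 0.0, 10.0
for _ in range(20):
    state += policy(state, goal)  # act, observe the new state, repeat
```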

Optimizing for Efficiency and Feasibility

World models help explore multiple strategies, rewarding the most effective outcomes to improve decision-making. Developers can add a reward module and run simulations to test and refine approaches. Complementary cost models track resource usage, ensuring strategies are both effective and efficient. Together, these systems accelerate learning and optimize performance for real-world tasks.
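One way to read the reward-plus-cost idea: score each candidate strategy by task reward minus a weighted resource penalty, then keep the best. A toy sketch with made-up strategy data:

```python
strategies = [
    {"name": "aggressive", "success": 0.9, "energy": 8.0},
    {"name": "cautious",   "success": 0.8, "energy": 2.0},
]

def score(strategy, energy_weight=0.05):
    """Reward module (task success) minus a cost-model penalty (energy)."""
    return strategy["success"] - energy_weight * strategy["energy"]

best = max(strategies, key=score)  # the cautious strategy wins here
```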

How Are World Models Built?

World models require extensive real-world data, particularly video and images, to learn dynamic behaviors in 3D environments. Neural networks with billions of parameters analyze this data to create and update a hidden state or an internal representation of the environment. This enables robots to understand and predict changes, such as perceiving motion and depth from videos, predicting hidden objects, and preparing to react to events that might happen. Continuous improvement of the hidden state through deep learning allows world models to adapt to new scenarios.
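The hidden-state idea can be sketched with a toy update rule, assuming a simple exponential blend in place of a learned recurrent network:

```python
import numpy as np

def update_hidden(hidden, observation, mix=0.2):
    """Toy recurrent update: fold each new observation into the hidden state."""
    return (1 - mix) * hidden + mix * observation

def predict_next(hidden):
    """Toy readout: the model's guess at the next observation."""
    return hidden

hidden = np.zeros(3)
for _ in range(30):
    observation = np.ones(3)  # the environment keeps showing the same scene
    hidden = update_hidden(hidden, observation)
```

A real world model learns both the update and the readout with billions of parameters, but the loop is the same: observe, update the internal representation, predict.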

Here are some of the key components for building world models:

Data Curation

Data curation is a crucial step for pretraining and continuous training of world models, especially when working with large-scale multimodal data. It involves processing steps like filtering, annotation, classification, and deduplication of image or video data to ensure high quality when training or fine-tuning highly accurate models.

In video processing, data curation starts with splitting and transcoding the video into smaller segments, followed by quality filtering to retain only high-quality data. State-of-the-art vision language models are used to annotate key objects or actions, while video embeddings enable semantic deduplication, removing redundant data.

The data is then organized and cleaned for training. Throughout this process, efficient data orchestration ensures smooth data flow among the GPUs to handle large-scale data and achieve high throughput.
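The semantic-deduplication step can be sketched with cosine similarity over clip embeddings. The embeddings below are tiny made-up vectors; in practice they come from a learned video encoder.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe(embeddings, threshold=0.95):
    """Keep a clip only if it is not a near-duplicate of an already-kept clip."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

clips = [np.array([1.0, 0.0]),     # clip A
         np.array([0.999, 0.01]),  # near-duplicate of A: dropped
         np.array([0.0, 1.0])]     # semantically different clip: kept
unique = dedupe(clips)
```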

Tokenization

Tokenization converts high-dimensional visual data into smaller units called tokens, facilitating machine learning processing. Tokenizers transform pixel redundancies in images and video into compact, semantic tokens, enabling efficient training of large-scale generative models and inference on limited resources. There are two main methods:

  • Discrete tokenization: Represents images and videos as integers.
  • Continuous tokenization: Represents images and videos as continuous vectors.

This approach enhances model learning speed and performance.
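Both methods can be illustrated with a toy patch tokenizer. Real tokenizers use learned encoders and learned codebooks (e.g., vector quantization), not raw pixel patches and a hand-written codebook as below.

```python
import numpy as np

def continuous_tokens(frame, patch=2):
    """Continuous tokenization: each patch becomes a real-valued vector."""
    h, w = frame.shape
    return [frame[i:i + patch, j:j + patch].flatten()
            for i in range(0, h, patch) for j in range(0, w, patch)]

def discrete_tokens(frame, codebook, patch=2):
    """Discrete tokenization: map each patch to its nearest codebook index."""
    return [int(np.argmin([np.linalg.norm(tok - code) for code in codebook]))
            for tok in continuous_tokens(frame, patch)]

frame = np.arange(16, dtype=float).reshape(4, 4)    # toy 4x4 "image"
codebook = [np.zeros(4), np.full(4, 10.0)]          # tiny two-entry codebook
ids = discrete_tokens(frame, codebook)              # integers, one per patch
```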

Fine-Tuning World Foundation Model

Foundation models are AI neural networks trained on vast unlabeled datasets to perform various generative tasks. Developers can train a model architecture from scratch or fine-tune a pretrained foundation model for downstream tasks using additional data.

World foundation models serve as generalist models, trained on extensive visual datasets to simulate physical environments. Using fine-tuning frameworks, these models can be specialized for precise applications in robotics, autonomous systems, and other physical AI domains. There are multiple approaches to fine-tuning a model:

  • Unsupervised fine-tuning – Involves adapting a model using unlabeled data, allowing it to learn representations and patterns from new datasets without explicit labels. This method is useful for broad generalization and domain adaptation.
  • Supervised fine-tuning – Uses labeled datasets where the model is explicitly guided to learn task-specific features. This approach enhances decision-making, improves structured pattern recognition, and ultimately develops reasoning capabilities for more complex AI-driven applications.

To get started easily and streamline the end-to-end development process, developers can leverage training frameworks, which include libraries, SDKs, and tools for data preparation, model training, optimization, performance evaluation, and deployment.

Reinforcement Learning

Reasoning models are trained by fine-tuning pre-trained large language models or large vision language models. They also use reinforcement learning to analyze and reason through a problem before reaching a decision.

Reinforcement learning (RL) is a machine learning approach where an AI agent learns by interacting with an environment and receiving rewards or penalties based on its actions. Over time, it optimizes decision-making to achieve the best possible outcome.
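The reward-driven loop can be sketched with an epsilon-greedy agent on a toy two-action problem. The reward numbers are made up; real RL for world models operates over far richer states and actions.

```python
import random

def train(arm_rewards, steps=2000, epsilon=0.1, seed=0):
    """Minimal RL loop: try actions, receive rewards, and shift toward
    the action with the best estimated payoff (epsilon-greedy)."""
    rng = random.Random(seed)
    values = [0.0] * len(arm_rewards)   # estimated reward per action
    counts = [0] * len(arm_rewards)
    for _ in range(steps):
        if rng.random() < epsilon:      # explore occasionally
            action = rng.randrange(len(arm_rewards))
        else:                           # otherwise exploit the best estimate
            action = max(range(len(arm_rewards)), key=lambda i: values[i])
        reward = arm_rewards[action] + rng.gauss(0, 0.1)  # noisy feedback
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values

values = train([0.2, 0.8])  # action 1 has the higher true reward
```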

Reinforcement learning enables WFMs to adapt, plan, and make informed decisions, making it essential for robotics, autonomous systems, and AI assistants that need to reason through complex tasks.


How to Get Started With World Foundation Models

NVIDIA Cosmos

NVIDIA Cosmos™ is a platform of state-of-the-art generative world foundation models, advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline, built to accelerate the development of physical AI systems such as autonomous vehicles (AVs) and robots.

Cosmos World Foundation Models

A family of pre-trained models purpose-built for generating physics-aware videos and world states for physical AI development.

NVIDIA Isaac GR00T

NVIDIA Isaac GR00T is an active research initiative and development platform designed to accelerate humanoid robotics. It includes a collection of robotics foundation models, workflows, and simulation tools.
