World foundation models (WFMs) are neural networks that simulate real-world environments as videos and predict accurate outcomes based on text, image, or video input. Physical AI developers use world foundation models to generate custom synthetic data or downstream AI models for training robots and autonomous vehicles.
World models are generative AI models that understand the dynamics of the real world, including its physics and spatial properties. They take text, image, video, and movement data as input and use it to generate videos. They acquire this physical understanding of real-world environments by learning to represent and predict dynamics like motion, force, and spatial relationships from sensory data.
Foundation models are AI neural networks trained on massive unlabeled datasets to generate new data based on input data. Because of their generalizability, they can greatly accelerate the development of a wide range of generative AI applications. Developers can fine-tune these pre-trained models on smaller, task-specific datasets to create custom, domain-specific models.
Developers can tap into the power of foundation models to generate high-quality data for training AI models in industrial and robotics applications, such as factory robots, warehouse automation, and autonomous vehicles on highways or difficult terrain. Physical AI systems require large-scale data that is visually, spatially, and physically accurate for learning through realistic simulations. World foundation models generate this data efficiently at scale.
World foundation models can take different forms and serve different roles across physical AI applications:
World models, when used with 3D simulators, serve as virtual environments to safely streamline and scale training for autonomous machines. With the ability to generate, curate, and encode video data, developers can better train autonomous machines to sense, perceive, and interact with dynamic surroundings.
World foundation models bring significant benefits to every stage of the autonomous vehicle (AV) pipeline. With pre-labeled, encoded video data, developers can curate and train the AV stack to recognize the behavior of vehicles, pedestrians, and objects more accurately. These models can also generate new scenarios, such as different traffic patterns, road conditions, weather, and lighting, to fill training gaps and expand testing coverage. In addition, they can create predictive video simulations based on text and visual inputs, accelerating virtual training and testing.
World foundation models generate photorealistic synthetic data and predictive world states to help robots develop spatial intelligence. Using virtual simulations powered by physical simulators, these models let robots practice tasks safely and efficiently, accelerating learning through rapid testing and training. They help robots adapt to new situations by learning from diverse data and experiences.
Modified world models enhance planning by simulating object interactions, predicting human behavior, and guiding robots to reach goals accurately. They also improve decision-making by running multiple simulations and learning from feedback. With virtual simulations, developers can reduce real-world testing risks, cutting time, costs, and resources.
Building a world model for a physical AI system, like a self-driving car, is resource- and time-intensive. First, gathering real-world datasets means driving around the globe across varied terrain and conditions, amassing petabytes of data and millions of hours of footage. Next, filtering and preparing this data demands thousands of hours of human effort. Finally, training these large models requires many GPUs and millions of dollars in compute.
World foundation models aim to capture the underlying structure and dynamics of the world, enabling more sophisticated reasoning and planning capabilities. Trained on vast amounts of curated, high-quality, real-world data, these neural networks serve as visually, spatially, and physically aware synthetic data generators for physical AI systems.
World foundation models allow developers to extend generative AI beyond the confines of 2D software and bring its capabilities into the real world while reducing the need for real-world trials. While AI’s power has traditionally been harnessed in digital domains, world models will unlock AI for tangible, real-world experiences.
World models can create more realistic and physically accurate visual content by understanding the underlying principles of how objects move and interact. These models can generate realistic 3D worlds on demand for many uses, including video games and interactive experiences. In certain cases, outputs from highly accurate world models can take the form of synthetic data, which can be leveraged for training perception AI.
Current AI video generation can struggle with complex scenes and has limited understanding of cause and effect. However, world models paired with 3D simulation platforms and software are showing the potential to demonstrate a deeper understanding of cause and effect in visual scenarios, such as simulating a painter leaving brush strokes on a canvas.
World foundation models help physical AI systems learn, adapt, and make better decisions by simulating real-world actions and predicting outcomes. They allow systems to "imagine" different scenarios, test actions, and learn from virtual feedback—just like a self-driving car practicing in a simulator to handle sudden obstacles or bad weather. By predicting possible outcomes, an autonomous machine can plan smarter actions without needing real-world trials, saving time and reducing risk.
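To make this concrete, here is a minimal sketch of planning by "imagination": a learned world model rolls out candidate action sequences virtually and keeps the one with the best predicted outcome. The `WorldModel` class, `reward_fn`, and the random-shooting strategy are illustrative stand-ins, not any specific library's API.

```python
import numpy as np

class WorldModel:
    """Hypothetical learned simulator: predicts the next state for an action."""
    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A trained neural network would go here; toy linear dynamics for demo.
        return state + 0.1 * action

def reward_fn(state: np.ndarray) -> float:
    # Example objective: stay close to the origin (e.g., hold a lane center).
    return -float(np.linalg.norm(state))

def plan(model: WorldModel, state: np.ndarray,
         horizon: int = 10, candidates: int = 64) -> np.ndarray:
    """Random-shooting planner: imagine rollouts, keep the best first action."""
    best_action, best_return = None, -np.inf
    for _ in range(candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, state.shape[0]))
        s, total = state.copy(), 0.0
        for a in actions:
            s = model.predict(s, a)   # imagined next state -- no real trial
            total += reward_fn(s)     # virtual feedback
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

print("chosen action:", plan(WorldModel(), np.array([1.0, -0.5])))
```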
When combined with large language models (LLMs), world models help AI understand instructions in natural language and interact more effectively. For example, a delivery robot could interpret a spoken request to "find the fastest route" and simulate different paths to determine the best one.
This predictive intelligence makes physical AI models more efficient, adaptable, and safer—helping robots, autonomous vehicles, and industrial machines operate smarter in complex, real-world environments.
Policy learning entails exploring strategies to find the best actions. A policy model helps a system, like a robot, decide the best action to take based on its current state and the broader state of the world. It links the system’s state (e.g., position) to an action (e.g., movement) to achieve a goal or improve performance. A policy model can be derived by fine-tuning a pretrained model. Policy models are commonly used in reinforcement learning, where they learn through interaction and feedback.
World models help explore multiple strategies, rewarding the most effective outcomes to improve decision-making. Developers can add a reward module and run simulations to test and refine approaches. A related variant, the cost model, tracks resource usage, ensuring models are both effective and efficient. Together, these systems accelerate learning and optimize performance for real-world tasks.
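The sketch below illustrates the idea under simplifying assumptions: a tiny policy network maps a state to an action, a hypothetical `reward_module` scores the outcome, and a REINFORCE-style update nudges the policy toward rewarded actions. The names and the toy task are illustrative, not taken from any particular framework.

```python
import torch
import torch.nn as nn

# Tiny policy model: maps a 2-D state to probabilities over two actions.
policy_net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

def reward_module(state, action):
    # Toy reward: +1 if the action moves toward the goal at x=0, else -1.
    goal_direction = 0 if float(state[0]) > 0 else 1
    return 1.0 if action == goal_direction else -1.0

for episode in range(200):
    state = torch.randn(2)                       # random starting state
    dist = torch.distributions.Categorical(policy_net(state))
    action = dist.sample()                       # policy picks an action
    reward = reward_module(state, action.item())
    loss = -dist.log_prob(action) * reward       # REINFORCE-style update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```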
World models require extensive real-world data, particularly video and images, to learn dynamic behaviors in 3D environments. Neural networks with billions of parameters analyze this data to create and update a hidden state: an internal representation of the environment. This enables robots to understand and predict changes, such as perceiving motion and depth from videos, inferring hidden objects, and preparing to react to events before they happen. Continuous improvement of the hidden state through deep learning allows world models to adapt to new scenarios.
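As an illustration of the hidden state, here is a minimal recurrent world model that folds each new observation into an internal representation and predicts the next observation from it. The dimensions and the GRU-based architecture are assumptions for this sketch, not a description of any production model.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)      # updates the hidden state
        self.decoder = nn.Linear(hidden_dim, obs_dim)   # predicts next observation

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        # observations: (time, batch, features), e.g., per-frame video features
        h = torch.zeros(observations.shape[1], self.rnn.hidden_size)
        predictions = []
        for obs in observations:                 # step through time
            h = self.rnn(obs, h)                 # fold observation into hidden state
            predictions.append(self.decoder(h))  # predict the next observation
        return torch.stack(predictions)

model = TinyWorldModel()
video_features = torch.randn(8, 4, 16)           # 8 time steps, batch of 4
next_obs_predictions = model(video_features)     # shape: (8, 4, 16)
```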
Here are some of the key components for building world models:
Data curation is a crucial step for pretraining and continuous training of world models, especially when working with large-scale multimodal data. It involves processing steps like filtering, annotation, classification, and deduplication of image or video data to ensure high-quality inputs when training or fine-tuning highly accurate models.
In video processing, data curation starts with splitting and transcoding the video into smaller segments, followed by quality filtering to retain only the high-quality data. State-of-the-art vision language models annotate key objects and actions, while video embeddings enable semantic deduplication to remove redundant data.
The data is then organized and cleaned for training. Throughout this process, efficient data orchestration ensures smooth data flow among the GPUs to handle large-scale data and achieve high throughput.
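One of these steps, semantic deduplication, can be sketched as follows: each clip is embedded, and a clip is dropped if its embedding is too similar to one already kept. The placeholder `embed_clip` function and the 0.98 similarity threshold are assumptions; a real pipeline would use a trained video encoder and approximate nearest-neighbor search at scale.

```python
import numpy as np

def embed_clip(clip: np.ndarray) -> np.ndarray:
    """Placeholder for a trained video encoder; returns a unit-length vector."""
    v = clip.mean(axis=0)                        # average over frames
    return v / (np.linalg.norm(v) + 1e-8)

def deduplicate(clips: list, threshold: float = 0.98) -> list:
    kept, kept_embeddings = [], []
    for clip in clips:
        e = embed_clip(clip)
        # Keep the clip only if it is not too similar to anything kept so far.
        if all(float(e @ k) < threshold for k in kept_embeddings):
            kept.append(clip)
            kept_embeddings.append(e)
    return kept

clips = [np.random.randn(16, 128) for _ in range(10)]  # (frames, features)
clips.append(clips[0] + 1e-4)                          # a near-duplicate
unique = deduplicate(clips)
print(f"{len(clips)} clips in, {len(unique)} after semantic deduplication")
```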
Tokenization converts high-dimensional visual data into smaller units called tokens, facilitating machine learning processing. Tokenizers transform the pixel-level redundancy in images and video into compact, semantic tokens, enabling efficient training of large-scale generative models and inference on limited resources. There are two main methods: continuous tokenization, which maps visual data into continuous latent embeddings (commonly paired with diffusion models), and discrete tokenization, which maps it into quantized token indices (commonly paired with autoregressive models). Either way, tokenization speeds up model learning and improves performance.
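To show the discrete flavor in miniature, the toy sketch below splits a frame into patches and snaps each patch to its nearest entry in a codebook, yielding one integer token per patch. The codebook here is random for illustration; real VQ-style tokenizers learn the encoder and codebook jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 48))    # 512 "learned" token embeddings

def tokenize(frame: np.ndarray, patch: int = 4) -> np.ndarray:
    """Map each patch of the frame to the index of its nearest codebook entry."""
    h, w = frame.shape[:2]
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vec = frame[i:i + patch, j:j + patch].reshape(-1)  # flatten 4x4x3 patch
            tokens.append(int(np.argmin(((codebook - vec) ** 2).sum(axis=1))))
    return np.array(tokens)

frame = rng.normal(size=(32, 32, 3))     # one 32x32 RGB frame
token_ids = tokenize(frame)
print(token_ids.shape)                   # (64,) -- one token per patch
```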
Foundation models are AI neural networks trained on vast unlabeled datasets to perform various generative tasks. Developers can train a model architecture from scratch or fine-tune a pretrained foundation model for downstream tasks using additional data.
World foundation models serve as generalist models, trained on extensive visual datasets to simulate physical environments. Using fine-tuning frameworks, these models can be specialized for precise applications in robotics, autonomous systems, and other physical AI domains. There are multiple approaches to fine-tuning a model.
To get started easily and streamline the end-to-end development process, developers can leverage training frameworks, which include libraries, SDKs, and tools for data preparation, model training, optimization, performance evaluation, and deployment.
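A common pattern, shown as a generic sketch below, is to freeze a pretrained backbone and train only a small task-specific head on curated domain data. The `backbone` and `task_head` modules are placeholders, not the API of any particular framework.

```python
import torch
import torch.nn as nn

# Placeholder for a pretrained world foundation model backbone.
backbone = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
for p in backbone.parameters():
    p.requires_grad = False              # keep pretrained weights fixed

task_head = nn.Linear(128, 10)           # new, task-specific output layer
optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 256)             # stand-in for curated domain data
    y = torch.randint(0, 10, (32,))      # stand-in labels
    with torch.no_grad():
        features = backbone(x)           # reuse the general representations
    loss = loss_fn(task_head(features), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```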
Reasoning models are trained by fine-tuning pre-trained large language models or large vision language models. They also use reinforcement learning to analyze and reason through a problem before reaching a decision.
Reinforcement learning (RL) is a machine learning approach where an AI agent learns by interacting with an environment and receiving rewards or penalties based on its actions. Over time, it optimizes decision-making to achieve the best possible outcome.
Reinforcement learning enables WFMs to adapt, plan, and make informed decisions, making it essential for robotics, autonomous systems, and AI assistants that need to reason through complex tasks.
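The reward-driven loop can be shown with a compact tabular Q-learning example: the agent acts, observes a reward, and updates its value estimates until the best actions emerge. The one-dimensional corridor environment is a toy stand-in for a real simulator.

```python
import numpy as np

n_states, n_actions = 5, 2               # 1-D corridor; actions: 0=left, 1=right
Q = np.zeros((n_states, n_actions))      # value estimate per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:             # goal is the rightmost cell
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)        # explore
        else:
            a = int(np.argmax(Q[s]))                # exploit current estimate
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else -0.01   # reward signal
        # Temporal-difference update toward the best next-state value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q[:-1], axis=1))         # learned policy: 1 (right) in each state
```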