World Foundation Models

World foundation models are neural networks that simulate real-world environments and predict accurate outcomes based on text, image, or video input. Physical AI systems like robots and autonomous vehicles (AVs) use world foundation models to accelerate training and testing.

What Is a World Model?

World models are generative AI models that understand the dynamics of the real world, including physics and spatial properties. They learn to represent and predict dynamics like motion, force, and spatial relationships from sensory data, giving them a grasp of the physical qualities of real-world environments.

Generative Foundation Models

Foundation models are AI neural networks trained on massive unlabeled datasets that can accomplish a broad range of tasks. Because of their generalizability, they can greatly accelerate the development of a wide range of generative AI applications. Developers can fine-tune the foundation model on specific datasets, customizing and iterating generative AI applications much faster than was previously possible. 

With world foundation models, developers can tap into the power of foundation models to build world models for downstream applications or specific domains, such as a factory floor, warehouse, or highway. This is critical for developing physical AI systems, which require visually, spatially, and physically accurate data to learn.

What Are the Real-World Applications of World Foundation Models?

World models serve as virtual environments to safely streamline and scale training for autonomous machines. With the ability to generate, curate, and encode video data, developers can better train autonomous machines to sense, perceive, and interact with dynamic surroundings.

Autonomous Vehicles

World foundation models bring significant benefits to every stage of the autonomous vehicle (AV) pipeline. With pre-labeled, encoded video data, developers can more easily curate training data and more accurately train the AV stack to understand the intent of surrounding vehicles, pedestrians, and objects. World models can also generate new scenarios, including pedestrians, traffic, and road conditions, helping address gaps in training or scale testing to new locations.

Robotics

World foundation models help robots build spatial intelligence by simulating virtual environments for them to learn from. Simulated environments improve data efficiency and allow rapid iteration and parallel training. This not only accelerates the robot's learning but also ensures safety by enabling exploration in a controlled setting.

World foundation models contribute to better generalization and adaptability by integrating various input modalities, supporting transfer learning, and adjusting to environmental changes. They empower robots to master complex tasks by planning over extended horizons, simulating interactions with objects, and predicting human behaviors. Furthermore, they optimize policy learning through simulated scenarios and the use of actor-critic methods.

What Are the Benefits of World Foundation Models?

Building a world model for a physical AI system, like a self-driving car, is resource- and time-intensive. First, gathering real-world datasets by driving around the globe in various terrains and conditions yields petabytes of data and millions of hours of footage. Next, filtering and preparing this data demands thousands of hours of human effort. Finally, training these large models requires millions of dollars in GPU compute.

World foundation models aim to capture the underlying structure and dynamics of the world, enabling more sophisticated reasoning and planning capabilities. Trained on vast amounts of curated, high-quality, real-world data, these neural networks serve as powerful physical simulators and synthetic data generators for physical AI systems.

World foundation models allow developers to extend generative AI beyond the confines of 2D software and bring its capabilities into the real world in the form of physical AI. While AI’s power has traditionally been harnessed in digital domains, world models will unlock AI for tangible, real-world experiences.

Realistic Video Generation

World models can create more realistic and physically accurate visual content by understanding the underlying principles of how objects move and interact. These models have the potential to generate realistic 3D worlds on demand for many uses, including video games and interactive experiences. In certain cases, outputs from highly accurate world models can take the form of synthetic data, which can be leveraged for training perception AI.

Current AI video generation can struggle with complex scenes and has limited understanding of cause and effect. But world models are showing the potential to demonstrate a deeper understanding of cause and effect in visual scenarios, such as simulating a painter leaving brush strokes on a canvas.

Scene Adaptability

World models can learn to predict intricate physical phenomena with remarkable accuracy, potentially outperforming traditional simulation methods. They excel at handling the non-linear and chaotic systems that often challenge conventional physics-based simulations, and once trained, they can run simulations more efficiently than traditional methods, particularly for complex systems. They are also exceptionally adaptable, learning to simulate a variety of physical systems without explicit programming of the physical laws for each scenario.

Enhanced Generalization and Decision Making

World models enable physical AI systems to learn and adapt to different environments by testing actions and receiving feedback. By learning from training data, agents can reduce the need for real-world interaction, improving sample efficiency. This allows agents to "imagine" and plan future actions by simulating potential outcomes, leading to more informed decision making. Additionally, understanding the environment's dynamics helps agents generalize to new situations and explore more efficiently, as they can evaluate potential action sequences without real-world execution.
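
To make this concrete, here is a minimal sketch of planning in imagination: candidate action sequences are rolled out with a learned world model and scored, and only the best first action is executed in the real world. The dynamics_model and reward_fn callables and the 2-D action space are illustrative assumptions, not any particular system’s API.

    import numpy as np

    def plan_with_world_model(dynamics_model, reward_fn, state, horizon=10, n_candidates=100):
        """Random-shooting planner: imagine rollouts with the world model, keep the best."""
        best_return, best_first_action = -np.inf, None
        for _ in range(n_candidates):
            actions = np.random.uniform(-1.0, 1.0, size=(horizon, 2))  # hypothetical 2-D actions
            s, total = state, 0.0
            for a in actions:
                s = dynamics_model(s, a)  # imagined next state; no real-world step is taken
                total += reward_fn(s, a)  # score the imagined outcome
            if total > best_return:
                best_return, best_first_action = total, actions[0]
        return best_first_action  # execute this action, observe, then replan

Only the first action of the winning sequence is executed; the agent then replans at every step, so exploration happens in imagination rather than in the real world.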

Integrating large language models (LLMs) with world models can add semantic understanding, allowing a system to interpret and generate human-like language. It also brings additional multimodal capabilities, enabling more comprehensive interaction with the environment.

Improved Policy Learning

Policy learning entails exploring strategies to find the best actions. A policy model helps a system, like a robot, decide the best action to take based on its current state and the broader state of the world. It links the system’s state (e.g., position) to an action (e.g., movement) to achieve a goal or improve performance. A policy model can be derived by fine-tuning a world foundation model. Policy models are commonly used in reinforcement learning, where they learn through interaction and feedback.
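
As a rough sketch of what such a state-to-action mapping can look like, the small network below takes a state vector and outputs an action vector. The architecture and dimensions are illustrative assumptions rather than a reference design.

    import torch
    import torch.nn as nn

    class PolicyModel(nn.Module):
        """Maps a state observation (e.g., robot pose) to an action (e.g., motion command)."""
        def __init__(self, state_dim=12, action_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64),
                nn.ReLU(),
                nn.Linear(64, action_dim),
                nn.Tanh(),  # bound each action dimension to [-1, 1]
            )

        def forward(self, state):
            return self.net(state)

    policy = PolicyModel()
    action = policy(torch.randn(1, 12))  # one state in, one action out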

Fine-Tuning for Future Trajectories

Fine-tuning adapts a model previously trained on a large dataset to a specific task or domain by training it further on a smaller, task-specific dataset. It can also be used to improve quality with additional high-quality data.

For world foundation models, fine-tuning enables highly accurate prediction of future trajectories. By learning from past data, these models can anticipate how objects or systems will behave over time, allowing more precise planning and control. This capability is crucial for applications like autonomous driving, where predicting the movement of other vehicles and pedestrians is essential for safe navigation.
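
Schematically, the fine-tuning loop looks like ordinary training, just starting from pretrained weights and using a small, task-specific dataset of recorded trajectories. The loader and dataloader names below are hypothetical placeholders.

    import torch

    model = load_pretrained_world_model()  # hypothetical: restore pretrained weights
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR preserves prior knowledge
    loss_fn = torch.nn.MSELoss()

    for past_frames, future_trajectory in trajectory_dataloader:  # hypothetical DataLoader
        predicted = model(past_frames)  # predict how objects will move next
        loss = loss_fn(predicted, future_trajectory)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()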

Optimizing for Efficiency and Feasibility

Cost models within world foundation models help in evaluating the efficiency and feasibility of different actions or strategies. By simulating various scenarios, these models can estimate the costs associated with different decisions, such as energy consumption, time, or resources. This information is invaluable for optimizing operations and making cost-effective choices in real-world applications.
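
One way to picture this is a sketch that simulates a candidate action sequence with the world model and accumulates a cost for each step. The energy proxy and time weighting below are illustrative assumptions, not a standard formula.

    import numpy as np

    def plan_cost(world_model, state, actions, energy_weight=0.1, time_weight=1.0):
        """Estimate the cost of an action sequence by simulating it rather than executing it."""
        total_cost = 0.0
        for action in actions:
            state = world_model(state, action)         # imagined next state
            energy = float(np.sum(np.square(action)))  # proxy: larger actions use more energy
            total_cost += energy_weight * energy + time_weight  # each step also costs time
        return total_cost

    # Choosing the cheapest of several candidate plans (all names hypothetical):
    # best_plan = min(candidate_plans, key=lambda acts: plan_cost(world_model, s0, acts))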

How Are World Models Built?

World models require extensive real-world data, particularly video and images, to learn dynamic behaviors in 3D environments. Neural networks with billions of parameters analyze this data to create and update a hidden state, an internal representation of the environment. This enables robots to understand and predict changes, such as perceiving motion and depth from video, inferring occluded objects, and preparing to react to events before they happen. Continuous refinement of the hidden state through deep learning allows world models to adapt to new scenarios.
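
A minimal sketch of this hidden-state mechanism appears below: each observation and action updates an internal representation, from which the model predicts what it expects to see next. The recurrent architecture and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RecurrentWorldModel(nn.Module):
        """Maintains a hidden state that summarizes everything observed so far."""
        def __init__(self, obs_dim=128, action_dim=4, hidden_dim=256):
            super().__init__()
            self.encoder = nn.Linear(obs_dim, hidden_dim)               # embed each observation
            self.rnn = nn.GRUCell(hidden_dim + action_dim, hidden_dim)  # update the hidden state
            self.predictor = nn.Linear(hidden_dim, obs_dim)             # predict the next observation

        def step(self, hidden, obs, action):
            x = torch.cat([torch.relu(self.encoder(obs)), action], dim=-1)
            hidden = self.rnn(x, hidden)           # refreshed internal representation
            return hidden, self.predictor(hidden)  # what the model expects to see next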

Here are some of the key components for building world models:

Data Curation

Data curation is a crucial step in pretraining and continuous training of world models, especially when working with large-scale multimodal data. It involves processing steps like filtering, annotation, classification, and deduplication of image or video data to ensure the high-quality inputs needed to train or fine-tune highly accurate models.

In video processing, this starts with splitting and transcoding the video into smaller segments, followed by quality filtering to retain only high-quality clips. State-of-the-art vision language models annotate key objects and actions, while video embeddings enable semantic deduplication to remove redundant data.

The data is then organized and cleaned for training. Throughout this process, efficient data orchestration ensures smooth data flow among the GPUs to handle large-scale data and achieve high throughput.
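
The overall flow can be summarized in a few lines of Python; every helper below (split_and_transcode, quality_score, annotate_with_vlm, embed_video, deduplicate) is a hypothetical placeholder for the corresponding curation stage, not a real library call.

    def curate_videos(raw_videos, quality_threshold=0.8, similarity_threshold=0.95):
        """Sketch of the curation flow: split, filter, annotate, deduplicate."""
        clips = [clip for video in raw_videos for clip in split_and_transcode(video)]
        clips = [c for c in clips if quality_score(c) >= quality_threshold]  # quality filtering
        for clip in clips:
            clip.caption = annotate_with_vlm(clip)  # vision language model labels objects and actions
            clip.embedding = embed_video(clip)      # embedding used for semantic deduplication
        return deduplicate(clips, similarity_threshold)  # drop near-duplicate clips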

Tokenization

Tokenization converts high-dimensional visual data into smaller units called tokens, facilitating machine learning processing. Tokenizers compress the redundant pixel data in images and videos into compact, semantic tokens, enabling efficient training of large-scale generative models and inference on limited resources. There are two main methods:

  • Discrete tokenization: Represents images and videos as sequences of integer token IDs.
  • Continuous tokenization: Represents images and videos as continuous latent vectors.

In both cases, tokenization enhances model learning speed and performance.
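
The sketch below contrasts the two methods, using a toy nearest-neighbor quantization step for the discrete case. The encoder and codebook stand in for a trained tokenizer and are illustrative assumptions.

    import torch

    def continuous_tokens(encoder, frames):
        """Continuous tokenization: the encoder's latent vectors are the tokens."""
        return encoder(frames)  # shape: (num_tokens, latent_dim)

    def discrete_tokens(encoder, codebook, frames):
        """Discrete tokenization: map each latent to the index of its nearest codebook entry."""
        latents = encoder(frames)                   # (num_tokens, latent_dim)
        distances = torch.cdist(latents, codebook)  # distance to every codebook vector
        return distances.argmin(dim=-1)             # integer token IDs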

Foundation Model Architecture

Foundation models are AI neural networks trained on vast unlabeled datasets to perform various generative tasks. Developers can train a model architecture from scratch or fine-tune a pretrained foundation model for downstream tasks using additional data.

World foundation models are built by training visual generative AI architectures on extensive visual datasets for physical AI. They can use two key architectures for video generation:

  • Diffusion model: Starts with random noise, gradually refining it to generate high-quality video. It excels in tasks like video generation and style transfer.
  • Autoregressive model: Generates video one frame at a time, predicting the next frame based on previous ones. It's ideal for predicting future frames or completing video sequences.
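
The autoregressive approach, for example, reduces to a simple loop in which each new frame is predicted from the frames generated so far. The model.predict_next call is a hypothetical placeholder for a trained next-frame predictor.

    def generate_video(model, context_frames, num_new_frames=16):
        """Autoregressive generation: predict each frame from all previous ones."""
        frames = list(context_frames)
        for _ in range(num_new_frames):
            next_frame = model.predict_next(frames)  # hypothetical: condition on the full history
            frames.append(next_frame)                # the new frame joins the context
        return frames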

To get started easily and streamline the end-to-end development process, developers can leverage training frameworks, which include libraries, SDKs, and tools for data preparation, model training, optimization, performance evaluation, and deployment.

How to Get Started With World Foundation Models

NVIDIA NeMo

NVIDIA NeMo™ enables end-to-end development of multimodal generative AI models, from data curation to deployment. NeMo Curator accelerates world model development with efficient, GPU-powered data processing for tasks such as downloading, extracting, cleaning, and filtering data. 

Cosmos Tokenizer

Cosmos Tokenizer enables efficient image and video tokenization to streamline world model development, with pretrained models and inference code readily available on GitHub.

NVIDIA Project GR00T

An active research initiative to accelerate humanoid robot development, NVIDIA Project GR00T is a collection of robotics foundation models, workflows, and simulation tools.