Synthetic Data Generation (SDG)

Accelerate development of physical and agentic AI workflows.

Workloads

Simulation/Modeling/Design
Robotics
Generative AI

Industries

All Industries

Business Goal

Innovation

Products

NVIDIA Omniverse Enterprise
NVIDIA AI
NVIDIA Isaac

Overview

Why Use Synthetic Data?

Training AI models requires carefully labeled, high-quality, diverse datasets to achieve the desired accuracy and performance. In many cases, data is limited, restricted, or unavailable. Collecting and labeling this real-world data is time-consuming and can be prohibitively expensive, slowing the development of many types of models, such as vision language models (VLMs) and large language models (LLMs).

Synthetic data—generated from a computer simulation, generative AI models, or a combination of the two—can help address this challenge. Synthetic data can consist of text, videos, and 2D or 3D images across both visual and non-visual spectra, which can be used in conjunction with real-world data to train multimodal physical AI models. This can save a significant amount of training time and greatly reduce costs.

AI Model Training Speed

Overcome the data gap and accelerate AI model development while reducing the overall cost of acquiring and labeling data required for model training.

Privacy and Security

Address privacy issues and reduce bias by generating diverse synthetic datasets to represent the real world.

Accuracy

Create highly accurate, generalized AI models by training with diverse data that includes rare but crucial corner cases that would otherwise be difficult or impossible to collect.

Scalability

Procedurally generate data with automated pipelines that scale with your use case across manufacturing, automotive, robotics, and more.

Synthetic Data for Physical AI Development

Physical AI models allow autonomous systems to perceive, understand, interact with, and navigate the physical world. Synthetic data is critical for training and testing physical AI models.

Foundation Model Training

World foundation models (WFMs) utilize diverse input data, including text, images, videos, and movement information, to generate and simulate virtual worlds with remarkable accuracy.   

WFMs are characterized by their exceptional generalization capabilities, requiring minimal fine-tuning for various applications. They serve as the cognitive engines for robots and autonomous vehicles, leveraging their comprehensive understanding of real-world dynamics. To achieve this level of sophistication, WFMs rely on vast amounts of training data. 

WFM development benefits significantly from generating infinite synthetic data through physically accurate simulations. This approach not only accelerates the model training process but also enhances the models' ability to generalize across diverse scenarios. Domain randomization techniques further augment this process by allowing for the manipulation of numerous parameters such as lighting, background, color, location, and environment—variations that would be nearly impossible to capture comprehensively from real-world data alone. 
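To make this concrete, here is a minimal domain-randomization sketch using the Python API of Omniverse Replicator (omni.replicator.core); the objects, parameter ranges, and frame count are illustrative placeholders rather than a production pipeline.

    import omni.replicator.core as rep

    # Minimal domain-randomization sketch; runs inside an Omniverse app
    # such as Isaac Sim. Objects and value ranges are placeholders.
    with rep.new_layer():
        camera = rep.create.camera(position=(0, 0, 1000))
        render_product = rep.create.render_product(camera, (1024, 1024))

        cube = rep.create.cube(semantics=[("class", "cube")], position=(0, 0, 100))
        sphere = rep.create.sphere(semantics=[("class", "sphere")], position=(200, 0, 100))

        with rep.trigger.on_frame(num_frames=100):
            with rep.create.group([cube, sphere]):
                # Re-sample pose, scale, and color on every frame.
                rep.modify.pose(
                    position=rep.distribution.uniform((-300, -300, 0), (300, 300, 300)),
                    scale=rep.distribution.uniform(0.5, 2.0),
                )
                rep.randomizer.color(colors=rep.distribution.uniform((0, 0, 0), (1, 1, 1)))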

Robot Policy Training

Robot learning is a collection of algorithms and methodologies that help a robot learn new skills, such as manipulation, locomotion, and classification, in either a simulated or real-world environment. Reinforcement learning, imitation learning, and diffusion policies are the key methodologies applied to train robots.

One important skill for robots is manipulation—picking things up, sorting them, and putting them together—like you see in factories. Real-world human demonstrations are typically used as inputs for training. However, collecting a large and diverse set of data is quite expensive. With a handful of human demonstrations, developers can generate synthetic motions in simulated environments, speeding up the robot training process.
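As a simplified illustration of the imitation-learning objective, the PyTorch sketch below fits a policy network to (observation, action) pairs via behavior cloning; the dimensions and random tensors are placeholders standing in for real recorded and synthetically augmented demonstrations.

    import torch
    import torch.nn as nn

    # Behavior cloning: regress demonstrated actions from observations.
    obs_dim, act_dim = 32, 7  # placeholder dimensions
    policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Stand-ins for demonstration data (a real pipeline would load
    # recorded teleoperation plus simulated augmentations).
    observations = torch.randn(1024, obs_dim)
    actions = torch.randn(1024, act_dim)

    for epoch in range(100):
        loss = nn.functional.mse_loss(policy(observations), actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()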

To achieve this, users can first employ GR00T-Teleop to collect a small set of human demonstrations using Apple Vision Pro (AVP). The recorded demonstrations are then used to generate a large set of synthetic motions using GR00T-Mimic. Next, they use GR00T-Gen, built on NVIDIA Omniverse™ and NVIDIA Cosmos™, for domain randomization and 3D-to-real augmentation to generate an exponentially large and diverse set of training data for imitation learning. 

Testing and Validation

Software-in-the-loop (SIL) is a crucial testing stage for AI-powered robots and autonomous vehicles, where the control software is tested in a simulated environment instead of on real hardware.

Synthetic data generated from simulation enables accurate modeling of real-world physics, including sensor inputs, actuator dynamics, and environmental interactions, and provides a way to capture rare scenarios that are too dangerous to collect in the real world. As a result, the robot software stack behaves in simulation as it would on the physical robot, allowing for thorough testing and validation without the need for physical hardware.

Mega is an Omniverse Blueprint for developing, testing, and optimizing physical AI and robot fleets at scale in a digital twin before deployment into real-world facilities.

These simulated robots carry out tasks by perceiving and reasoning about their environments, planning their next motions, and then taking actions that are simulated in the digital twin. Synthetic data from these simulations is fed back into the robot brains, which perceive the results and decide the next action. This cycle continues, with Mega precisely tracking the state and position of all the assets in the digital twin.
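The loop can be pictured with the pseudocode-style sketch below; the function and object names are hypothetical and do not correspond to an actual Mega API.

    # Hypothetical sketch of the perceive-reason-act cycle described above.
    def run_fleet_cycle(digital_twin, robot_brains, num_steps):
        for _ in range(num_steps):
            for robot in robot_brains:
                # 1. Synthetic sensor data is rendered from the digital twin.
                observation = digital_twin.render_sensors(robot.id)
                # 2. The robot brain perceives, reasons, and plans its next motion.
                action = robot.plan(observation)
                # 3. The action is simulated; the twin tracks the resulting state.
                digital_twin.apply_action(robot.id, action)
            digital_twin.step_physics()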

Synthetic Data for LLM and Agentic AI Development

Generative models can be used to bootstrap and augment synthetic data-generation processes. Text-to-3D models enable the creation of 3D assets for populating a simulation scene. Text-to-image generative AI models can also be used to modify and augment existing images, whether generated from simulations or collected in the real world, through procedural inpainting or outpainting.
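As one way to make the inpainting idea concrete, the sketch below uses the open-source Hugging Face diffusers library (shown purely for illustration, not as a specific NVIDIA workflow); the file paths and prompt are placeholders.

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    # Replace a masked region of a rendered or captured image with
    # generated content. Paths and prompt are placeholders.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("warehouse_render.png").convert("RGB")
    mask = Image.open("pallet_region_mask.png").convert("RGB")  # white = repaint

    augmented = pipe(
        prompt="a wooden pallet stacked with cardboard boxes, photorealistic",
        image=image,
        mask_image=mask,
    ).images[0]
    augmented.save("warehouse_augmented.png")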

Text-to-text generative AI models, such as Llama 3.1 405B and Nemotron-4 340B, can be used to generate synthetic data to build powerful LLMs for healthcare, finance, cybersecurity, retail, and telecom.

Llama 3.1 405B and Nemotron-4 340B provide open licenses, giving developers the rights to own and use the generated data in their academic and commercial applications.

How to Build a Generative AI-Enabled SDG Pipeline

Generative AI can greatly accelerate the process of generating physically accurate synthetic data at scale. Developers can get started using generative AI for SDG with a step-by-step reference workflow.


Technical Implementation

Generating Synthetic Data

For Physical AI

  • Scene Creation: A comprehensive 3D scene serves as the foundation, incorporating essential assets like shelves, boxes, and pallets for warehouses or trees, roads, and buildings for outdoor environments. Environments can be dynamically enhanced using NVIDIA NIM™ microservices for Universal Scene Description (OpenUSD), enabling the seamless addition of diverse objects and the integration of 360° HDRI backgrounds.
  • Domain Randomization: Developers can leverage USD Code NIM, a cutting-edge LLM specialized in OpenUSD, to perform domain randomization. This powerful tool not only answers OpenUSD-related queries but also generates USD Python code to make changes in the scene, streamlining the process of programmatically altering various scene parameters within NVIDIA Omniverse.
  • Data Generation: The third step involves exporting the initial set of annotated images. Omniverse offers a wide array of built-in annotators, including 2D bounding boxes, semantic segmentation, depth maps, surface normals, and numerous others. The choice of output format, such as bounding boxes or animations, depends on the specific model requirements or use case (see the sketch after this list).
  • Data Augmentation: In the final stage, developers can leverage NVIDIA Cosmos WFMs to augment the rendered images from 3D to real, bringing the necessary photorealism to the generated images through simple user prompts.
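Putting the scene setup, randomization, and data-generation steps together, a minimal Replicator script might look like the sketch below; the annotator flags follow Replicator's documented BasicWriter options, while the prop and output directory are placeholders.

    import omni.replicator.core as rep

    with rep.new_layer():
        camera = rep.create.camera(position=(0, 0, 1000))
        render_product = rep.create.render_product(camera, (1024, 1024))

        # Placeholder prop; in practice, reference your warehouse or outdoor assets.
        cone = rep.create.cone(semantics=[("class", "cone")])

        with rep.trigger.on_frame(num_frames=50):
            with rep.create.group([cone]):
                rep.modify.pose(
                    position=rep.distribution.uniform((-200, -200, 0), (200, 200, 0))
                )

        # Export annotated frames: RGB, tight 2D boxes, segmentation, and depth.
        writer = rep.WriterRegistry.get("BasicWriter")
        writer.initialize(
            output_dir="_sdg_output",
            rgb=True,
            bounding_box_2d_tight=True,
            semantic_segmentation=True,
            distance_to_camera=True,
        )
        writer.attach([render_product])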

For LLMs and Agentic AI

  • Access Models: Download the Nemotron-4 340B family of open models from the NVIDIA NGC™ catalog or Hugging Face. You can also access it via build.nvidia.com as an NVIDIA NIM microservice.
  • Domain-Specific Data Generation: Prompt the open-source Nemotron-4-340B-Instruct model to generate your custom text-based, domain-diverse, synthetic dataset mimicking real-world characteristics.
  • Evaluate and Filter: Apply the Nemotron-4-340B-Reward model to grade the generated responses based on helpfulness, correctness, coherence, complexity, and verbosity.
  • Leverage High-Quality, Relevant Synthetic Datasets: Refine the synthetic data by iteratively improving it based on the reward model's feedback, ensuring accuracy and relevance (see the sketch after this list).
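A minimal sketch of the generation and scoring steps, assuming the OpenAI-compatible NIM endpoints on build.nvidia.com and NVIDIA's published pattern of returning the five attribute scores through the logprobs field; verify the model names and response schema against the current NIM documentation.

    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="YOUR_NVIDIA_API_KEY",  # placeholder
    )

    # Step 2: prompt the instruct model for a domain-specific synthetic sample.
    prompt = "Write a realistic customer support chat about a declined card."
    gen = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    sample = gen.choices[0].message.content

    # Step 3: grade the sample with the reward model; attribute scores
    # (helpfulness, correctness, coherence, complexity, verbosity) are
    # read from the logprobs of the response.
    scored = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-reward",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": sample},
        ],
    )
    scores = {t.token: t.logprob for t in scored.choices[0].logprobs.content}
    print(scores)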

Partner Ecosystem

Synthetic Data Partner Ecosystem

See how our ecosystem partners are developing their own synthetic data applications and services based on NVIDIA technologies.

Synthetic Data Companies

Digitalstates
EDGE IMPULSE
FSSTUDIO
MetAI
RENDERED.AI
roboflow
THEORY STUDIO

Service Delivery Partners

DATA MONSTERS
Deloitte
softserve

Get Started

Build your own SDG pipeline for robotics simulations, industrial inspection, and other physical AI use cases with NVIDIA Isaac Sim.
