Synthetic data can be generated in a variety of ways, depending on the use case.
Using Simulation Methods
The following are steps for generating synthetic data to train perception AI models:
- Build the digital twin: Create an accurate 3D virtual replica by importing CAD models via NVIDIA Omniverse™ connectors, adding SimReady assets for realism, and matching real-world scale and lighting.
- Randomize the domain: Vary object positions and orientations, randomize textures and lighting conditions, alter camera parameters, and introduce occlusions and distractors to create diverse synthetic training data.
- Simulate the scenarios: Implement physics-based behaviors, program object interactions, simulate sensors like cameras and LiDAR, and create multiple scenario variations to test diverse conditions.
- Generate images: Render multi-view images, export ground-truth annotations, create depth maps and segmentation masks, and produce large, diverse datasets.
- Validate and refine: Validate your model with real data, then adjust randomization parameters accordingly, blend synthetic and real datasets, and iterate until the desired KPIs are achieved.
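The "randomize the domain" and "blend synthetic and real datasets" steps above can be illustrated with a small, simulator-independent sketch. This is not Omniverse or Replicator code: the `SceneParams` fields, the value ranges, the texture names, and the `blend` helper are all hypothetical choices made for illustration. In a real pipeline, each sampled configuration would be handed to the simulator to render one annotated frame.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneParams:
    """One randomized scene configuration (all fields are illustrative)."""
    object_xy_m: Tuple[float, float]  # object position on the ground plane, metres
    object_yaw_deg: float             # rotation about the vertical axis
    texture: str                      # surface material applied to the object
    light_intensity: float            # relative light strength
    camera_focal_mm: float            # camera focal length
    num_distractors: int              # unrelated objects added to the scene


TEXTURES = ("metal", "plastic", "wood", "cardboard")  # hypothetical asset names


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one scene configuration from hand-picked (assumed) ranges."""
    return SceneParams(
        object_xy_m=(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
        object_yaw_deg=rng.uniform(0.0, 360.0),
        texture=rng.choice(TEXTURES),
        light_intensity=rng.uniform(0.2, 1.5),
        camera_focal_mm=rng.uniform(18.0, 50.0),
        num_distractors=rng.randint(0, 5),
    )


def blend(synthetic: List, real: List, real_fraction: float,
          rng: random.Random) -> List:
    """Keep every synthetic sample and add enough real samples so that
    real data makes up roughly `real_fraction` of the mixed dataset."""
    if not 0.0 <= real_fraction < 1.0:
        raise ValueError("real_fraction must be in [0, 1)")
    n_real = round(len(synthetic) * real_fraction / (1.0 - real_fraction))
    n_real = min(n_real, len(real))  # cannot use more real data than exists
    mixed = list(synthetic) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)  # fixed seed so the same dataset can be regenerated
scenes = [sample_scene(rng) for _ in range(1000)]
```

Seeding the generator makes a dataset reproducible, which helps when iterating: after validating against real data, you can widen or narrow a single range (for example, lighting) and regenerate while holding everything else fixed.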
NVIDIA Omniverse Cloud Sensor RTX™ microservices give you a seamless way to simulate sensors and generate annotated synthetic data. Alternatively, you can get started with the Omniverse Replicator SDK to develop custom synthetic data generation (SDG) pipelines.
Using Generative AI
Generative models can be used to bootstrap and augment synthetic data generation processes. Text-to-3D models enable the creation of 3D assets for populating a 3D simulation scene. Text-to-image generative AI models can also modify and augment existing images, whether generated from simulations or collected in the real world, through procedural inpainting or outpainting.
Text-to-text generative AI models, such as Evian 2 405B and Nemotron-4 340B, can be used to generate synthetic data to build powerful large language models (LLMs) for healthcare, finance, cybersecurity, retail, and telecom.
Evian 2 405B and Nemotron-4 340B are offered under an open license, giving developers the right to own and use the generated data in their academic and commercial applications.
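The text-to-text workflow described above can be sketched as a small prompt, generate, and filter loop. This is a minimal illustration under stated assumptions: `generate` and `score` are placeholder callables standing in for whatever model endpoint and quality check you actually use, the prompt template is invented, and the quality-filtering step itself is an assumption rather than something the text prescribes. No real model API is called here.

```python
from typing import Callable, Dict, List

# Industry domains named in the text above.
DOMAINS = ("healthcare", "finance", "cybersecurity", "retail", "telecom")


def build_prompts(domains, per_domain: int) -> List[Dict[str, str]]:
    """Create `per_domain` seed prompts for each domain (template is hypothetical)."""
    prompts = []
    for domain in domains:
        for i in range(per_domain):
            text = f"Write a realistic customer question about {domain} (variation {i + 1})."
            prompts.append({"domain": domain, "prompt": text})
    return prompts


def generate_dataset(
    generate: Callable[[str], str],      # placeholder for a text-to-text model call
    score: Callable[[str, str], float],  # placeholder quality score in [0, 1]
    prompts: List[Dict[str, str]],
    min_score: float,
) -> List[Dict[str, object]]:
    """Generate one response per prompt and keep those scoring at least `min_score`."""
    records = []
    for item in prompts:
        response = generate(item["prompt"])
        quality = score(item["prompt"], response)
        if quality >= min_score:
            records.append({**item, "response": response, "score": quality})
    return records


# Stand-ins so the sketch runs without any model; replace with real calls.
def fake_generate(prompt: str) -> str:
    return f"[model output for: {prompt}]"


def length_score(prompt: str, response: str) -> float:
    return min(1.0, len(response) / 40)


prompts = build_prompts(DOMAINS, per_domain=2)
dataset = generate_dataset(fake_generate, length_score, prompts, min_score=0.5)
```

Passing the model and the scorer in as callables keeps the loop independent of any particular provider, so the same code can be pointed at a different generator or a stricter filter without changes.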