Synthetic data is artificially generated data used to accelerate AI model training across many domains, such as robotics and autonomous vehicles.
Synthetic data generation (SDG) is the creation of text, 2D or 3D images, and videos across the visual and non-visual spectrum using computer simulations, generative AI models, or a combination of the two. This technique can be used for structured and unstructured data and is often applied to fields where original data is scarce, sensitive, or difficult to collect.
Building artificial intelligence models that deliver accuracy and performance depends on high-quality, diverse, carefully labeled datasets. However, real-world data is often limited, unrepresentative of the target population, or unavailable due to data protection standards. Compounding these limitations, acquiring and labeling original data is a time-consuming, costly process that can delay the progress of AI development.
Synthetic data addresses these challenges with artificially generated data created based on rules, algorithms, or simulations that mimic the statistical properties of real data. Developers and researchers can use this synthetic data to conduct robust testing and training of models without the constraints or privacy concerns associated with using actual data.
Synthetic data generation addresses core data science challenges, improving the training of machine learning (ML) models and streamlining AI development.
Generative AI can be used to accelerate synthetic data generation, streamlining the process of creating and iterating on the virtual scenes from which data is extracted.
Diffusion models can generate high-quality visual content from text or image descriptions. By learning the relationships between images and the text used to describe them, diffusion models can programmatically change image parameters like layout, asset placement, color, object size, and lighting conditions.
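As a minimal illustration of this programmatic variation (the article names no specific toolkit, so the Hugging Face diffusers library and the model checkpoint below are assumptions), a text-to-image diffusion pipeline can be driven by templated prompts that sweep scene parameters:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion pipeline; the checkpoint
# is an illustrative choice, not one named in the article.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Sweep prompt parameters programmatically to diversify the dataset.
lighting = ["bright daylight", "dim warehouse lighting", "overcast sky"]
placements = ["centered on a conveyor belt", "partially occluded by a pallet"]

images = []
for light in lighting:
    for placement in placements:
        prompt = f"a cardboard box {placement}, {light}, photorealistic"
        images.append(pipe(prompt).images[0])  # returns a PIL.Image
```

Varying only the prompt template in this way yields a combinatorial spread of lighting and placement conditions from a single pipeline.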
World foundation models can generate hyper-realistic, physically accurate visual data as well. Fine-tuning world foundation models for domain-specific settings allows developers to generate simulations as videos that are highly adaptive to complex systems and environments like a factory floor.
Neural network architectures that support SDG include generative adversarial networks (GANs) and variational autoencoders (VAEs). GANs generate data through a competitive process between two neural networks, one of which generates data samples while the other evaluates them against real data. VAEs instead learn a compressed latent representation of the training data and sample from it to produce new examples. A minimal GAN sketch follows.
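The sketch below shows that adversarial loop in PyTorch, with toy dimensions and a shifted Gaussian standing in for the real dataset (all sizes and names here are illustrative):

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 16, 8  # toy sizes, purely illustrative

# Generator maps random noise to synthetic samples.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 64), nn.ReLU(),
    nn.Linear(64, DATA_DIM),
)
# Discriminator scores samples as real or fake (one logit).
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for a real dataset: a shifted Gaussian.
    return torch.randn(n, DATA_DIM) + 2.0

for step in range(1000):
    # Discriminator step: push real samples toward 1, fakes toward 0.
    real = real_batch()
    fake = generator(torch.randn(real.size(0), LATENT_DIM)).detach()
    loss_d = (bce(discriminator(real), torch.ones(real.size(0), 1))
              + bce(discriminator(fake), torch.zeros(fake.size(0), 1)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    fake = generator(torch.randn(64, LATENT_DIM))
    loss_g = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

# After training, synthetic samples come straight from the generator.
with torch.no_grad():
    synthetic = generator(torch.randn(100, LATENT_DIM))
```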
Transformers, a type of deep learning model, are also capable of generating synthetic data. By learning complex patterns and dependencies in datasets, transformers generate entirely new data that corresponds to the existing training data. For example, in natural language processing, transformers can be used to create new textual content that mimics the style and context of a given body of text. Transformers can mimic tabular data by treating each row and column in the dataset as a sequence, learning the relationships and patterns, and generating new data that maintains the characteristics of the original dataset.
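As an illustrative sketch (the model choice and the row-serialization scheme below are assumptions, not methods from the article), a pretrained causal language model can be sampled for new text, and tabular rows can be serialized into the same kind of sequence a transformer consumes:

```python
from transformers import pipeline

# Sample new text from a pretrained causal language model
# (GPT-2 here purely as an illustrative choice).
generator = pipeline("text-generation", model="gpt2")
samples = generator(
    "Customer support transcript:",
    max_new_tokens=60,
    num_return_sequences=3,
    do_sample=True,
)

# Tabular data can be treated the same way by serializing each row
# into a delimited token sequence for the model to learn and continue.
row = {"age": 42, "income": 58000, "region": "NW"}
serialized = ", ".join(f"{k} is {v}" for k, v in row.items())
# -> "age is 42, income is 58000, region is NW"
```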
From asset creation to code generation, generative AI helps to create synthetic datasets that can be used to enhance training datasets for models in different scenarios.
Synthetic data is powering simulation-based use cases across physical AI and industrial AI.
Synthetic data is critical for training physical AI models that power humanoid robots, autonomous mobile robots (AMRs) for use in spaces such as warehouses, and industrial manipulators found in distribution centers, manufacturing sites, and other industrial spaces. Synthetic data generation is used to create and annotate data from 3D simulations, enhancing the datasets used to train perception AI models that allow robots to accurately detect objects, avoid obstacles, and safely interact with their environment.
Synthetic data can also be used to train robot policy models that require diverse data to perform various tasks such as locomotion and manipulation.
In the automotive sector, synthetic data is needed to train the perception, planning, and prediction models that power self-driving cars. Since manually collecting and labeling vast amounts of data to account for every possible traffic scenario is prohibitively expensive and time-consuming, data generated from deep learning approaches can be used to augment data collected from sensors like LiDAR, cameras, and radar. With a richer dataset, developers can optimize and validate the vehicle’s AI.
Computer vision algorithms for fixed cameras can detect, classify, and track objects to help improve safety in public spaces or industrial sites, enable automated checkout in stores, and flag product defects on assembly lines. However, collecting a large and diverse dataset of images to train accurate computer vision and automated optical inspection algorithms is a major challenge. Synthetic image data lets developers quickly create diverse training datasets by varying parameters such as scene angle, location, and lighting, streamlining the development of inspection algorithms for various use cases, as in the sketch below.
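The following is an illustrative sketch of that parameter-variation idea, often called domain randomization; the renderer call at the end is a hypothetical stand-in for a simulator API, not a real function:

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    camera_angle_deg: float
    light_intensity: float
    object_x: float
    object_y: float
    texture_id: int

def sample_scene() -> SceneParams:
    # Randomize pose, lighting, and textures so a model trained on the
    # rendered images generalizes beyond any single camera setup.
    return SceneParams(
        camera_angle_deg=random.uniform(-30.0, 30.0),
        light_intensity=random.uniform(0.3, 1.5),
        object_x=random.uniform(-1.0, 1.0),
        object_y=random.uniform(-1.0, 1.0),
        texture_id=random.randrange(50),
    )

scene_configs = [sample_scene() for _ in range(10_000)]
# Each config would be passed to a renderer that returns an image plus
# ground-truth labels, e.g.:
# image, labels = render_scene(params)   # render_scene is hypothetical
```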
Synthetic data is powering AI across fields and use cases.
Synthetic text generation has multiple applications, from training cybersecurity models to identify phishing emails to generating privacy-preserving medical records. For example, in the healthcare industry, data is often fragmented, siloed, and privacy protected, stifling technology innovation that relies on access to high-quality data. To overcome this barrier, AI can be used to generate synthetic medical datasets that accurately capture the statistical properties of real medical records while preserving the privacy of sensitive data. These datasets can be used without restriction, unlocking opportunities for medical software development across a variety of use cases.
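One possible approach for record-style data is a tabular synthesizer; the sketch below uses the open-source SDV library (an assumption; the article does not name a tool) with a toy DataFrame standing in for sensitive records:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Toy stand-in for sensitive records; a real pipeline would load
# de-identified source data instead.
records = pd.DataFrame({
    "age": [34, 58, 47, 29, 71],
    "systolic_bp": [118, 141, 133, 110, 150],
    "diagnosis": ["A", "B", "A", "C", "B"],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(records)

# Fit a copula-based model of the joint distribution, then sample new
# rows that follow the same statistics without copying any real row.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(records)
synthetic_records = synthesizer.sample(num_rows=1000)
print(synthetic_records.head())
```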
For all of the above use cases, developers can benefit from building synthetic data vaults to store, organize, and catalog annotated data for future model training and AI projects.
NVIDIA offers a suite of technologies that help developers build SDG pipelines for use cases across industries.
The NVIDIA Omniverse™ platform provides APIs, SDKs, and services that enable developers to build or integrate Universal Scene Description (OpenUSD)-based workflows, along with Omniverse Cloud Sensor RTX, into existing software tools and simulation pipelines for SDG, providing the advanced ray-tracing capabilities necessary for creating photorealistic simulations.
OpenUSD is an open-source file format and extensible framework that serves as a common language for describing complex 3D scenes and workflows across the different software applications connected to NVIDIA Omniverse.
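As a small taste of the format (using the open-source usd-core Python bindings, which the article does not mention), a scene can be authored programmatically and saved as human-readable .usda:

```python
from pxr import Usd, UsdGeom

# Author a minimal OpenUSD stage: a world root with one cube prim.
stage = Usd.Stage.CreateNew("warehouse_scene.usda")
UsdGeom.Xform.Define(stage, "/World")
crate = UsdGeom.Cube.Define(stage, "/World/Crate")
crate.GetSizeAttr().Set(1.5)

# The saved .usda layer can be opened, referenced, and composed by
# any USD-aware application.
stage.GetRootLayer().Save()
```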
Omniverse can be used in conjunction with NVIDIA Cosmos™ world foundation models to augment 3D images or animations with the necessary photorealism, further reducing the gap between simulated and real training data.
With these tools, developers can create high-quality synthetic datasets to power a new generation of AI solutions.
For text-based synthetic data generation, NVIDIA Nemotron-4 340B provides a family of models that developers can use to generate synthetic data for training LLMs. Trained with NVIDIA NeMo and optimized with NVIDIA TensorRT-LLM, the models are available through a uniquely permissive open model license.
Nemotron-4 340B can be experienced and downloaded from the NVIDIA API catalog. Developers can use DGX Cloud to easily fine-tune AI models. More details are available in research papers on the model and the dataset.
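As a sketch of what calling the instruct variant might look like through the API catalog's OpenAI-compatible endpoint (the endpoint URL and model identifier below follow the catalog's published pattern and should be verified on build.nvidia.com):

```python
import os
from openai import OpenAI

# Assumed endpoint and model id for the NVIDIA API catalog; both
# should be confirmed against the catalog before use.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",
    messages=[{
        "role": "user",
        "content": "Write five diverse customer-support questions "
                   "about a delayed package, one per line.",
    }],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```

Responses like these can then be filtered and scored (for example, with the family's reward model) before being folded into an LLM training set.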