Synthetic data can be generated in a variety of ways, depending on the use case.
Using Simulation Methods
If you’re training a computer vision model for a warehouse robot, you’ll need a physically accurate virtual scene with objects such as pallet jacks and storage racks. If you’re training a model for visual inspection on an assembly line, the scene instead needs a conveyor belt and the product being inspected.
One of the key challenges in developing synthetic data pipelines is closing the sim-to-real gap. Domain randomization helps close that gap by randomly varying aspects of the scene, such as object positions, textures, and lighting, so models trained on the synthetic images generalize better to real-world conditions.
NVIDIA Omniverse™ Cloud Sensor RTX microservices give you a seamless way to simulate sensors and generate annotated synthetic data. Alternatively, you can get started with the Omniverse Replicator SDK for developing custom SDG pipelines, as in the sketch below.
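Here's a minimal Replicator-style sketch of a domain randomization pipeline, assuming it runs inside an Omniverse environment with the Replicator extension enabled. The scene contents, frame count, and output directory are illustrative, and exact API details can vary across Replicator versions.

```python
import omni.replicator.core as rep

with rep.new_layer():
    # Build a simple warehouse-like scene: a ground plane and a few box props
    rep.create.plane(scale=10)
    boxes = rep.create.cube(count=5, semantics=[("class", "box")])

    # A camera and a render product define what gets captured each frame
    camera = rep.create.camera(position=(0, 5, 10), look_at=(0, 0, 0))
    render_product = rep.create.render_product(camera, (1024, 768))

    # Register a lighting randomizer: sphere lights with varied placement and intensity
    def sphere_lights(num):
        lights = rep.create.light(
            light_type="Sphere",
            intensity=rep.distribution.normal(35000, 5000),
            position=rep.distribution.uniform((-5, 4, -5), (5, 8, 5)),
            count=num,
        )
        return lights.node

    rep.randomizer.register(sphere_lights)

    # Domain randomization: re-sample object poses and lighting on every frame
    with rep.trigger.on_frame(num_frames=100):
        rep.randomizer.sphere_lights(3)
        with boxes:
            rep.modify.pose(
                position=rep.distribution.uniform((-3, 0, -3), (3, 0, 3)),
                rotation=rep.distribution.uniform((0, -180, 0), (0, 180, 0)),
            )

    # Write RGB images plus 2D bounding-box annotations for each frame
    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(output_dir="_out_sdg", rgb=True, bounding_box_2d_tight=True)
    writer.attach([render_product])

rep.orchestrator.run()
```

Each captured frame pairs a differently randomized rendering with its annotations, giving downstream training the visual variety that helps the model transfer to real images.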
Using Generative AI
Generative models can be used to bootstrap and augment synthetic data-generation processes. Text-to-3D models enable the creation of 3D assets for populating a simulation scene. Text-to-image generative AI models can also modify and augment existing images, whether generated from simulations or collected in the real world, using techniques such as inpainting and outpainting.
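As a sketch of the image-augmentation idea, the snippet below uses the open-source Hugging Face diffusers library to inpaint a masked region of an existing image. The model checkpoint, file names, and prompt are illustrative placeholders; any text-to-image inpainting model could stand in.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Load an off-the-shelf inpainting model (checkpoint name is illustrative)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# The source image and a mask whose white pixels mark the region to repaint
image = Image.open("warehouse_scene.png").convert("RGB").resize((512, 512))
mask = Image.open("pallet_mask.png").convert("L").resize((512, 512))

# Replace the masked region with newly generated content to diversify the dataset
result = pipe(
    prompt="a wooden pallet stacked with cardboard boxes, industrial lighting",
    image=image,
    mask_image=mask,
).images[0]
result.save("augmented_scene.png")
```

Varying the prompt across runs yields many distinct variants of the same base scene, which is useful when real images of rare conditions are scarce.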
Text-to-text generative AI models such as Evian 2 405B and Nemotron-4 340B can be used to generate synthetic data for building powerful LLMs in domains such as healthcare, finance, cybersecurity, retail, and telecom.
Evian 2 405B and Nemotron-4 340B are released under an open license that gives developers the right to own and use the generated data in their academic and commercial applications.
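As a rough illustration of that workflow, the sketch below prompts an instruction-tuned LLM through an OpenAI-compatible chat endpoint to produce domain-specific training pairs. The endpoint URL, model ID, and API key are placeholders, not references to any specific hosted service.

```python
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the values for whichever
# hosted deployment of your chosen model you use
client = OpenAI(base_url="https://your-llm-endpoint/v1", api_key="YOUR_API_KEY")

prompt = (
    "Generate 5 question-and-answer pairs a retail-banking customer might ask "
    "a support assistant. Return them as a JSON list of objects with "
    "'question' and 'answer' fields."
)

response = client.chat.completions.create(
    model="your-model-id",  # placeholder for the instruct model you deploy
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,  # a higher temperature encourages more varied samples
)
print(response.choices[0].message.content)
```

Generated examples are typically filtered or scored (for instance, by a reward model) before being used to fine-tune a downstream LLM.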