
Scaling Action Recognition Models with Synthetic Data

Action recognition models such as PoseClassificationNet have been around for some time, helping systems identify and classify human actions like walking, waving, or picking up objects. While the concept is well established, the challenge lies in building a robust computer vision model that can accurately recognize a range of actions across domain- or use case-specific scenarios.

One of the key hurdles is acquiring enough training data, adding the classes needed for a unique use case, and training such a model effectively. Instead of relying solely on real-world data, which can be time-consuming and expensive to collect, synthetic data generation (SDG) is quickly becoming an effective and practical solution.

SDG is the process of creating artificial data from physically accurate 3D simulations that mimic real-world data. The model training process is iterative and often requires additional data to cover specific scenes, add new classes, and increase scene diversity, so SDG helps the model evolve efficiently.

This post discusses the steps used to create synthetic data using NVIDIA Isaac Sim, a reference application built on NVIDIA Omniverse for simulating and validating robots, for multiple domains: Retail, Sports, Warehouse, and Hospital. 

We customized the PoseClassificationNet action recognition model with NVIDIA TAO fine-tuning capabilities and tested it on real data. The outlined steps can be applied to create synthetic data for many other domains and use cases.

Creating a human action recognition video dataset with Isaac Sim

To begin using Isaac Sim, start with the Isaac Sim Hello World video. To create the action recognition model, you need animations of actions such as picking up an apple. From these animations, you extract key points, which are the input to the action recognition model. You can obtain action animations from any third-party vendor or create them from real videos.

A diagram shows starting from action animation to NVIDIA Isaac Sim, preprocess animation USD, set ORA config options, and set advanced options such as custom code. Before deployment, Isaac Sim offers SetCameraAngle, Scene Generation, and Set Data Config.
Figure 1. Data simulation pipeline for action recognition 

Omni.Replicator.Agent (ORA), an Isaac Sim extension, is designed to generate synthetic data on human characters and robots across a variety of 3D environments. The ORA extension offers the following features:

  • Multi-camera consistency
  • Multi-sensor logging
  • Custom DataWriter support (skeletal data, 2D position, and segmentation)
  • Position and orientation randomization for characters, agents, and objects
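
If you drive Isaac Sim from a script, the extension can also be enabled programmatically before data generation starts. The following is a minimal sketch; the extension ID matches the omni.replicator.agent.core package referenced later in this post, though your installed version may differ.

from omni.isaac.core.utils.extensions import enable_extension

# Enable the ORA core extension inside a running Isaac Sim session
# (extension ID taken from the package path referenced later in this post)
enable_extension("omni.replicator.agent.core")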

Select SimReady assets and environments in Omniverse

Isaac Sim has more than a thousand SimReady assets that you can use in your 3D simulations. Environments such as hospital scenes, warehouse digital twins, and retail stores are among those already designed, with a selection of over 10K usable assets. To display the available selection, choose Windows > Browsers > Assets.

You can also create your customized assets and environments in Omniverse. For more information, see Environmental setup.

A picture shows a sample warehouse Omniverse environment with warehouse floor, aisles, boxes, and forklift.
Figure 2. Omniverse environment example

Figure 3 shows some SimReady assets.

A picture shows an array of boxes, pallets, barrels, fuel cans, and machinery.
Figure 3. SimReady asset examples available in Isaac Sim

Set the ORA Core extension configuration file

Every job uses a config file that specifies the paths to the scene and to the character assets that must be added to it. You can add more properties, which become accessible when the agent core SDG extension loads the job information.

omni.replicator.agent:
  character:
    asset_path: http://omniverse-content-production.s3-us-west-2.amazonaws.com/Assets/Isaac/4.2/Isaac/People/Characters/
    num: 1
  global:
    camera_num: 4
    seed: 1777061627
  replicator:
    parameters:
      bbox: true
      output_dir: /media/scratch.metropolis2/sdg_data_action_recognition/sdg_warehouse/warehouse_aisle_walking_f_0
      rgb: true
      video: true
    writer: ActionWriter
  scene:
    asset_path: http://omniverse-content-production.s3-us-west-2.amazonaws.com/Assets/Isaac/4.2/Isaac/Environments/Simple_Warehouse/full_warehouse.usd
  version: 0.1.0

You can specify character assets, generation settings, output configurations, and scene environments in the replicator configuration, with options for custom data recording and various output modes.
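
Because each job is driven entirely by this file, it can help to sanity-check a config before launching a run. The following is a minimal sketch (requires PyYAML) that assumes the YAML above is saved locally as warehouse_job.yaml, a hypothetical filename, and only verifies the fields used in this post.

import yaml  # PyYAML

# Hypothetical local copy of the ORA config shown above
with open("warehouse_job.yaml") as f:
    config = yaml.safe_load(f)["omni.replicator.agent"]

# Check the fields this post relies on before submitting the job
assert config["character"]["num"] >= 1, "at least one character is required"
assert config["global"]["camera_num"] >= 1, "at least one camera is required"
print("scene:", config["scene"]["asset_path"])
print("writer:", config["replicator"]["writer"])
print("output:", config["replicator"]["parameters"]["output_dir"])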

Configure and place cameras

Use the extension's multi-camera consistency feature by setting the camera count (the camera_num property in the configuration file) to the desired number of views and manually placing the cameras in the scene.
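
If you prefer to script camera placement, the sketch below uses the standard USD Python API to define camera prims on a ring around the scene origin; the prim paths, radius, and height are illustrative assumptions rather than ORA requirements, and the cameras still need to be aimed at the subject.

import math
import omni.usd
from pxr import Gf, UsdGeom

stage = omni.usd.get_context().get_stage()
radius, height, num_cams = 4.0, 2.0, 4   # illustrative values, not ORA defaults

# Place cameras evenly on a ring; aim them afterward in the viewport
# (or via SetRotate), since the exact Euler angles depend on the stage
# up-axis and your scene layout.
for i in range(num_cams):
    angle = 2.0 * math.pi * i / num_cams
    cam = UsdGeom.Camera.Define(stage, f"/World/Cameras/Camera_{i:02d}")
    UsdGeom.XformCommonAPI(cam.GetPrim()).SetTranslate(
        Gf.Vec3d(radius * math.cos(angle), radius * math.sin(angle), height)
    )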

Customize the ORA extension (advanced)

The source code for the ORA extension is provided as shown in the following directory tree, found under /isaac-sim-4.0.0-rc.20/extscache/omni.replicator.agent.core-0.2.3.

|-- data_generation.py
|-- randomization
|   |-- camera_randomizer.py
|   |-- character_randomizer.py
|-- simulation.py

The ORA extension is written in Python. Here is a breakdown of the main modifiable files and folders in the ORA extension:

  • simulation.py: Contains code (SimulationManager class) to open the context stage and refresh the job scene to start different counts of jobs. 
  • data_generation.py: Loads a config file when prompted by SimulationManager and starts recording simulation data asynchronously. 
  • /randomization: Folder where camera and character spawning properties (rotation and position ranges) can be modified.
  • /writers: Location to add custom writers that record different types of data (segmentation maps, custom skeleton data, and so on) and store it in the /output_dir folder; a minimal writer sketch follows this list.
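
For reference, custom writers in Omniverse Replicator generally subclass Writer, attach annotators, and implement write(). The following is a generic, minimal sketch of that pattern, not the ActionWriter shipped with ORA; the annotator choice, the output file layout, and the BackendDispatch helpers used here are assumptions that may vary by Replicator version.

import json
import omni.replicator.core as rep
from omni.replicator.core import AnnotatorRegistry, BackendDispatch, Writer

class MinimalActionWriter(Writer):
    """Writes RGB frames plus 2D bounding boxes per frame (illustrative only)."""

    def __init__(self, output_dir):
        self._frame_id = 0
        self.backend = BackendDispatch({"paths": {"out_dir": output_dir}})
        self.annotators = [
            AnnotatorRegistry.get_annotator("rgb"),
            AnnotatorRegistry.get_annotator("bounding_box_2d_tight"),
        ]

    def write(self, data):
        # The layout of annotator output may differ between Replicator versions
        self.backend.write_image(f"rgb_{self._frame_id:05d}.png", data["rgb"])
        boxes = data["bounding_box_2d_tight"]["data"].tolist()
        self.backend.write_blob(
            f"bbox_{self._frame_id:05d}.json", json.dumps(boxes).encode("utf-8")
        )
        self._frame_id += 1

rep.WriterRegistry.register(MinimalActionWriter)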

The refresh_auto_job_anim function in simulation.py contains the callbacks to data generation functions for initiating simulations. 

An example of programming a new action for a subject is specifying that the character walks in one run and sits in the next. For this, you can add custom logic to refresh_auto_job_anim that rewrites the relationships between the animation and character primitive (Prim) USD objects before every re-render of a stage. For more information, see Prim.

The following code shows an example implementation, where a new character is spawned and a new action is programmed into their animation sequence on every refresh of the scene. 

<simulation.py>
def refresh_auto_job_anim(self, num):
    # ...
    stage = omni.usd.get_context().get_stage()

    # Get the animation graph for the current character in the scene
    anim_prims = stage.GetPrimAtPath("/World/Characters/Biped_Setup/Animations").GetAllChildren()

    # Get the new animation to attach to the character in the scene
    # (pick_index_list is defined in the elided portion of this method)
    curr_paths = []
    for i in pick_index_list:
        curr_paths.append(anim_prims[i])
    if self.yaml_data["character"]["animation_name"] == '':
        pick_prim = anim_prims[self.yaml_data["character"]["animation_num"]]
        pick_name = str(pick_prim.GetPrimPath()).split('/')[-1]
        new_anim_graph_node = "/World/Characters/Biped_Setup/AnimationGraph/" + pick_name
    else:
        pick_prim = None
        for prim in anim_prims:
            if self.yaml_data["character"]["animation_name"] in str(prim.GetPrimPath()).split('/')[-1]:
                pick_prim = prim
        new_anim_graph_node = "/World/Characters/Biped_Setup/AnimationGraph/" + str(pick_prim.GetPrimPath()).split('/')[-1]

    # Attach the new animation Prim to the character in the scene
    omni.kit.commands.execute("CreatePrimCommand",
                              prim_type="AnimationClip",
                              prim_path=new_anim_graph_node,
                              select_new_prim=True)
    omni.kit.commands.execute("AnimGraphUISetNodePositionCommand",
                              prim=stage.GetPrimAtPath(new_anim_graph_node),
                              position_attribute_name="ui:position",
                              value=(-331, 57))
    omni.kit.commands.execute("AnimGraphUISetRelationshipTargetsCommand",
                              relationship=stage.GetPrimAtPath(new_anim_graph_node).GetRelationship("inputs:animationSource"),
                              targets=[pick_prim.GetPrimPath()])
    omni.kit.commands.execute("AddRelationshipTargetCommand",
                              # ...

You can add custom functions to the pipeline to aid with randomization. For instance, you can retrieve and decode the USD format of animations on Reallusion to quickly determine the shortest loop period, enabling you to customize the simulation length for data generated on each run. The following code example is an implementation of this.

<simulation.py>
def find_times(self):
    stage = omni.usd.get_context().get_stage()
    anim_prims = stage.GetPrimAtPath("/World/Characters/Biped_Setup/Animations").GetAllChildren()
    times = {}
    for anim in anim_prims:
        # Step through the animation's translation samples until the value
        # stops changing, which marks the end of the shortest loop
        counter = 0
        attr = anim.GetAttribute("translations")
        prev_val = None
        curr_val = attr.Get(counter)
        while prev_val != curr_val:
            counter += 1
            prev_val = curr_val
            curr_val = attr.Get(counter)
        # Record the loop length only if the value stays constant afterward
        if attr.Get(counter + 1) == attr.Get(counter + 2):
            times[anim] = counter
    return times

More settings for camera randomization can be adjusted in the settings.py file within the ORA extension. Properties such as character_focus can be set to True, and parameters such as character_radius can be modified to spawn cameras relative to character positions, ensuring that characters remain in view and are not occluded during data generation.
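
The idea behind character_radius-style spawning can be sketched in a few lines: sample an azimuth and height, place the camera on a ring of the given radius around the character, and aim it at the subject so it stays in frame. The following is a plain-Python illustration; the parameter names echo the settings described above but are not the extension's actual variables.

import math
import random

def sample_camera_pose(character_pos, character_radius=3.0,
                       min_height=1.5, max_height=3.0):
    """Return an illustrative (position, look_at) pair around a character.

    character_pos: (x, y, z) of the character root in meters.
    character_radius: distance of the camera ring from the character.
    """
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    height = random.uniform(min_height, max_height)
    position = (character_pos[0] + character_radius * math.cos(azimuth),
                character_pos[1] + character_radius * math.sin(azimuth),
                height)
    # Aim roughly at the character's torso so the subject stays in view
    look_at = (character_pos[0], character_pos[1], character_pos[2] + 1.0)
    return position, look_at

print(sample_camera_pose((0.0, 0.0, 0.0)))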

An image of a person standing shows the various ways to randomize various parameters of the cameras, such as location from the subject, height from the ground, its projection angle, and more.
Figure 4. Parameters used to control camera randomization and placement in Isaac Sim configuration files

Isaac Sim can be executed headlessly in a container after configurations are set up and optional modifications to the extension code have been made.

A diagram shows Omni.Replicator.Agent.Core at the center, above Isaac Sim and Replicator. The top two layers are agent core SDG features such as character randomization and camera angle. The user’s key modifications to control the data generation include custom animation length or positioning the character in the frame.
Figure 5. Modifications made to an Isaac Sim application and the portion containerized for deployment

The entire Isaac Sim application instance can be containerized by pulling the desired Isaac Sim container from NGC and migrating the extension modifications to the /extscache subfolder in the container, as described earlier. You can then launch the data generation job with the scheduler script, for example:

./python.sh tools/isaac_people/sdg_scheduler.py -c /isaac-sim/curr_sdg_data/mount_sports_config_files/{filename} -n 1
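
To queue many animation and scene combinations on a single machine, the same scheduler call can be wrapped in a small batch script. The following is a minimal sketch; the mounted config directory and the Isaac Sim install root are illustrative assumptions.

import pathlib
import subprocess

# Illustrative mount point containing one ORA config file per job
CONFIG_DIR = pathlib.Path("/isaac-sim/curr_sdg_data/mount_sports_config_files")

# Run the scheduler once per config file, one job (-n 1) at a time
for config_file in sorted(CONFIG_DIR.glob("*.yaml")):
    subprocess.run(
        ["./python.sh", "tools/isaac_people/sdg_scheduler.py",
         "-c", str(config_file), "-n", "1"],
        cwd="/isaac-sim",   # assumed Isaac Sim install root inside the container
        check=True,
    )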

Enable large-scale data generation

To help scale and orchestrate the data generation process, you can use NVIDIA OSMO, a cloud-native orchestration platform for scaling complex, multi-stage, and multi-container robotics workloads across a hybrid infrastructure. With OSMO, we accelerated data generation by 10x on 10 NVIDIA A40 GPUs.

With these steps, we created 25,880 samples with 84 action animations and 4-5 camera angles for 40 different characters:

  • 8,400 warehouse
  • 6,600 hospital
  • 4,800 retail
  • 7,600 sports

GIF shows generated action data for retail use cases. In this GIF, a customer is checking out.
Figure 6. SDG scene created for a retail setting

Train an action recognition model with synthetically generated data

Now, you can use the synthetic data to expand the capabilities of a spatial-temporal graph convolutional network (ST-GCN) model, a machine learning model that detects human actions based on skeletal information. 

In this example, we trained the PoseClassificationNet model (ST-GCN architecture) on top of the 3D skeleton data produced by Isaac Sim with NVIDIA TAO, a framework for efficiently training and fine-tuning ML models.

A workflow diagram shows an input layer, normalization layer, 10 ST-GCN blocks, global average pooling, a fully connected layer, and the final output layer.
Figure 7. Spatio-temporal graph convolution network architecture
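
For intuition, each ST-GCN block alternates a graph convolution across the skeleton joints with a temporal convolution across frames. The following PyTorch snippet is a simplified sketch of that pattern under assumed tensor shapes; it is not the PoseClassificationNet implementation that ships with TAO.

import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Simplified spatial-temporal graph convolution block (illustrative only)."""

    def __init__(self, in_channels, out_channels, A, temporal_kernel=9):
        super().__init__()
        # A: (V, V) normalized adjacency matrix over V skeleton key points
        self.register_buffer("A", A)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=(temporal_kernel, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, key points), for example (8, 3, 650, 25)
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate over the skeleton graph
        return self.relu(self.temporal(x))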

The skeleton data from Isaac Sim is first converted into key points. A key point is either represented directly by a joint in the character skeleton or calculated when no corresponding joint can be found. The character skeleton is defined in the Renderpeople rigged assets. For an example, see Bundle Casual Rigged 002 from Renderpeople.
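
Conceptually, the conversion walks the rig and either copies a joint position or derives a key point from neighboring joints. The following sketch illustrates the idea with hypothetical joint names; the actual names depend on the Renderpeople rig and the ORA skeleton writer.

import numpy as np

def to_keypoints(joints):
    """Convert one frame of rig joints into key points.

    joints: dict mapping rig joint names to (x, y, z) positions.
    The joint names below are hypothetical placeholders.
    """
    keypoints = {
        "left_shoulder": np.asarray(joints["LeftShoulder"]),
        "right_shoulder": np.asarray(joints["RightShoulder"]),
        "left_hip": np.asarray(joints["LeftUpLeg"]),
        "right_hip": np.asarray(joints["RightUpLeg"]),
    }
    # Key points with no corresponding joint are calculated from nearby ones,
    # for example the neck as the midpoint of the two shoulders.
    keypoints["neck"] = 0.5 * (keypoints["left_shoulder"] + keypoints["right_shoulder"])
    keypoints["pelvis"] = 0.5 * (keypoints["left_hip"] + keypoints["right_hip"])
    return keypoints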

When training the model, we created different splits of the data by varying the number of characters and the number of action frames used. We found the best performance when truncating or padding all animation sequences to 650 frames and training with 35 characters, plus data for five additional characters exhibiting random jitters and rotations.
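
Truncating or padding every clip to a fixed length keeps the input tensor shape constant across the dataset. The following is a minimal NumPy sketch of that preprocessing step; the 650-frame target comes from the experiment above, while the array layout is an assumption.

import numpy as np

def fix_length(sequence, target_frames=650):
    """Truncate or zero-pad a skeleton sequence to target_frames.

    sequence: array of shape (frames, key points, 3).
    """
    frames = sequence.shape[0]
    if frames >= target_frames:
        return sequence[:target_frames]
    padding = np.zeros((target_frames - frames, *sequence.shape[1:]), dtype=sequence.dtype)
    return np.concatenate([sequence, padding], axis=0)

clip = np.random.rand(412, 25, 3)   # illustrative clip: 412 frames, 25 key points
print(fix_length(clip).shape)       # (650, 25, 3)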

After training the ST-GCN model with TAO, we obtained an average of 97% test accuracy across 85 classes of action recognition. To further test the robustness of the model against real data, we used the NTU-RGB+D dataset’s 25 keypoint skeleton data for classes of actions that mapped well between those available in the NTU dataset and those in our custom SDG dataset. 

| NTU action | Number of samples | Model trained on SDG and tested on NTU (Top 5) | Model trained on NTU and tested on NTU (Top 5) |
| --- | --- | --- | --- |
| Drinking water | 948 | 89.14% | 92.347% |
| Sitting down from standing up | 948 | 98.73% | 100% |
| Standing up from sitting | 948 | 99.37% | 100% |
| Falling | 948 | 82.17% | 95.82% |
| Walking apart | 948 | 87.45% | 94.68% |
| Make victory sign | 948 | 99.46% | 100% |
Table 1. High accuracy results of the synthetic action recognition ST-GCN model (column 3) when tested on real data

Compared to state-of-the-art performance, the customized model performs well, considering that it was trained only on synthetic data and evaluated with zero-shot inference on NTU data that it was never trained on, covering significantly different action classes.

The training was iterative. Initially, the model performed well on some classes but poorly on others. To balance the dataset, we added more assets and variations within classes, such as the sitting class. 

This refinement improved accuracy across all classes, and SDG made it easy to scale the dataset.

A bar chart shows different classes such as falling, crane operation, examining, lifting heavy objects, and walking and the number of assets for each.
Figure 8. Dataset distribution for each class before and after adding more assets

Try it today

Synthetic data generation (SDG) accelerates model training by creating high-fidelity artificial data when real-world data is limited. It improves data diversity, helps the model generalize across a multitude of use cases and scenarios, and can improve model accuracy and performance.

Open-source frameworks such as SynthDa can also be used in conjunction with Isaac Sim to add more ways to generate synthetic data from real-world data.

Get started today with Isaac Sim.

Acknowledgments

Thanks to Jiajun Li, Haoquan Liang, Anil Ubale, and Aik Beng Ng for their contributions to this post and project.
