Curating Datasets for Parameter Efficient Fine-tuning with Synthetic Data Generation

This tutorial demonstrates the usage of NeMo Curator's Python API for data curation, synthetic data generation, and qualitative score assignment to prepare a dataset for parameter-efficient fine-tuning (PEFT) of LLMs.

We demonstrate the pipeline using the Law StackExchange dataset, a collection of legal questions and answers. Each record consists of a question, some context, and one or more human-provided answers.

In this tutorial, we apply various filtering and processing operations to the records. We then demonstrate the use of external LLM services for synthetic data generation, along with reward models that assign qualitative metrics to each synthetic record. Finally, we use NeMo Curator's facilities to iteratively augment and refine the data until the dataset reaches the desired size.

Note: The use of external LLM services for synthetic data generation is entirely optional. Similarly, this tutorial can be executed on a local machine without the need for a GPU. To fully experience all the capabilities of this code, see the "Optional Prerequisites" section below.

Overview of the Pipeline

The pipeline in this tutorial demonstrates a basic loop with two stages. These stages are repeated until the desired dataset size is achieved:

  1. Data processing: perform operations such as HTML tag cleaning, quality-based filtering and semantic deduplication on the records.
  2. Synthetic data generation: query a synthetic data generation model (such as LLaMa 3.1 405B Instruct, or Nemotron-4 340B Instruct) to produce synthetic variants of existing records. Each synthetic record is then fed to a reward model (such as Nemotron-4 340B Reward), and assigned a quality score. All records are then fed to the data processing stage for further processing.
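
The loop can be summarized with the following schematic. This is an illustration only: the function names below are placeholders for the corresponding logic in main.py, not the tutorial's actual API.

def process(records):
    # Stage 1: HTML cleaning, quality-based filtering, semantic deduplication.
    return records

def generate_and_score(records):
    # Stage 2: produce synthetic variants via an LLM, then score each variant
    # with a reward model. Returns the list of scored synthetic records.
    return []

def curate(records, synth_gen_rounds=1):
    for _ in range(synth_gen_rounds):
        records = process(records)
        records = records + generate_and_score(records)
    # One final processing pass over the combined real and synthetic records.
    return process(records)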

The following diagram depicts the pipeline demonstrated in this tutorial:

(Pipeline diagram: records flow through the data processing stage, then the synthetic data generation stage, and loop back until the target dataset size is reached.)

Code Structure

This code is organized as follows:

  • main.py: the entry point to the code. Implements the data curation and synthetic data generation pipeline, and consists of the following high-level functionality:
    • download_and_convert_to_jsonl(): contains the logic necessary to download the sample dataset and convert it into JSONL.
    • random_split_rows(): contains the logic for splitting the dataset into training, validation, and test sets.
    • semantic_dedupe(): implements the semantic deduplication functionality (requires an NVIDIA GPU).
    • run_curation_pipeline(): the main curation pipeline implementation. Captures the data processing, as well as the synthetic data generation operations.
  • docbuilder.py: contains the implementations of NeMo Curator document builder modules to facilitate dataset download and conversion into the JSONL format.
  • filters.py: contains the implementation of a score-based filtering mechanism for removing low-quality documents. Used in run_curation_pipeline().
  • modifiers.py: contains the implementation of the HTML-cleaning logic. Used in run_curation_pipeline().
  • synthetic_gen.py: abstracts the logic needed for invoking the synthetic data generation model, and also assigning reward scores to each record. Used in run_curation_pipeline().
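
As an illustration of how these modules plug into NeMo Curator, the following is a minimal sketch of an HTML-cleaning modifier in the spirit of modifiers.py. The class name and the regex-based approach are assumptions for illustration; the tutorial's actual implementation may differ.

import re

from nemo_curator.modifiers import DocumentModifier

class HTMLTagCleaner(DocumentModifier):  # hypothetical name
    TAG_RE = re.compile(r"<[^>]+>")

    def modify_document(self, text: str) -> str:
        # Strip all HTML tags, keeping only the visible text.
        return self.TAG_RE.sub("", text)

In a full pipeline, such a modifier would typically be applied to a DocumentDataset through NeMo Curator's Modify wrapper.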

Optional Prerequisites

The following is a list of optional dependencies to allow experimentation with all the features showcased in this code:

  • To run the data curation pipeline with semantic deduplication enabled, you need an NVIDIA GPU.
  • To generate synthetic data, you need a synthetic data generation model compatible with the OpenAI API. Out of the box, this tutorial supports LLaMa 3.1 405B Instruct and Nemotron-4 340B Instruct through the build.nvidia.com API gateway.
  • To assign qualitative metrics to the generated records, you need a reward model compatible with the OpenAI API (such as the Nemotron-4 340B Reward model).

Note: A valid build.nvidia.com API key is required to use any of the above models.
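
Since these models sit behind an OpenAI-compatible API, querying them amounts to a standard chat-completion call. Here is a minimal sketch using the openai Python package; the base URL is build.nvidia.com's public gateway, and the prompt is purely illustrative.

from openai import OpenAI

# build.nvidia.com serves its models through an OpenAI-compatible gateway.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_BUILD.NVIDIA.COM_API_KEY",
)

response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Paraphrase this legal question: ..."}],
    temperature=0.7,
)
print(response.choices[0].message.content)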

Usage

After installing the NeMo Curator package, you can simply run the following commands:

# Running the basic pipeline (no GPUs or external LLMs needed)
python tutorials/peft-curation-with-sdg/main.py

# Run with synthetic data generation and semantic deduplication
python tutorials/peft-curation-with-sdg/main.py \
    --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
    --device gpu

# To control the amount of synthetic data to generate using LLaMa 3.1 405B:
# --device gpu uses the GPU and enables semantic deduplication,
# --synth-gen-rounds 1 does one round of synthetic data generation,
# --synth-gen-ratio 0.001 generates synthetic data from 0.1% of the real data.
python tutorials/peft-curation-with-sdg/main.py \
    --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
    --device gpu \
    --synth-gen-rounds 1 \
    --synth-gen-ratio 0.001 \
    --synth-gen-model "meta/llama-3.1-405b-instruct"

# Same as above, but using Nemotron-4 340B as the generation model
python tutorials/peft-curation-with-sdg/main.py \
    --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
    --device gpu \
    --synth-gen-rounds 1 \
    --synth-gen-ratio 0.001 \
    --synth-gen-model "nvidia/nemotron-4-340b-instruct"

By default, this tutorial uses at most 8 workers to run the curation pipeline. If you encounter out-of-memory issues, reduce the number of workers by supplying the --n-workers=N argument, where N is the number of workers to spawn.

Once the code finishes executing, the curated dataset will be available under data/curated/final. By default, the script outputs splits for training (80%), validation (10%) and testing (10%).
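
To sanity-check the result, you can load the output JSONL files directly. The per-split file names are not specified here, so the sketch below simply globs the output directory:

import glob

import pandas as pd

for path in sorted(glob.glob("data/curated/final/**/*.jsonl", recursive=True)):
    df = pd.read_json(path, lines=True)
    print(f"{path}: {len(df)} records, columns: {list(df.columns)}")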

Next Step: Fine-tune Your Own Model

The curated dataset from this tutorial can be readily used for model customization and fine-tuning using the NeMo Framework. Please refer to the law title generation tutorial in the NeMo Framework repository to learn more.