Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Index Corpus Data for RAG#

The initial phase of a RAG pipeline involves indexing the corpus data used for retrieving relevant information for the text generation process. This phase includes chunking documents, extracting embeddings, and storing the chunks and corresponding embeddings in a database.

The provided script runs the complete indexing process using the LlamaIndex library in conjunction with a pre-trained NeMo embedding model. LlamaIndex is a Python library that efficiently connects data sources to LLMs. It offers tools for chunking, embedding, managing the index, and retrieving text for generation. In this procedure, the pre-trained NeMo model serves as the embedder, extracting embeddings from the chunked corpus text.
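
The chunk/embed/store flow described above can be sketched schematically. The following stdlib-only Python is an illustration of the three phases, not the actual NeMo or LlamaIndex code; `chunk`, `embed`, and `build_index` are hypothetical helpers, and `embed` is a placeholder for the pre-trained NeMo embedder.

```python
def chunk(text: str, chunk_size: int = 64, overlap: int = 16) -> list[str]:
    """Split a document into overlapping fixed-size character chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk_text: str) -> list[float]:
    """Placeholder embedder: a real pipeline calls the NeMo model here."""
    return [float(len(chunk_text)), float(sum(map(ord, chunk_text)) % 997)]

def build_index(corpus: list[str]) -> list[tuple[str, list[float]]]:
    """Store each chunk alongside its embedding, as a vector store would."""
    index = []
    for doc in corpus:
        for c in chunk(doc):
            index.append((c, embed(c)))
    return index
```

A real deployment replaces `embed` with the NeMo embedding model and persists the (chunk, embedding) pairs in a vector database rather than an in-memory list.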

Run Indexing on a Base Model#

This section provides basic instructions for running RAG indexing on a Slurm cluster.

To initiate indexing:

  1. Assign the stages variable in conf/config.yaml to "rag_indexing".

  2. Define the indexing configuration by setting the rag_indexing variable to an <embedder_model_type>/<model_size> value, which resolves to a specific embedder config file path.

    For example, setting the rag_indexing variable to bert/340m selects the configuration file conf/rag_indexing/bert/340m.yaml, which corresponds to a BERT-type embedder model with 340 million parameters.
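
The mapping from the variable value to the config file path can be pictured as a simple string template. The helper below is hypothetical, shown only to make the convention concrete; it is not part of the framework's code.

```python
def embedder_config_path(rag_indexing: str) -> str:
    """Map an <embedder_model_type>/<model_size> value to its config path."""
    # e.g. "bert/340m" -> "conf/rag_indexing/bert/340m.yaml"
    return f"conf/rag_indexing/{rag_indexing}.yaml"
```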

Run Indexing on a Slurm Cluster#

To run indexing on a Slurm cluster:

  1. Set the run configuration in conf/rag_indexing/bert/340m.yaml to define the job-specific configuration:

    run:
      name: ${.eval_name}_${.model_train_name}
      time_limit: "4:00:00"
      dependency: "singleton"
      nodes: 1
      ntasks_per_node: 1
      eval_name: rag_indexing
      model_train_name: rag_pipeline
      results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
    
  2. Set the path for the embedder checkpoint, corpus data, and saved index:

    indexing:
      embedder:
        model_path: /path/to/checkpoint_dir
      data:
        data_path: /path/to/corpus_data
      index_path: /path/to/saved_index
    
  3. Set the configuration for the Slurm cluster in conf/cluster/bcm.yaml:

    partition: null
    account: null
    exclusive: True
    gpus_per_task: null
    gpus_per_node: 8
    mem: 0
    job_name_prefix: 'nemo-megatron-'
    srun_args:
    - "--no-container-mount-home"
    
  4. Set the stages section of conf/config.yaml:

    stages:
      - rag_indexing
    
  5. Run the Python script:

    python3 main.py
    

    All the configurations are read from conf/config.yaml and conf/rag_indexing/bert/340m.yaml.
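
Conceptually, composing the two files amounts to layering the stage-specific config over the base config. The sketch below is a stdlib approximation of that recursive merge, assuming Hydra-style composition; `deep_merge` and the sample dictionaries are illustrative, not the framework's actual loader.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively layer `override` on top of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Toy stand-ins for conf/config.yaml and conf/rag_indexing/bert/340m.yaml.
base_cfg = {"stages": ["rag_indexing"], "indexing": {"embedder": {}}}
stage_cfg = {"indexing": {"embedder": {"model_path": "/path/to/checkpoint_dir"}}}
cfg = deep_merge(base_cfg, stage_cfg)
```

Keys defined in the stage file override or extend the base file, while unrelated base keys (such as stages) pass through unchanged.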