Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Index Corpus Data for RAG
The initial phase of a RAG pipeline involves indexing the corpus data used for retrieving relevant information for the text generation process. This phase includes chunking documents, extracting embeddings, and storing the chunks and corresponding embeddings in a database.
The provided script runs a complete process using the LlamaIndex library in conjunction with a pre-trained NeMo embedding model. LlamaIndex, a Python library, efficiently connects data sources to LLMs. It offers tools for chunking, embedding, managing the index, and retrieving text for generation. In this procedure, the pre-trained NeMo model serves as the embedder, extracting embeddings from chunked text within the corpus.
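Conceptually, the indexing phase reduces to three operations: split each document into chunks, embed each chunk, and store the chunk/embedding pairs. Below is a minimal, library-free sketch of the chunking step only; `chunk_text` is an illustrative stand-in, not the LlamaIndex or NeMo API, and real pipelines typically split on tokens rather than characters:

```python
def chunk_text(text: str, chunk_size: int = 64, overlap: int = 16) -> list[str]:
    """Split text into overlapping fixed-size windows.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries. Illustrative only; production chunkers split on tokens
    or sentences.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

# Each resulting chunk would then be passed through the embedder and
# stored, together with its embedding, in the index.
chunks = chunk_text("a" * 100, chunk_size=40, overlap=10)
```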
Run Indexing on a Base Model
This section provides basic instructions for running RAG indexing on a Slurm cluster.
To initiate indexing:
1. Assign the `stages` variable in `conf/config.yaml` to `"rag_indexing"`.

2. Define the configuration for indexing by setting the `rag_indexing` variable to `<embedder_model_type>/<model_size>`, which selects a specific embedder config file. For example, setting the `rag_indexing` variable to `bert/340m` specifies the configuration file path as `conf/rag_indexing/bert/340m.yaml`. This path corresponds to a BERT-type embedder model with 340 million parameters.
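The mapping from the `rag_indexing` setting to a config file path can be sketched as follows; `rag_indexing_config_path` is a hypothetical helper written for illustration, not part of the NeMo codebase:

```python
from pathlib import Path


def rag_indexing_config_path(setting: str, conf_root: str = "conf") -> Path:
    """Map a '<embedder_model_type>/<model_size>' setting (e.g. 'bert/340m')
    to the YAML file the pipeline reads. Illustrative helper only."""
    model_type, model_size = setting.split("/")
    return Path(conf_root) / "rag_indexing" / model_type / f"{model_size}.yaml"
```

For example, `rag_indexing_config_path("bert/340m")` yields `conf/rag_indexing/bert/340m.yaml`.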
Run Indexing on a Slurm Cluster
To run indexing on a Slurm cluster:
1. Set the `run` configuration in `conf/rag_indexing/bert/340m.yaml` to define the job-specific configuration:

   ```yaml
   run:
     name: ${.eval_name}_${.model_train_name}
     time_limit: "4:00:00"
     dependency: "singleton"
     nodes: 1
     ntasks_per_node: 1
     eval_name: rag_indexing
     model_train_name: rag_pipeline
     results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
   ```

2. Set the paths for the embedder checkpoint, the corpus data, and the saved index:

   ```yaml
   indexing:
     embedder:
       model_path: /path/to/checkpoint_dir
     data:
       data_path: /path/to/corpus_data
     index_path: /path/to/saved_index
   ```

3. Set the configuration for the Slurm cluster in `conf/cluster/bcm.yaml`:

   ```yaml
   partition: null
   account: null
   exclusive: True
   gpus_per_task: null
   gpus_per_node: 8
   mem: 0
   job_name_prefix: 'nemo-megatron-'
   srun_args:
     - "--no-container-mount-home"
   ```

4. Set the `stages` section of `conf/config.yaml`:

   ```yaml
   stages:
     - rag_indexing
   ```

5. Run the Python script:

   ```bash
   python3 main.py
   ```

All the configurations are read from `conf/config.yaml` and `conf/rag_indexing/bert/340m.yaml`.
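The effect of combining a top-level config with a stage-specific one can be sketched as a recursive dictionary merge. This is illustrative only; the pipeline itself composes configs with Hydra-style tooling rather than the hypothetical `deep_merge` below:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge 'override' into 'base' (override wins),
    mimicking how a stage config refines the top-level config.
    Illustrative sketch, not the NeMo implementation."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stage config fills in a path the top-level config left unset.
config = deep_merge(
    {"stages": ["rag_indexing"], "indexing": {"embedder": {"model_path": None}}},
    {"indexing": {"embedder": {"model_path": "/path/to/checkpoint_dir"}}},
)
```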