How I Created an AI Version of Myself
Generative AI could best be described as a frustrating breakthrough. When ChatGPT was first released in late 2022, there was open-mouthed, wide-eyed amazement at the quality of the natural language it produced. Since then, numerous updates to that product have come on the scene, as well as a plethora of competitor products.
But the initial excitement has given way to numerous frustrations with the technology. Controlling the output of these models is challenging: hallucinations mean that the content they generate is not always reliable, even if the natural language is persuasive. Training cut-off dates often mean content is out of date. And models usually lack the background context for a specific request, so their responses are often off the mark or too generic to be useful, especially in organizational or business settings.
Retrieval Augmented Generation (RAG) is a way of using large language models in a substantially more controlled way. Without expensive fine-tuning, and using a fairly simple workflow, a model can be fed relevant contextual information and restricted to responding only on the basis of that information, or to prioritizing it. In this way the true value of the large language model is unlocked, as an automated natural-language summarizer of content, and undesirable behaviors such as hallucination can be minimized or possibly even eliminated.
To illustrate this, I have put together this technical tutorial where I create a minimal RAG application that answers questions related to my statistics textbook Handbook of Regression Modeling. By the end of this tutorial I will have constructed a simple pipeline that allows you to ask an LLM a question and receive an answer based only on what is in my textbook. So it’s kind of like an AI version of myself, using only my statistics knowledge to answer your questions.
This tutorial is completely replicable, as all of the material I use, including my textbook, is available open source. I even use an open LLM (Google’s Gemma-7b, which is built from the same research and technology as its Gemini models). However, a word of warning: I am only able to run Gemma locally because I am using an extremely high-spec machine (MacBook Pro M3 Max with 128GB RAM). If you do not have such a high-spec machine, you will need to use cloud resources or change the pipeline to call an API to a hosted LLM like ChatGPT.
An overview of Retrieval Augmented Generation (RAG)
In a typical LLM interaction, a user will send a prompt directly to an LLM and receive a response purely based on the LLM’s training set. For general tasks, this can be useful. But when the prompt requires specialist knowledge or context, the response is usually unsatisfactory.
The idea behind RAG is that the prompt pays a visit to a specialized knowledge database, picks up a few relevant documents, and takes them with it when it hits the LLM. The prompt can be constructed to restrict the LLM to answering only on the basis of the documents that accompany it, thus ensuring a higher-quality, contextual response.
The architecture in the diagram above can be considered to have two components:
An information retrieval component, which finds and extracts documents in a document store that closely match the prompt. This is done using a vector database, where embeddings are used to determine document similarity.
An LLM component, which constructs a new prompt using the original prompt and the matched documents and sends this to our LLM to elicit a response
In this tutorial I will construct a minimal example of this architecture using Python. You can find the full Jupyter notebook here.
Retrieving and preparing my documents
The document store will contain content from my textbook, which exists in open source form at this GitHub repo. The book is structured into 14 chapters with sections containing text, code and mathematical formulas, each chapter generated from a markdown document.
First I will install some packages that I need.
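Here is a minimal sketch of the installs this pipeline needs; the exact package list is my assumption based on the tools used later in the tutorial, so adjust as required.

```python
# Install the packages used in this tutorial (package list is an assumption;
# pin versions as needed). Run this in a notebook cell.
!pip install pandas requests langchain chromadb sentence-transformers transformers accelerate torch
```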
Now I will pull down the text of all 14 chapters of my textbook and store them in a Pandas dataframe:
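A sketch of that retrieval step is below. The repository URL and the chapter file naming pattern are assumptions for illustration; check the repo for the actual paths.

```python
import pandas as pd
import requests

# Base URL for the raw chapter markdown files -- hypothetical path and naming
# convention; adjust to match the actual repository structure
BASE_URL = "https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/r"

chapters = []
for i in range(1, 15):  # 14 chapters
    url = f"{BASE_URL}/{i:02d}-chapter.Rmd"  # hypothetical file name pattern
    response = requests.get(url)
    response.raise_for_status()
    chapters.append({"chapter": i, "text": response.text})

book_df = pd.DataFrame(chapters)
print(book_df.shape)
```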
Now, at this point I have 14 documents, each quite long. Thinking ahead, any documents that I send to my LLM will need to fit in its context window, which is a maximum on the number of tokens (roughly, words) that the LLM can process at once. With the current document lengths I can’t control this, so I am going to need to split these documents into a larger number of shorter documents.
However, I can’t just split them at arbitrary points; I need to split them semantically, so that an individual document is not cut off at some random point and makes sense as a whole. I can use a neat function in the LangChain Python package to do this. I’m going to limit my documents to 1000 words and allow up to a 150-word overlap between them.
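A sketch of this splitting step, assuming LangChain’s RecursiveCharacterTextSplitter with a word-counting length function to match the 1000-word chunks and 150-word overlap described above:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each chapter into semantically coherent chunks; chunk_size and
# chunk_overlap are measured by length_function, here counting words
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    length_function=lambda text: len(text.split()),
)

# create_documents returns a list of LangChain Document objects
split_docs = splitter.create_documents(book_df["text"].tolist())
print(f"{len(book_df)} chapters split into {len(split_docs)} documents")
```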
My 14 original documents have been split into 578 smaller documents. Let’s take a look at the first document so we can see what LangChain’s document format looks like:
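Something like this prints the first document (the slicing is just for readability):

```python
# Inspect the first split document
first_doc = split_docs[0]
print(first_doc.page_content[:500])  # the chunk text
print(first_doc.metadata)            # any attached metadata
```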
We can see that the document object contains a page_content key, which has our text of interest inside it, and a metadata key, which is not of interest to us here.
Setting up the vector database to contain my documents
Now that I have my documents at an appropriate length, I will need to load them into a vector database. A vector database stores text both in its original form and as embeddings, which are large arrays of floating point numbers, fundamental to how large language models process language. Words, sentences or documents that have ‘close’ embeddings in multidimensional space will be closely related to each other in content. See the diagram above for a 2D graphical simplification of the concept of embeddings.
When I submit a prompt, the vector database will use the embedding of the prompt, and find the closest embeddings from the documents it contains. There are numerous options for how to define ‘closest’. In this case I will use cosine similarity, which uses the cosine of the angle between two embeddings to determine distance. The higher the cosine similarity, the more related two documents are.
To make this easy I will use the chromadb Python package, which allows me to set up a vector database on my local machine. First I need to set up the database and define the embedding model and distance metric it will use. I will pick a standard, efficient language model for generating the embeddings.
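Here is a sketch of that setup, assuming chromadb’s persistent local client and the all-MiniLM-L6-v2 sentence-transformer as the ‘standard, efficient’ embedding model; both choices are my assumptions.

```python
import chromadb
from chromadb.utils import embedding_functions

# Local, persistent Chroma instance (storage path is arbitrary)
client = chromadb.PersistentClient(path="chroma_db")

# Assumed embedding model: a small, efficient sentence-transformer
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection that uses cosine similarity as its distance metric
collection = client.create_collection(
    name="regression_handbook",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"},
)
```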
Now I am ready to load my documents. Note that vector databases have limits on how many documents can be loaded in any one operation. In my case I only have a few hundred documents so I should be fine, but I am going to set up a batch load anyway, just to be on the safe side.
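A minimal batch-loading sketch; the batch size of 1000 is an arbitrary safe choice, so with 578 documents only one batch is needed.

```python
# Load the document texts into the collection in batches
texts = [doc.page_content for doc in split_docs]
batch_size = 1000  # arbitrary; keep under the database's per-call limit

for start in range(0, len(texts), batch_size):
    batch = texts[start:start + batch_size]
    collection.add(
        documents=batch,
        ids=[f"doc_{i}" for i in range(start, start + len(batch))],
    )
    print(f"Loaded batch starting at document {start}")
```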
There was only one batch needed, but in case you try this with a longer document set, you can use the code above to load multiple batches.
Now my document store is set up. I can test it to see if it returns documents that have some similarity to my query. I’ll ask a statistics question and request the three closest matching docs.
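For example (the question here is just a stand-in):

```python
# Ask a statistics question and retrieve the three closest documents
results = collection.query(
    query_texts=["How do I interpret the coefficients of a logistic regression model?"],
    n_results=3,
)

for doc in results["documents"][0]:
    print(doc[:300])
    print("---")
```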
Looks pretty good to me! Now I have completed the information retrieval layer. Time to move on to the LLM layer.
Setting up Google’s Gemma-7b-it LLM on my machine
Gemma-7b-it is the 7 billion parameter, instruction-tuned version of Gemma, Google’s open model built from the same research and technology as its Gemini models. It is about 20GB in size, and is available on Hugging Face. You will need to agree to the terms of use and then get an access token to download and use the model. For a model of this size, you’ll need some pretty impressive CPU, GPU and RAM to run it.
First I will access the model and download it. This might take a while the first time you do it, but it will load quickly once downloaded and in your cache.
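A sketch of the download step using the transformers library; the dtype and device placement settings are assumptions that you may need to adjust for your hardware, and the Hugging Face token is assumed to live in an environment variable.

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Access token for the gated google/gemma-7b-it model -- assumed to be stored
# in the HF_TOKEN environment variable
hf_token = os.environ["HF_TOKEN"]

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduces memory use; assumes hardware support
    device_map="auto",           # place the model on GPU/MPS if available
    token=hf_token,
)
```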
Let’s test it out. This model likes to receive prompts in the form of a conversation between model and user. Here is a recommended prompt format:
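Gemma’s instruction-tuned models expect the conversation to be wrapped in turn markers like this (the example question is my own stand-in):

```python
# Gemma-style chat prompt: a user turn followed by the start of a model turn
prompt = """<start_of_turn>user
Give me a recipe for a delicious chocolate cake.<end_of_turn>
<start_of_turn>model
"""
```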
Now I am going to encode this prompt into tokens that Gemma understands, send it to Gemma to generate a response, decode that response back into text, and slice off the new part that Gemma generated for me.
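A sketch of that encode, generate and decode step; max_new_tokens is an assumed setting.

```python
# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a continuation
outputs = model.generate(**inputs, max_new_tokens=512)

# Slice off only the newly generated tokens and decode them back to text
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
```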
Oh man, that sounds delicious! We can see that Gemma generated a good quality response here. So we are ready for my final step, which is to set up the entire RAG pipeline, allowing the IR component to flow into the LLM component.
Setting up the RAG pipeline
In my basic pipeline here, I will define a function that takes a question and sends it to my vector database to retrieve some documents that closely match the question. Then it will combine the original question and the retrieved documents into a Gemma-friendly prompt, with instructions to stick to the documents in answering the question. Finally, it will encode the new prompt, use Gemma to generate a response, and decode the response.
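Putting the pieces above together, a minimal version of the pipeline might look like this; the instruction wording in the prompt is my own paraphrase, not necessarily the exact wording used in the original notebook.

```python
def answer_question(question: str, n_docs: int = 3, max_new_tokens: int = 512) -> str:
    # 1. Retrieve the closest-matching documents from the vector database
    results = collection.query(query_texts=[question], n_results=n_docs)
    context = "\n\n".join(results["documents"][0])

    # 2. Build a Gemma-friendly prompt that restricts the answer to the retrieved context
    prompt = f"""<start_of_turn>user
Answer the question below using only the context provided.
If the answer is not in the context, say that you don't know.

CONTEXT:
{context}

QUESTION: {question}<end_of_turn>
<start_of_turn>model
"""

    # 3. Encode, generate, and decode only the newly generated tokens
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```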
Testing this AI version of me
Now let’s ask a few questions to see if it responds based on the knowledge I wrote into my textbook.
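For example (the question is a stand-in for the ones I actually asked):

```python
print(answer_question("What is the difference between linear and logistic regression?"))
```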
That sounds like an answer I would give. Let’s try another:
Again, sounds like something I would say. And what if I ask it about something not covered in the textbook?
Nice! It’s staying in its lane, just like I would.
So what have we learned?
Not only has creating AI Keith been fun, but it also illustrates a simple architecture that can harness the power of LLMs while enhancing them with contextual information and controlling their tendency to hallucinate. This has significant applications in knowledge search, where this architecture offers a route to summarizing searches of large knowledge repositories.
As for AI Keith, I won’t be launching him any time soon. For a start, this example is too small scale, and trying to host an architecture like this in production and make it always available would likely be too expensive to justify its use. But I hope you can see the possibilities for larger scale situations.
What did you think of AI Keith? Have you played around with RAG architectures? Feel free to comment!