How To Summarize Public Opinion Using RAG AI
Having now spent almost two years being exposed to the new generation of generative models (starting with chatGPT), we are starting to accept the fact that these have limited value to us on their own, without being enhanced with other information. It’s becoming clearer that the value lies in how we integrate large amounts of proprietary information with the natural language generation capabilities of these models to achieve effective and efficient ways of giving us digestible summaries of that information.
Retrieval Augmented Generation (RAG) is a growing architecture that is simple and inexpensive, and is increasingly being used to experiment with improving the quality of informatiom returned by large language models. Described simply, proprietary documents are stored in a secure vector embedding database. Instead of a prompt (question) being sent directly to an LLM, the prompt is first compared with the documents in the proprietary database, and a number of closely matching documents are retrieved and added to the original question. A prompt is constructed to reflect the original question and the proprietary document information, and is sent to the LLM to generate an answer to the question based on the proprietary information. A simple RAG architecture can be seen in this diagram, and you can see my previous article where I used a RAG architecture to build a simple Q&A engine around my statistics textbook.
One potentially valuable RAG use case is where varied opinion provided in a large number of documents needs to be summarized into a digestible form. For example, in politics, the ability to be able to ask a question on a specific topic and obtain a summarized view of what the public or a voter group think about it could be very valuable. In business settings, many organizations struggle with how to make productive use of the large amount of text that comes in from customer or employee surveys. The ability to ask a question and get a summarized response based on this text would be a very valuable source of targeted intelligence.
Getting a summary of what New York Times readers think
In this tutorial I will demonstrate how to build this sort of application by taking a large data set of reader comments to articles in the New York Times during 2017 and 2018. I will embed all these comments into a vector database and then build a pipeline to allow us to ask what NY Times readers think about a specific topic, drawing on all these comments and using a variety of options for LLMs to summarize them.
The data I am going to use is publicly available on Kaggle here, and contains over 2.1 million comments left on articles by readers in various months in 2017 and 2018. The steps I will go through are as follows:
Reduce the Kaggle data set to just the comments made by users (removing other data from the dataset)
Embed these comments into a vector database
Construct a RAG pipeline function which takes a question, matches it with closely matching reader comments from our vector database, and then sends everything to our choice of LLMs to provide a summarized opinion from NYT readers
Finally, deploy all of this to a streamlit application, providing a user interface to make it simple for us to get the summarized opinions of NY Times readers.
If you don’t want to read through all the details of this tutorial, you can find all my code here — go forth and build!
Step 1: Getting the data from Kaggle and preparing it for loading into a Vector DB
The first step is basic data processing. There are a lot of files, so to make this easy, we will get the data files we need via the Kaggle API and then reduce them to just the files that contain user comments. To do this using the code below, you’ll need to obtain your Kaggle API key via your Kaggle account and store it in a kaggle.json file in your project root.
Next, we will create a single Pandas DataFrame containing only the comment text from all of the comment files, and also the year in which the comment was made (we will use the year as a metadata to filter on later). Finally, we will save this large Pandas DataFrame as a pickle file to allow us to easily load it up for the next step:
Step 2: Creating a Vector DB to store all our reader comments
Now we will set up a Vector DB and load our comments into it. A Vector DB uses a selected embedding model to store a document as a vector embedding. It will also store the text and any associated metadata.
In this case I will use a Chroma DB to set up a vector database on my local machine. This is only recommended if you have a powerful machine (I use a Macbook M3 Max), otherwise you should consider setting up a ChromaDB cloud database for this experiment, or using a fully managed cloud vector DB such as Pinecone.
First, we load the data from our prior pickle file and remove empty or very short comments from the dataset. Then we load the comments as langchain document objects, which will include the year of the comment as a metadata field.
Next we set up a Chroma DB in our local project directory. We select a model to covert comments into embeddings. We then define our document collection, giving each comment a unique ID, and including the year as a metadata field, and using cosine similarity as the distance function that will be used to determine which are the most relevant documents to our query. On my advanced machine, it took several hours to embed and load all this into ChromaDB. You can simplify and make quicker by further filtering the documents that you load (for example, you can restrict the comments to a specific month).
This created a ChromaDB that was about 12GB on my machine. Once this is all done, we can test by sending a query to our ChromaDB. The DB will use the same embedding model to embed the query and then use cosine similarity to fine the closest matching user comments in the database.
In this test query, we ask a question about user opinions on US foreign policy towards North Korea, and we request the ten closest comments made in 2017.
This produces the following (truncated) comments:
Step 3: Creating a function that executes a RAG pipeline
Now that all our readers’ comments are loaded into a vector DB, we now just have to write a pipeline that takes a question, collects matching comments from the database, and then constructs a prompt to send to an LLM of our choice to receive a summarized response.
To make this easy, let’s decouple the prompt construction from the LLM interaction. In this function we take a set of documents which will be user comments, and we take a question (which is the original prompt). We number the comments and then construct a prompt into which we will insert the original question and the numbered comments — this function will output the constructed prompt to be sent to the LLM.
Now we can write a function for the overall pipeline, which will take a question, an LLM client, a database client, as well as parameters to define the number of matching documents requested and any required filters. The function will then find the matching comments from the database, use our construct_prompt() function to combine everything and send it to the LLM client to obtain the final summarized result.
Here is an example of a function which calls GPT-4. This function will be limited by GPT-4’s context window of 8192 tokens, meaning that selecting too many comments may break the function. Note also that I have hidden my OpenAI credentials as environment variables in this code, and I suggest you do the same, especially if you intend to share your code:
Let’s test our function by asking it to summarize the top 30 matching user comments related to our original question on US foreign policy towards North Korea:
Similarly we can set up some local smaller language models on our machine to try the same RAG pipeline (see the corresponding code for this here). Note that we do not have to concern ourselves with token limits here. For example, here’s the response I get asking the same question to Llama3’s 70B parameter model using ollama on my machine, asking it to consider 100 matching comments:
Step 4: Building a front-end to this application using Streamlit
Now we’ve really done all the hard work in building the back-end RAG pipeline, it’s pretty easy to use a bit of Steamlit code to serve this up with a simple front end. I won’t go into intricate detail here, but I have packaged all my prior clients, objects and functions into a module which I have called ask_question which you can find here.
Then I use streamlit to serve up a simple application. The application has some options, including a choice of language model to use — I’ve added in Anthropic’s Claude 3 Opus as an additional LLM for good measure here. You can find the streamlit code here.
Let’s ask Claude our question on North Korea, given that its very large cotext window allows us to use a lot of user comments:
Let’s ask Claude what readers think about the music of Taylor Swift:
What do you think of this kind of RAG application to summarize the opinions of readers, the general public, voters, customers or employees? Feel free to comment!
I/O Psychologist- People analytics consultant for people and talent data
5moI really like how the first summary includes the comment number as a source citation. How was that done / is it reproducible? One area I’m finding in RAG for survey comments is that most topical responses may be very simple and few people leave detailed responses (Example most people just say compensation and not the reasons why they said compensation). As a result summarization seems to over rotate on these fewer but more detailed responses in the summary. So it seems like it’s a broad sentiment but it’s only informed by one comment. I think this source citation technique could help make that clear when reading a summary.
AI Generalist, Creative Catalyst
5moBravo! How hard would this be to implement on say, YouTube comments? Existing RAG only seems to grok transcripts.