How reliable are LLMs for RAG applications? The results might shock you!
I recently ran an impromptu test to evaluate how well top proprietary models answer queries using context retrieved from a vector data store.
The contenders:
(a) Gemini 1.5 Pro
(b) Claude 3.5 Sonnet
(c) GPT-4o
Can you guess which model came out on top? Watch the video to discover the eye-opening results!
This experiment raises crucial questions about model reliability. Why do some models falter on prompts that others excel at? If these proprietary models are constantly improving, why do we still see inconsistent performance across simple context-based queries?
The implications are significant: How can we build dependable AI applications when their core functionality relies on potentially unpredictable LLMs?
This challenge presents exciting opportunities for engineering teams. Building LLM-based applications isn't just about implementation – it's about navigating the complexities of these powerful yet sometimes erratic models.
If you'd like to know which strategies we can use to get consistent performance from LLMs in RAG solutions, let me know in the comments and I'll share them with you.
#AIReliability #LLMChallenges #RAGApplications #AIEngineering #openai #claude #googlegemini
Hello, in this video I want to show you a comparison of three proprietary models: OpenAI GPT-4o, Claude 3.5 Sonnet, and Google Gemini 1.5 Pro. I'll use each of them one by one in the same RAG application, where I give the model a good amount of context and ask it to analyze that context and build a response. Let's test how each of these models performs on the given context.

First, let's look at the prompt I'm feeding to the LLM along with the question. The prompt is very straightforward: I ask the LLM to act as a financial analyst and provide insights based on the financial data provided, with a few instructions. Context utilization: use only the parts of the given context that are relevant to the user's query. Focus only on the specific question asked, because sometimes we end up providing too much context that isn't relevant, and the LLM should be able to pick out the appropriate content to build the response. Then there is the answer structure: respond directly if the question is direct, and provide detailed information if it is open-ended. I also specify response length, accuracy, and honesty: if you don't know the answer, say so clearly. Finally, keep things clear and coherent.

Now I'll start with Gemini. I have three models defined in the code: the Google LLM (Gemini), Claude Sonnet, and the OpenAI model (the mini). I won't use the mini for this test; maybe I'll try it at the end. This is a Streamlit application, so I'll start it with "Run RAG app". The app is starting; I'll close the previous windows. I'm using Pinecone, and the index is loaded. There are two types of search here: document search and augmented search. Document search just goes to the vector DB and returns the data, so I'll show that first to confirm that every model is getting the same data; the variation will be in the responses, and that variation is what we want to understand. Augmented search is where the LLM is asked to answer the question.

First, the document search, with the number of results set to four. Let me also show the source document: it's the annual report of an Indian exchange-listed company called Aarti Industries. There is one page that talks about the nature of dues, with tabular data, and this data is already loaded into the vector DB. I'll copy the question from there and ask: "What are the nature of dues on the company?" This is just a plain vector search, no LLM involved. We got four results, and the first page includes the Finance Act and Service Tax entries for 2008-09 through 2011-12.
So this is the data we got back, along with the additional information on that page; everything is included. Then there is another page with the financial statements showing the customs duty and service tax — actually the page before it — which again lists the pending dues. There are two more pages, or chunks, that contain a lot of text that may not be directly related to what I asked, but that's because I requested four records. So you can see there is a lot of context being passed to the model.

Now let's test Google Gemini 1.5 Pro. With the index loaded, I go to augmented search and ask: "What are the nature of dues on the company?" There's a small UI glitch, but the query is already running; I'll open the window. Here is the response: "This document excerpt from the Aarti Industries Limited annual report does not contain information about the company's revenue. Therefore, I cannot answer your question." I also print the context that was sent, so let me copy it into a text editor to show how big it was. This whole context was given to the LLM: starting with the prompt I showed earlier, and below that the page content — a lot of data, the same data I showed you. We can also see the token counts: 4,583 input tokens, and only about 40 tokens returned. So that's what happened with Gemini 1.5 Pro.

Next, I go back to the code, stop the application, and switch the model to the Claude LLM. Everything stays the same; only the model is different. I start the application and ask Claude the same question. First I run the same query against the vector DB again, so we are sure the same data is coming back — and it is: the same first table and the same second table. Then I run the augmented search with the same query. The response comes back quickly, but look at it: "Based on the context provided, I can answer your query about Aarti Industries Limited. What specific information would you like to know?" I'm surprised — what else does it need to just answer the question? So we did not get the answer from either of them, and these are the top proprietary models trending on the internet these days.

Let me move to the third one, GPT-4o. I remove the previous model call and switch to the OpenAI one; I'll double-check the method to confirm we are using GPT-4o. All right, now let's run this again, and close the other windows — we don't need those.
Again I'll run the same query, first as a document search — I'll just copy the question and do the plain search so we're sure this model gets the same data the others got. We get the same data: the second table, the financial statements, and a lot of additional text. Now let's go back and do the augmented search, with the number of records set to four. Same small UI glitch, but the search is already running. This model takes a little more time; it is slow, which I think has already been noted about GPT-4o and is probably part of why the mini version was released. But look at the response: 692 tokens came back, and it is a detailed summary of the tax disputes — service tax, GST, VAT, income tax — each with the amounts. Let's open the report: the second table lists service tax, GST, VAT, and income tax, so it picked those up correctly. Yes, it could have done better by also including the customs duty, which was in the context as well, but there is always scope to improve the prompt. The point is that the same prompt was given to all three models, and GPT-4o was the one able to analyze the data, extract the relevant information for the question — "What are the nature of dues on the company?" — and even add useful extra detail about where dues are present and where they are not. A great response from GPT-4o.

That's all I wanted to show: these models are changing week over week, so it is very difficult to rely on any one of them. You build prompts and applications that depend on these models, and then at some point in the future they stop giving the same responses you were getting a few days back. So what do we do? At least this week — ending the second, or perhaps third, week of August — GPT-4o is doing a better job than the others. That's all I wanted to show. Thanks a lot.
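For reference, here is a minimal sketch of the two modes described in the video: a plain "document search" against the vector index, and an "augmented search" that passes the same retrieved chunks to the LLM. This is not the actual Streamlit app from the video; the index name, prompt wording, and the use of the OpenAI and Pinecone clients are illustrative assumptions.

```python
# Minimal sketch of the two search modes shown in the video. Assumptions:
# a Pinecone index named "annual-reports" already holds the report chunks
# (with the raw text stored under metadata["text"]), and OpenAI is used for
# embeddings and generation.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                      # reads OPENAI_API_KEY from env
pc = Pinecone(api_key="YOUR_PINECONE_KEY")    # placeholder key
index = pc.Index("annual-reports")            # hypothetical index name

SYSTEM_PROMPT = (
    "You are a financial analyst. Answer only from the provided context, "
    "focus on the specific question asked, and say clearly if the context "
    "does not contain the answer."
)

def document_search(question: str, k: int = 4) -> list[str]:
    """Plain vector search: no LLM involved, just return the top-k chunks."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    hits = index.query(vector=emb, top_k=k, include_metadata=True)
    return [m.metadata["text"] for m in hits.matches]

def augmented_search(question: str, model: str = "gpt-4o", k: int = 4) -> str:
    """Augmented search: the same retrieved chunks are passed to the LLM."""
    context = "\n\n".join(document_search(question, k))
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(augmented_search("What are the nature of dues on the company?"))
```

Swapping in Gemini or Claude would change only the generation call, which mirrors what the video does between runs.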
The inherent stochasticity of LLMs, coupled with their reliance on statistical patterns rather than explicit reasoning, can lead to inconsistencies in performance even on seemingly straightforward tasks. Fine-tuning these models on specific domains and incorporating techniques like prompt engineering and knowledge distillation can mitigate this variability, but achieving truly reliable performance remains an ongoing challenge, as you discussed in your post. Given the potential for hallucination in LLMs, how would you design a robust mechanism to identify and rectify factual errors generated by these models when responding to queries based on sensitive medical data? Imagine a scenario where an LLM is tasked with summarizing patient records for a physician; what techniques would you leverage to ensure the accuracy and reliability of the generated summaries in that context?
Chief Technology Architect and Head of Engineering with a Proven Track Record at Standard Chartered Bank, Infosys & HCL | ISB Alumnus | AWS Certified | Published Author | Experienced Software Consultant
When it comes to Knowledge Base Q&A, Retrieval-Augmented Generation (RAG) stands out as a promising LLM architectural design pattern.
Here's why RAG is a game-changer:
- Fresh Information: RAG ensures you have access to the most current and context-specific data sourced from external databases.
- Minimized Hallucination: By grounding the LLM in retrieved context, RAG gives you verifiable facts and citations from a knowledge base, minimizing inaccuracies.
- Cost-Efficiency: RAG lets you build on pre-trained models — proprietary ones like GPT-4 and Gemini or open-weight ones like Llama — without extensive domain-specific training or fine-tuning.
Stay ahead of the curve with RAG's cutting-edge capabilities in data sourcing and analysis. #DataAnalysis #AI #LLM #OpenAI
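To make the "minimized hallucination" point concrete, here is one generic way to ground a prompt in retrieved chunks and require citations. This is a minimal sketch, not tied to any particular framework; the sample chunks and figures are made up.

```python
# Minimal sketch: format retrieved chunks as numbered sources so the model
# must answer from them and cite them. In a real pipeline the chunks come
# from a vector store; here they are placeholders.
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the sources below and cite them like [1], [2]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    "Q2 revenue grew 12% year over year to $4.1B.",          # made-up figures
    "Operating margin was 18%, up from 15% a year earlier.",  # made-up figures
]
print(build_grounded_prompt("How did revenue change in Q2?", chunks))
```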
🌟 What an incredible roadmap! 🌟
This guide is a treasure trove for anyone aspiring to master the intricacies of LLMs. Each milestone, from understanding the architecture to advanced fine-tuning techniques, is a stepping stone towards becoming a proficient LLM Scientist.
Thank you so much for sharing such an invaluable resource. I’m excited to dive deeper and explore each aspect in detail!
#LLM #MachineLearning #AI #DeepLearning #DataScience #RavitJain #TheRavitShow #AICommunity 🤖🚀📚
Founder & Host of "The Ravit Show" | LinkedIn Top Voice | Startups Advisor | Gartner Ambassador | Evangelist | Data & AI Community Builder | Influencer Marketing B2B | Marketing & Media | (Mumbai/San Francisco)
The Ultimate LLM Scientist Roadmap! Definitely everyone has their own journey. What would you like to add?
Key Milestones:
1. The LLM Architecture
Understand the fundamental components of LLMs, including tokenization, attention mechanisms, and text generation
2. Building an Instruction Dataset
Learn how to create and refine datasets to instruct LLMs effectively
3. Pre-Training Models
Delve into the process of pre-training models with a focus on data pipelines, scaling, and high-performance computing
4. Supervised Fine Tuning
Master techniques for fine-tuning models to improve their performance on specific tasks
5. RLHF (Reinforcement Learning from Human Feedback)
Explore methods to optimize models using human feedback
6. Evaluation
Assess model performance with a variety of metrics and benchmarks to ensure high-quality results
7. Quantization
Discover techniques to make models more efficient and faster through quantization
8. Inference Optimization
Optimize LLM inference using advanced techniques like flash attention and speculative decoding
Stay ahead in the AI field by following these structured steps to deepen your understanding and expertise.
Follow The Ravit Show Newsletter with 130k subscribers here — www.theravitshow.com #data #ai #llms #theravitshow
🚀 The Ultimate LLM Scientist Roadmap 🚀
This is a fantastic guide that covers all essential aspects, from understanding the architecture to advanced fine-tuning techniques.
Some thoughts:
1. The LLM Architecture - A great starting point to understand the core components and how they interact.
2. Building an Instruction Dataset - Crucial for creating effective and accurate models.
3. Pre-Training Models - Delving into data pipelines, scaling, and computing is key for robust models.
4. Supervised Fine Tuning - Enhancing model performance through targeted fine-tuning is a game-changer.
5. RLHF (Reinforcement Learning from Human Feedback) - Utilizing human feedback for model optimization is a brilliant approach.
6. Evaluation - Ensuring high-quality results with thorough benchmarking and metrics is essential.
7. Quantization - Making models more efficient and faster through quantization techniques is highly beneficial.
8. Inference Optimization - Optimizing LLM inference through advanced techniques like flash attention and speculative decoding is critical.
Thanks Ravit Jain for sharing this invaluable resource! Looking forward to diving deeper into each milestone and mastering the journey of becoming an LLM Scientist.
#LLM #MachineLearning #AI #DeepLearning #DataScience #RavitJain #TheRavitShow #AICommunity
Unraveling Data Mysteries: The Essence of Unsupervised Learning
Dive into the world of unsupervised learning, where algorithms uncover hidden patterns without predefined labels. Clustering algorithms group similar data points seamlessly, while dimensionality reduction techniques streamline complex datasets. From data exploration to real-world applications, unsupervised learning reshapes how we glean insights from the vast sea of information. Embrace the power of the unknown in the era of artificial intelligence.
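For a concrete, if toy, illustration of both ideas — clustering and dimensionality reduction — here is a short scikit-learn sketch on synthetic data. The dataset and parameter choices are arbitrary.

```python
# Minimal sketch: KMeans groups similar points without labels, and PCA
# projects the 10-dimensional data down to 2-D for inspection.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # project 10-D points to 2-D

print("cluster sizes:", [int((labels == k).sum()) for k in range(4)])
print("2-D shape for plotting:", X_2d.shape)
```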
#UnsupervisedLearning #DataScience #AIInsights
Data Scientist in Tech | Leveraging Data for Insights | Seeking New Challenges | Driving Impact | Python | Machine Learning | Data Analysis | SQL | TensorFlow | NLP
𝗗𝗮𝘆 𝟲𝟲 𝗼𝗳 𝟭𝟬𝟬: 𝗪𝗵𝗮𝘁 𝗔𝗿𝗲 𝘁𝗵𝗲 𝗞𝗲𝘆 𝗗𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀 𝗕𝗲𝘁𝘄𝗲𝗲𝗻 𝗖𝗼𝗻𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻𝗮𝗹 𝗡𝗲𝘂𝗿𝗮𝗹 𝗡𝗲𝘁𝘄𝗼𝗿𝗸𝘀 (𝗖𝗡𝗡𝘀) 𝗮𝗻𝗱 𝗩𝗶𝘀𝗶𝗼𝗻 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀 (𝗩𝗶𝗧𝘀)?
Today, we’ll explore the differences between CNNs and ViTs, two powerful models for image tasks.
𝗖𝗡𝗡 𝘃𝘀. 𝗩𝗶𝗧:
🔶 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲:
1️⃣ CNNs use convolutions to capture local patterns (like edges) and gradually build complexity.
2️⃣ ViTs process images by splitting them into patches, applying self-attention to model global relationships.
🔶 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻:
1️⃣ CNNs detect patterns in small regions of the image.
2️⃣ ViTs focus on the entire image, capturing both local and global information.
🔶 𝗗𝗮𝘁𝗮 𝗥𝗲𝗾𝘂𝗶𝗿𝗲𝗺𝗲𝗻𝘁𝘀:
1️⃣ CNNs work well with smaller datasets.
2️⃣ ViTs require large datasets or pre-training to perform effectively.
🔶 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆:
1️⃣ CNNs are computationally efficient, particularly on smaller images and datasets.
2️⃣ ViTs are more expensive, since self-attention cost grows quadratically with the number of patches (and therefore with image size).
𝗪𝗵𝘆 𝗜𝘁 𝗠𝗮𝘁𝘁𝗲𝗿𝘀:
CNNs are great for tasks with localized patterns, while ViTs excel in capturing global relationships in large datasets.
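To see the contrast in code, here is a tiny PyTorch sketch (layer sizes are arbitrary): a convolution mixes only a local neighbourhood of pixels, while a patch embedding followed by self-attention lets every patch attend to every other patch.

```python
# Minimal PyTorch sketch of the CNN vs ViT contrast; sizes are illustrative.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)            # one RGB image

# CNN view: a convolution looks at local 3x3 neighbourhoods.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
local_features = conv(img)                   # (1, 16, 224, 224)

# ViT view: split the image into 16x16 patches, embed them, then let
# self-attention relate every patch to every other patch (global context).
patch_embed = nn.Conv2d(3, 64, kernel_size=16, stride=16)     # patchify + embed
tokens = patch_embed(img).flatten(2).transpose(1, 2)           # (1, 196, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
global_features, _ = attn(tokens, tokens, tokens)              # (1, 196, 64)

print(local_features.shape, global_features.shape)
```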
𝗪𝗵𝗶𝗰𝗵 𝗺𝗼𝗱𝗲𝗹 𝗱𝗼 𝘆𝗼𝘂 𝗽𝗿𝗲𝗳𝗲𝗿 𝗳𝗼𝗿 𝗶𝗺𝗮𝗴𝗲 𝘁𝗮𝘀𝗸𝘀—𝗖𝗡𝗡𝘀 𝗼𝗿 𝗩𝗶𝗧𝘀? 𝗟𝗲𝘁’𝘀 𝗱𝗶𝘀𝗰𝘂𝘀𝘀 𝗯𝗲𝗹𝗼𝘄!
#visiontransformers #convolutionalneuralnetworks #cnn #vit #machinelearning #deeplearning #computerVision #transformers #neuralnetworks #imageprocessing #ai #modelarchitecture #selfattention #100daysoflearning #artificialintelligence #datalearning #imageanalysis #datascience #visualdata #bigdata
I just stumbled upon this fantastic video about Principal Component Analysis (PCA) on YouTube, and I couldn't resist sharing it with all of you!
🎥 Video Link: Principal Component Analysis (PCA) Explained
https://lnkd.in/eN3Zx9Qj
🔍 Summary:
PCA is a powerful dimensionality reduction technique widely used in data science and machine learning. From the fundamental concepts behind PCA to its practical applications, this video provides an informative, engaging, and comprehensive overview.
Key Takeaways:
1️⃣ PCA allows us to reduce the dimensionality of complex datasets while preserving the most critical information.
2️⃣ By transforming high-dimensional data into a lower-dimensional space, PCA enables more efficient analysis and visualization.
3️⃣ The presenter walks us through the step-by-step process of performing PCA, including eigenvalue decomposition and selecting the optimal number of principal components.
4️⃣ Real-world examples and visualizations help illustrate the concepts discussed, making them easier to grasp and apply in our projects.
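Not from the video, but as a companion sketch, here is the same workflow in scikit-learn: fit PCA, inspect the explained variance, and keep enough components for a chosen threshold. The 95% cutoff and the digits dataset are arbitrary choices.

```python
# Minimal scikit-learn sketch: fit PCA, look at explained variance, and keep
# enough components to cover ~95% of it.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_digits().data)   # 64-dimensional digits

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"components for 95% variance: {n_components}")

X_reduced = PCA(n_components=n_components).fit_transform(X)
print("reduced shape:", X_reduced.shape)
```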
#PCA #DataScience #MachineLearning #DimensionalityReduction #YouTubeTutorial #LearningResources #DataAnalysis #DataVisualization #AI #ML
Co-founder & CEO at VoiceX | ML & LLM Engineer | Nvidia Inception Program Member & Microsoft for Startups | Fractional CTO | Pedigree: Harvard, Databricks, Nvidia.
Temperature visualization in LLMs - explained!🤔🧐🤓
Temperature in LLMs is a parameter that controls the randomness of the generated text. It is typically set between 0 and 1 (some APIs allow values above 1), where 0 is effectively deterministic and 1 is the most creative, non-deterministic setting.
### A low temperature (close to 0) makes the LLM more likely to choose the most probable next word, resulting in more coherent but less varied text.
### A high temperature (close to 1) makes the LLM more likely to choose less probable words, resulting in more diverse but less fluent text.
### Low temperature (close to 0): the probability distribution over next tokens becomes sharply peaked, with almost all of the mass concentrated on one or two high-probability tokens.
### High temperature (close to 1): the distribution flattens, so the model is more willing to pick something other than the most obvious next word — this is the temperature setting you see exposed when interacting with large language models.
Therefore, 1 is considered a creative temperature, but it may also produce nonsensical or irrelevant text. The optimal temperature depends on the task, the data, and the model architecture.
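A small NumPy sketch of the mechanism: the next-token logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it (a temperature of exactly 0 is handled as greedy argmax in practice). The logits below are made-up numbers.

```python
# Minimal sketch of temperature scaling of next-token probabilities.
import numpy as np

def softmax_with_temperature(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                        # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.5, 1.0, 0.5]           # hypothetical scores for 4 candidate tokens
for t in (0.2, 1.0):
    print(f"T={t}:", np.round(softmax_with_temperature(logits, t), 3))
# Low T concentrates probability on the top token; T=1 leaves the softmax unchanged.
```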
#llms #llm #genai #ai
👉 A question that often comes up nowadays when talking to customers is “Why couldn’t we just stuff all our documents into the context?”
👉 Technically you could, if you don’t care about cost or inference latency, but even if that isn’t a factor, the more you stuff the context with distractors, the less likely it is that the LLM will answer with the relevant facts.
#LLM contexts keep getting longer, but are they reliable?.......A study reveals even #GPT4 gets only 40-80% accuracy on long contexts, sometimes 0% 😳
Models are increasingly released with massive context lengths in the millions of tokens; however, evaluation of these long contexts still focuses on perplexity and synthetic needle-in-a-haystack tasks, which don't reflect real-world usage or the need to understand the entire context provided.
An academic team developed a new benchmark, LongICLBench, to evaluate #LLMs' understanding of entire long contexts and found that GPT4, #Gemini 1.0 and #opensource models all struggle.
Method:
🔹 Evaluated 13 long-context LLMs incl. GPT4 and Gemini 1.0 (no Claude 3 😢)
🔹 Used 6 classification datasets (e.g. relationships, emotions, entities)
🔹 Provide prompts with 2-50k tokens, including as many as 174 classes
🔹 Require LLM to classify labels requiring comprehension of entire context
Results:
🔸 GPT4 performed best, but only ~50% and ~0.5% on two datasets
🔸 GPT4 often performed better at longer context lengths
🔸 Gemini 1.0 Pro was sometimes outperformed by open models
🔸 Models exhibit bias towards classes towards end of context
Paper 👉 https://lnkd.in/gX5g9q-q
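For readers who want to poke at this themselves, here is a heavily simplified sketch in the spirit of the benchmark described above (not the paper's code): classification accuracy is measured as more in-context demonstrations — and therefore more context — are packed into the prompt. The label set, demo data, and model are placeholders.

```python
# Hedged sketch of a long-context in-context-classification check.
from openai import OpenAI

client = OpenAI()                              # reads OPENAI_API_KEY from env
LABELS = ["positive", "negative", "neutral"]   # hypothetical label set

def classify(demos: list[tuple[str, str]], text: str, model: str = "gpt-4o") -> str:
    demo_block = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in demos)
    prompt = (
        f"Classify the final text into one of {LABELS}.\n\n"
        f"{demo_block}\n\nText: {text}\nLabel:"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip().lower()

def accuracy(demos, test_set) -> float:
    return sum(classify(demos, t) == y for t, y in test_set) / len(test_set)

demos = [("great product", "positive"), ("terrible service", "negative"),
         ("it arrived on tuesday", "neutral")]
test_set = [("really enjoyed it", "positive"), ("waste of money", "negative")]

# A real run would sweep demo counts (e.g. 10, 100, 1000) and plot accuracy
# against total prompt tokens to see where long-context understanding degrades.
print("accuracy with 3 demos:", accuracy(demos, test_set))
```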
I would love to read a similar paper that compares top-tier models like GPT-4-Turbo, Claude and Mistral-large, etc.
Also, the degradation in results from the Grouped Distribution was really surprising!