🚀 Big news! We launched our blog this week! 👩‍🚀 Our first post shows how to use trufflepig to create a custom semantic search engine for your unstructured data. We demonstrate this with our Better GitHub Search app, built on the trufflepig API, which makes GitHub's popular repositories searchable via natural language. 😁 You can apply the same principles to index and search any unstructured data you have, making it easier for your AI models to quickly find the relevant information. 👍 The project is open source. Check it out, share your thoughts in the comments, and follow trufflepig to boost your AI projects with the power of retrieval! ✍ Blog: https://lnkd.in/gSby5AZ8
About us
trufflepig is a retrieval API that developers can use to build performant, accurate RAG (retrieval augmented generation) systems for generative AI applications.
- Industry: Technology, Information and Internet
- Company size: 2-10 employees
- Type: Privately Held
Updates
-
🛠️ 𝗣𝗿𝗼𝗱𝘂𝗰𝘁 𝗨𝗽𝗱𝗮𝘁𝗲: 𝗪𝗲 𝗹𝗮𝘂𝗻𝗰𝗵𝗲𝗱 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴! 🛠️ Vector search compares vector embeddings to find what is semantically similar to your query. However, when your search results must meet specific criteria, vector search alone is unreliable, particularly with large datasets. 💡 𝗘𝗻𝘁𝗲𝗿 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 💡 By attaching attributes like timestamps, authors, and tags to documents, you can filter on that metadata to refine search queries and improve search quality. With trufflepig, you can now attach metadata when uploading data. We also wrote a blog post about it, which features a practical example of how metadata filtering can boost your search results in RAG. 🍇 𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝘄𝗶𝘁𝗵 𝘁𝗿𝘂𝗳𝗳𝗹𝗲𝗽𝗶𝗴? 🍇 1️⃣ Prepare your data. Make sure the documents you upload to trufflepig carry the metadata fields your use case needs, such as title, topic, year, and author. This helps in organizing and refining search results more effectively. 2️⃣ Define metadata filters in your query. When creating a search request, specify filters based on the metadata fields. For instance, you can limit searches to specific topics or dates to get more precise results. For more about trufflepig’s metadata filtering and its application, check out our latest blog post: https://lnkd.in/g5w3fEey 🐷 Follow trufflepig for product and AI / RAG tips! 🐷 😁 Leave us a comment with your thoughts!
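To make the two steps above concrete, here is a minimal sketch in plain Python. It is not the trufflepig API: the `documents` list, the made-up two-dimensional embeddings, and the `search` helper are all assumptions for illustration. It shows an exact-match metadata filter applied before ranking by cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 1: documents carry metadata alongside their embeddings.
# The embeddings are invented; real ones would come from an embedding model.
documents = [
    {"id": "a", "embedding": [0.9, 0.1], "metadata": {"topic": "rag", "year": 2024}},
    {"id": "b", "embedding": [0.8, 0.2], "metadata": {"topic": "rag", "year": 2021}},
    {"id": "c", "embedding": [0.95, 0.05], "metadata": {"topic": "databases", "year": 2024}},
]

def search(query_embedding, filters, top_k=3):
    """Step 2: keep only documents matching every filter, then rank by similarity."""
    candidates = [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["id"] for doc in ranked[:top_k]]

unfiltered = search([1.0, 0.0], {})
filtered = search([1.0, 0.0], {"topic": "rag", "year": 2024})
```

In this toy data, document `c` is the closest vector to the query but belongs to a different topic, so it only disappears from the results once the filter is applied.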
-
𝗟𝗼𝗼𝗸𝗶𝗻𝗴 𝘁𝗼 𝗶𝗺𝗽𝗿𝗼𝘃𝗲 𝘆𝗼𝘂𝗿 𝗥𝗔𝗚 𝘀𝘆𝘀𝘁𝗲𝗺'𝘀 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲? 𝗧𝗿𝘆 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 𝗳𝗼𝗿 𝗮𝗻 𝗶𝗻𝘀𝘁𝗮𝗻𝘁 𝗯𝗼𝗼𝘀𝘁! Metadata is data about data, like titles, descriptions, and keywords. Search engines like Google use metadata to understand 💡 webpage content without reading the entire text, which improves search 🎯 accuracy. In RAG, you can tag documents with metadata during pre-processing to make them identifiable. For example, tagging books with author, title, version, and genre metadata makes these items searchable 🔎 and easier to identify within large datasets. 𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗶𝗻 𝗥𝗔𝗚? Metadata filtering enhances search by applying filters. For example, let’s say you’re building a job listing website where jobs are constantly posted with unstructured descriptions. These job listings all have metadata, including location, pay, and employment duration, that you can filter on to isolate the jobs relevant to the user’s query. 🤗 A user searching for jobs in Los Angeles that pay at least $35 per hour can find listings more accurately using metadata filtering than with semantic search alone. 𝗦𝗼, 𝘄𝗵𝗮𝘁 𝗮𝗿𝗲 𝘁𝗵𝗲 𝗯𝗲𝗻𝗲𝗳𝗶𝘁𝘀 𝗼𝗳 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴? ✅ Enhanced precision by narrowing down search results to meet specific metadata criteria. ✅ Faster retrieval of relevant documents by reducing the surface area of the query, especially in large datasets. 𝗧𝗶𝗽𝘀 𝘄𝗵𝗲𝗻 𝘂𝘀𝗶𝗻𝗴 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴! 1️⃣ Choose relevant metadata. Your filtering is only as good as your tagging, so select metadata attributes 🐸 that provide the most value for your specific use case. For technical documentation, relevant metadata might include document type, programming language, and last updated date. 2️⃣ Experiment with automating metadata tagging. Use libraries like spaCy and NLTK to automatically extract metadata fields. 3️⃣ Store metadata efficiently. Use trufflepig to tag metadata upon upload and rely on our managed service for scalability. 
😁 To make building RAG easier, we added metadata filtering as a feature this week. With trufflepig, you can leverage filtering for greater customization + better data organization, so try it out with our updated docs. 👀 Check out trufflepig’s metadata filtering: (link in first comment) 🤩 Follow trufflepig for more RAG tips and updates! Thanks for the support!
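As a rough illustration of the job listing scenario and of automated tagging, the sketch below extracts location and hourly pay from free-text descriptions and then filters on them. It uses simple regular expressions as a stand-in for spaCy or NLTK, and the sample listings, `extract_job_metadata`, and `filter_jobs` are hypothetical names invented for this example.

```python
import re

def extract_job_metadata(description):
    """Pull hourly pay and location out of free text with simple regex rules."""
    pay = re.search(
        r"\$(\d+(?:\.\d+)?)\s*(?:per hour|/\s*hour|/\s*hr)",
        description,
        re.IGNORECASE,
    )
    # Assumes the location is a run of capitalized words following "in".
    location = re.search(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*)", description)
    return {
        "hourly_pay": float(pay.group(1)) if pay else None,
        "location": location.group(1) if location else None,
    }

listings = [
    "Line cook needed in Los Angeles, $38 per hour, evenings.",
    "Barista wanted in Los Angeles. Pay is $22/hr plus tips.",
    "Warehouse associate in San Diego, $40 per hour.",
]

# Tag each listing once, at ingestion time.
tagged = [{"text": text, "metadata": extract_job_metadata(text)} for text in listings]

def filter_jobs(jobs, location, min_hourly_pay):
    """Keep jobs in the given location that pay at least min_hourly_pay."""
    return [
        job for job in jobs
        if job["metadata"]["location"] == location
        and (job["metadata"]["hourly_pay"] or 0) >= min_hourly_pay
    ]

matches = filter_jobs(tagged, "Los Angeles", 35)
```

Regex rules like these are brittle on real listings; the point is only that tags extracted once at ingestion can drive exact filters at query time.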
-
Wondering how to logically partition unstructured data for RAG? What do you group together vs. separately? 1️⃣ 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝘆𝗼𝘂𝗿 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲. When building a retrieval system for unstructured datasets in a RAG application, it’s crucial to optimize your vector store for relevant and performant search. 2️⃣ 𝗖𝗼𝗻𝘀𝗶𝗱𝗲𝗿 𝘁𝗵𝗲 𝗰𝗼𝗺𝗽𝗮𝗿𝗶𝘀𝗼𝗻𝘀 𝗯𝗲𝗶𝗻𝗴 𝗺𝗮𝗱𝗲. Semantic search begins by sorting documents using a distance metric (e.g., cosine similarity) against the query. Think about the type of comparisons you need. Let’s use an example scenario about product recommendations. Suppose you have a product catalog with descriptions and user reviews. A two-stage search might be best: First stage – retrieve relevant products using product descriptions. Second stage – generate recommendations based on user reviews of the relevant products. For a query like “comfortable pants for travel,” first match the query to an index of product descriptions to find relevant products 👖 (pants, trousers, leggings, etc.). Next, filter these results to search user reviews for qualitative feedback. In this example, we are making certain assumptions about the data. 👐 Product descriptions and user reviews are likely quite different in structure and semantic similarity. 👌 Product descriptions are more objective, matter-of-fact representations of the product. 🤝 Reviews are more opinionated, qualitative representations of the product. 3️⃣ 𝗔𝘃𝗼𝗶𝗱 𝗻𝗮𝗶𝘃𝗲 𝘃𝗲𝗰𝘁𝗼𝗿 𝘀𝗲𝗮𝗿𝗰𝗵. Naive vector search over combined descriptions and reviews can yield noisy, irrelevant results. Instead, create separate indexes for different data types. 4️⃣ 𝗟𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝗺𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗳𝗶𝗹𝘁𝗲𝗿𝗶𝗻𝗴. Use metadata attributes to link documents across indexes when possible. This improves search relevance (and often performance) while maintaining relationships between related documents. By strategically grouping your data and using metadata filtering, you can enhance the relevance and performance of your RAG system. Follow trufflepig 🐷 for more RAG tips and product updates! 
Try trufflepig: https://lnkd.in/gW_-zD-q
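The two-stage product search described above could look roughly like this. Everything here is an assumption for illustration: the two in-memory indexes, the `product_id` metadata field that links them, and a toy bag-of-words `embed` function standing in for a real embedding model.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Separate indexes for the two data types, linked by product_id metadata.
description_index = [
    {"product_id": "p1", "text": "stretch travel pants with zip pockets"},
    {"product_id": "p2", "text": "slim fit denim jeans"},
    {"product_id": "p3", "text": "stainless steel travel mug"},
]
review_index = [
    {"product_id": "p1", "text": "so comfortable on long flights"},
    {"product_id": "p1", "text": "zipper broke after a month"},
    {"product_id": "p3", "text": "comfortable handle and keeps coffee hot"},
]

def top_k(index, query, k):
    """Rank index entries by similarity to the query and keep the best k."""
    query_vector = embed(query)
    return sorted(index, key=lambda doc: cosine(query_vector, embed(doc["text"])), reverse=True)[:k]

def two_stage_search(query, k_products=1, k_reviews=1):
    # Stage 1: find relevant products from descriptions only.
    products = top_k(description_index, query, k_products)
    product_ids = {p["product_id"] for p in products}
    # Stage 2: metadata filter on product_id, then search reviews of those products.
    linked_reviews = [r for r in review_index if r["product_id"] in product_ids]
    return products, top_k(linked_reviews, query, k_reviews)

products, reviews = two_stage_search("comfortable pants for travel")
```

Note that the mug review also mentions "comfortable", so a single combined index could surface it for this query; the `product_id` filter in stage 2 keeps it out.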
-
Interested in perfecting chunking for RAG? 🙅‍♂️ Stop relying on fixed chunk sizes with overlap. Using fixed chunk sizes might seem convenient with frameworks like LlamaIndex or LangChain, but it can disrupt your data’s structure and lead to subpar embeddings. Fixed chunks don’t account for natural data boundaries, causing arbitrary cutoffs. 👉 Introducing overlap can help maintain continuity across those cutoffs. But combining fixed chunk sizes with overlap can also introduce noise and redundancy, making your LLM produce worse responses. Simply setting ‘chunk_size = 512’ and ‘chunk_overlap = 50’ is often insufficient. This approach can waste tokens and reduce efficiency, creating bottlenecks in your pipeline. 🤔 Many developers now opt for a more context-aware approach, like adaptive chunking, which adjusts the chunk size dynamically based on semantic boundaries. For example, you can use NLP tools like spaCy or NLTK to detect sentence or paragraph boundaries and produce chunks that are semantically meaningful. However, these approaches can introduce complexity and require careful tuning to avoid data fragmentation. 🐷 With trufflepig, we intelligently index your data, preserving its original structure. Our advanced chunking strategy reduces noise and improves your RAG performance. 👍 Follow trufflepig for more AI + RAG content and let us know what you think in the comments! 🤗 Try trufflepig: https://lnkd.in/gDSJveW5
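To see the difference between the two approaches, here is a small sketch comparing fixed-size chunks with overlap against sentence-aware chunks. It is not trufflepig's chunking strategy: `fixed_chunks` and `sentence_chunks` are hypothetical helpers, sizes are counted in characters rather than tokens, and a simple regex stands in for a spaCy or NLTK sentence splitter.

```python
import re

def fixed_chunks(text, chunk_size, chunk_overlap):
    """Split by character count with overlap, ignoring sentence boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def sentence_chunks(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Vector search finds similar text. "
    "Metadata narrows the candidates. "
    "Rerankers reorder the final list."
)

fixed = fixed_chunks(text, chunk_size=40, chunk_overlap=10)
by_sentence = sentence_chunks(text, max_chars=70)
```

Here the first fixed chunk stops in the middle of the word "Metadata", while every sentence-aware chunk ends on a sentence boundary. A single sentence longer than `max_chars` would still become one oversized chunk, which is part of the tuning the post mentions.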
-
Considering building an AI app? Start with a solid business use case. We've received many questions about starting a RAG project, so here are our thoughts! 🤔 Focus on customers’ needs to drive what you build. Before diving into AI or RAG projects, start with clear business use cases, because many teams rush to apply AI without a true business need. Focus on your customers: What do they need? Is there a reason to apply an LLM to this problem? What data will you use to solve this problem? 🤝 Ready to build an AI app? Here’s what to do next. Develop detailed scenarios outlining the problem, the proposed RAG solution, and the expected impact to ensure alignment with customer and business goals. Start with user stories that highlight the most critical problems. Share your draft with customers to gain buy-in, which reduces the risk of solving non-existent problems and helps ensure support for your vision. Building intelligent systems on unstructured data can offer significant ROI, but it varies for each use case and customer. Once you have a clear use case, gather high-quality, relevant datasets. 💪 Start with a strong infrastructure. Next, focus on shaping your RAG stack. There’s a lot to consider here, from choosing retrieval methods to selecting a model to perform inference. The ideal retrieval method may vary depending on your use case. If it’s a chat application, you might have to chunk and index your data in a vector store to perform semantic search based on user queries. For other use cases, a simple full-text search might suffice. We suggest investing some time in an evaluation framework for your retrieval step, using metrics such as mean reciprocal rank (MRR) and hit rate. This helps you quickly assess where your retrieval can be improved and whether iterations on your retrieval pipeline are having the intended effect on those metrics. Finally, evaluate different LLMs to decide which is best suited for your application. 
Each model comes with tradeoffs in inference cost, speed, correctness, and overhead such as fine-tuning. 🐷 Get your infrastructure right with trufflepig. trufflepig helps when building a RAG application by providing out-of-the-box retrieval infrastructure so that you can focus on your application layer. This eliminates the need for extensive tinkering with chunking, storage, or other pre-processing steps, which is a burden with other frameworks. trufflepig lets you deliver your proofs of concept faster. 👏 Embrace continuous feedback. Your first attempt probably won’t be your last. Metrics are helpful but not definitive. Test the app yourself and then have your customer break it. Perfection is hard to achieve in isolation, and iterating from first principles is incredibly powerful. 👍 Follow trufflepig for more AI content, and share your building experience in the comments. How do you approach your projects? Try trufflepig: (link in comments)
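For the retrieval evaluation step mentioned above, a minimal sketch of hit rate and mean reciprocal rank (MRR) might look like this. It assumes one known relevant document per query; the `results` and `relevant` dictionaries are made-up evaluation data, not output from any real system.

```python
def hit_rate(results, relevant, k):
    """Fraction of queries whose relevant document appears in the top k results."""
    hits = sum(1 for query, ranked in results.items() if relevant[query] in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for query, ranked in results.items():
        if relevant[query] in ranked:
            total += 1.0 / (ranked.index(relevant[query]) + 1)
    return total / len(results)

# Ranked document ids returned by a retriever for each test query (invented).
results = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5", "d6"],
    "q3": ["d7", "d8", "d9"],
}
# The single document each query should retrieve (invented).
relevant = {"q1": "d1", "q2": "d5", "q3": "d0"}

mrr = mean_reciprocal_rank(results, relevant)
hits_at_1 = hit_rate(results, relevant, k=1)
hits_at_3 = hit_rate(results, relevant, k=3)
```

With this data the relevant document is ranked first for `q1`, second for `q2`, and missing for `q3`, so MRR is (1 + 0.5 + 0) / 3 = 0.5 and hit rate rises from 1/3 at k=1 to 2/3 at k=3. Re-running the same queries after each pipeline change shows whether the change helped.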
-
🌟 Our blog is live! Dive into our first post about Better GitHub Search. A few weeks ago, we released Better GitHub Search, an app built on the trufflepig API that lets you search GitHub repos using natural language. We can't thank everyone enough for the response; 100+ users in the first hour. 👏 𝗦𝗼, 𝘄𝗲 𝗼𝗽𝗲𝗻-𝘀𝗼𝘂𝗿𝗰𝗲𝗱 𝘁𝗵𝗲 𝗽𝗿𝗼𝗷𝗲𝗰𝘁 𝗮𝗻𝗱 𝘄𝗿𝗼𝘁𝗲 𝗮 𝗯𝗹𝗼𝗴 𝗼𝗻 𝗰𝗿𝗲𝗮𝘁𝗶𝗻𝗴 𝗮 𝗰𝘂𝘀𝘁𝗼𝗺 𝘀𝗲𝗮𝗿𝗰𝗵 𝗲𝗻𝗴𝗶𝗻𝗲 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮. In this blog, you’ll learn how to: 1️⃣ Index large amounts of data with trufflepig 2️⃣ Use LLMs to enhance content, like using Haiku to improve README files 3️⃣ Create a search engine 😁 Check out the blog and share your thoughts in the comments. Read the blog here: https://lnkd.in/gUw_RqMc Follow trufflepig for more updates because we have exciting things coming this week 🐷
-
🤔 Struggling to get all the results you want from your RAG app? Here’s how to fix it! When you have a large collection of documents, getting all the results you need can be challenging. Here’s a common scenario: 1️⃣ You have an index with thousands of documents on similar topics. 2️⃣ You query a specific topic that is mentioned in several documents, but only a few focus extensively on it. 3️⃣ Chunks from the few focused documents dominate your results, even though you want to see relevant information from all the documents. 𝗪𝗲 𝗿𝗲𝗰𝗼𝗺𝗺𝗲𝗻𝗱 𝘀𝗲𝘁𝘁𝗶𝗻𝗴 𝗮 𝗵𝗶𝗴𝗵 𝗞 𝘃𝗮𝗹𝘂𝗲 𝘁𝗼 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗲 𝗯𝗿𝗼𝗮𝗱𝗲𝗿 𝗿𝗲𝘀𝘂𝗹𝘁𝘀. However, this approach can introduce noise. To refine your results, you can use a reranker to cut off the low-ranking results and increase precision. This is how you would do it from scratch, which requires experimentation to find what works best. Keep in mind that this will increase latency. With trufflepig, you don't need to compromise on latency to get the results you need. 👋 Have you faced similar challenges? Share your thoughts in the comments and follow trufflepig for more AI content 🐷 Try trufflepig: (https://lnkd.in/gdW98A38)
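A from-scratch version of the "high K, then rerank" approach could be sketched as follows. The chunk data and vector scores are invented, and `rerank_score` is a toy term-overlap scorer standing in for a real reranking model; none of this reflects how trufflepig works internally.

```python
def retrieve(chunks, k):
    """First stage: take the top-k chunks by their precomputed vector score."""
    return sorted(chunks, key=lambda chunk: chunk["vector_score"], reverse=True)[:k]

def rerank_score(query, text):
    """Toy reranker: fraction of query terms that appear in the chunk."""
    terms = set(query.lower().split())
    return len(terms & set(text.lower().split())) / len(terms)

def search(query, chunks, k, min_score=0.5):
    """Retrieve broadly with a high k, then rerank and cut off weak results."""
    candidates = retrieve(chunks, k)
    scored = [(rerank_score(query, chunk["text"]), chunk) for chunk in candidates]
    kept = [(score, chunk) for score, chunk in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk["doc_id"] for _, chunk in kept]

chunks = [
    {"doc_id": "doc1", "text": "rate limiting rules for the api", "vector_score": 0.92},
    {"doc_id": "doc1", "text": "rate limiting headers explained", "vector_score": 0.90},
    {"doc_id": "doc1", "text": "configuring rate limiting per user", "vector_score": 0.88},
    {"doc_id": "doc2", "text": "billing overview and invoices", "vector_score": 0.70},
    {"doc_id": "doc3", "text": "deployment notes mention rate limiting briefly", "vector_score": 0.65},
]

narrow = search("rate limiting", chunks, k=3)
broad = search("rate limiting", chunks, k=5)
```

With k=3, every result comes from `doc1`, the document that focuses on the topic. Raising k to 5 brings in the brief mention in `doc3`, and the rerank cutoff drops the unrelated `doc2` chunk. The second scoring pass is where the extra latency comes from.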