AtScale Open-Sources Semantic Modeling Language (SML): Transforming Analytics with an Industry-Standard Framework for Interoperability, Reusability, and Multidimensional Data Modeling Across Platforms

AtScale has made a significant move by announcing the open-source release of its Semantic Modeling Language (SML). The initiative aims to provide an industry-standard semantic modeling language that can be adopted across platforms, fostering greater collaboration and interoperability in the analytics community. The introduction of SML marks a major step in the company's decade-long effort to democratize data analytics and advance semantic layer technology. SML offers several benefits to developers and organizations:

🔰 Object-Oriented Structure: SML is object-oriented, so semantic objects can be reused across different models, promoting consistency and efficiency in model building.
🔰 Comprehensive Scope: SML is a superset of existing semantic modeling languages, incorporating more than a decade of experience and use cases across different verticals. This makes it versatile enough to cater to a wide range of applications.
🔰 Familiar Syntax: SML is built on YAML, a widely adopted, human-readable syntax, so developers can adopt the language without a steep learning curve.
🔰 CI/CD Friendly: Being code-based, SML integrates well with modern software development practices, including Git for version control, and supports continuous integration and continuous deployment (CI/CD) workflows.
🔰 Extensibility and Open Access: SML is released under an Apache open-source license, which means it is free to use and can be extended by the community. This open nature allows for innovation and collaboration, ensuring the language evolves to meet new demands.
Read our full take on this: https://lnkd.in/geYKRkcA GitHub: https://lnkd.in/gabaJSDx Details: https://lnkd.in/gRNeAVic AtScale David P. Mariani Bri Pedersen Rouzbeh Safaie Jeffrey Curran #opensource #ai
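To give a sense of what a YAML-based semantic model might look like, here is a small, hypothetical SML-style model file. The field names below are illustrative only, not taken from the official spec; the GitHub repo linked above has the authoritative object reference.

```yaml
# Hypothetical SML-style model sketch.
# Field names are illustrative, not the official SML schema.
unique_name: internet_sales
object_type: model
label: Internet Sales
relationships:
  - from: fact_internet_sales
    to: dim_customer
metrics:
  - unique_name: total_sales_amount
    label: Total Sales
    calculation_method: sum
```

Because the model is plain YAML text, it can be diffed, code-reviewed, and versioned in Git like any other source file, which is what makes the CI/CD claim above practical.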
Asif Razzaq’s Post
More Relevant Posts
-
Idefics3-8B-Llama3 Released: An Open Multimodal Model that Accepts Arbitrary Sequences of Image and Text Inputs and Produces Text Outputs

Machine learning models that integrate text and images have become pivotal to advancing capabilities across applications. These multimodal models process and understand combined textual and visual data, which enhances tasks such as answering questions about images, generating descriptions, or creating content based on multiple images. They are crucial for improving document comprehension and visual reasoning, especially in complex scenarios involving diverse data formats.

The core challenge in multimodal document processing is handling and integrating large volumes of text and image data to deliver accurate and efficient results. Traditional models often struggle with latency and accuracy when managing these complex data types simultaneously, which can lead to suboptimal performance in real-time applications where quick, precise responses are essential. Existing techniques for processing multimodal inputs generally analyze text and images separately and then fuse the results. These methods can be resource-intensive and do not always yield the best outcomes, given the intricate nature of combining different data forms. Systems such as Apache Kafka and Apache Flink are used for managing data streams, but they often require extensive resources and can become unwieldy for large-scale applications.

To overcome these limitations, Hugging Face researchers have developed Idefics3-8B-Llama3, a cutting-edge multimodal model designed for enhanced document question answering. The model integrates the SigLIP vision backbone with the Llama 3.1 text backbone, supporting text and image inputs with up to 10,000 context tokens.
The model, licensed under Apache 2.0, represents a significant advance over previous versions, combining improved document QA capabilities with a robust multimodal approach. Idefics3-8B-Llama3 uses a novel architecture that merges textual and visual information to generate accurate text outputs. Its 8.5 billion parameters enable it to handle diverse inputs, including complex documents that mix text and images. The enhancements include better handling of visual tokens (images are encoded into 169 visual tokens) and extended fine-tuning datasets such as Docmatix, refining document understanding and improving overall performance on multimodal tasks.

Performance evaluations show that Idefics3-8B-Llama3 marks a substantial improvement over its predecessors. The model achieves a remarkable 87.7% accuracy on DocVQA and a 55.9% score on MMStar, compared to Idefics2's 49.5% on DocVQA and 45.2% on MMMU. These results indicate significant gains in handling document-based queries and visual reasoning. The new model's ability to manage up to 10,000 tokens of context and its integration ...
-
7 November: AI Builders Meetup for nerds who like to go deep and technical. Every first Thursday of the month at StartDock Amsterdam. Featured speakers:

🗣 Unstract (remote talk + Q&A)
Unstract provides LLM Whisperer, one of the world's best PDF parsing APIs, capable of extracting handwritten documents, checkboxes, images, tables, and more exactly as they are. With this clean unstructured data at hand, why not explore their flagship product, Unstract? With this no-code tool for LLM structured extraction, building data processing ETL pipelines has never been easier.

🗣 July Jagt
The story of Social Technology Lab, founded by four close friends united by a shared belief. Having known each other for years, they recognize AI's immense potential but feel that use cases with true positive impact remain underexplored. Their mission is to change this by creating a lab to experiment with cutting-edge tech for the public good. They aim to inspire fellow builders by sharing their projects, revealing their journey, and showcasing their future direction. https://lnkd.in/eg5eyteP

🗣 Kaspar Rothenfusser
Providing LLMs with effective ways to navigate large dataspaces of real-world objects presents numerous challenges. Semantic embeddings struggle to capture non-semantic relations of real-world objects and obscure their true meaning through learned (unexplainable) dimensions. Knowledge graphs, while useful, are difficult to scale, maintain, and populate, inherently limiting flexibility. Thoughtful-Oasis aims to combine the best of all worlds in its LLM-first Data Navigation Framework. By using domain-specific embedding vectors with explainable dimensions, both LLMs and humans can navigate millions of real-world objects through parameters relevant to the domain at hand. https://lnkd.in/e8SYtHDw

🗣 Roger Ter Heide
ImproVive will demonstrate their technology for creating lifelike virtual characters.
They use game development software, language models, and information retrieval to build interactive digital humans for training and storytelling purposes. Their system combines voice recognition, text-to-speech, and 3D character models to enable natural conversations. The presentation will explain how their platform generates dynamic dialogue, recognizes the user's position in virtual environments, and allows for quick adjustments without coding. If you're interested in using interactive digital characters for applications like info displays, education, or healthcare, you should not miss this one! https://improvive.com/

TICKETS (be fast, we usually sell out) 🎫 https://lnkd.in/edTYWfUg
AI Builders Monthly Amsterdam · Luma
lu.ma
-
Vector database

A vector database stores information as vectors, which are numerical representations of data objects, also known as vector embeddings. It leverages these embeddings to index and search across massive datasets of unstructured and semi-structured data, such as images, text, or sensor data. Because of these search capabilities, a vector database is also known as a vector search engine.

Vector databases have existed for several years, but they gained popularity with the emergence of Generative AI, and even more so in late 2022 with the announcement of ChatGPT. Unlike traditional relational or NoSQL databases, which return exact answers to queries, vector similarity search enables users to find semantically similar texts or images, also known as k-nearest-neighbor (k-NN) search. Vector embeddings make it possible to find and retrieve similar objects from a vector database by searching for objects that are close to each other in vector space, which is called vector search, similarity search, or semantic search.

Use cases of vector databases
1) NLP
2) E-commerce product recommendations
3) Enhancing modern Generative AI applications built on LLMs
4) AI/ML applications
5) Image and video recognition and retrieval applications
6) Fraud detection
7) Autonomous vehicles
8) Biometrics
9) Medical diagnostics

Differences between traditional and vector databases
1) Relational databases are designed for storing structured data in columns, while vector databases store unstructured data (like text, images, or audio) along with their vector embeddings
2) In relational databases, data is retrieved by keyword matches in traditional search, while vector databases enable efficient semantic search
3) Relational databases maintain structured relationships between datasets through predefined schemas, ensuring data integrity. Vector databases, on the other hand, perform well with high-dimensional and diverse datasets, using specialized indexing techniques for efficient processing
4) Relational databases prioritize consistency and SQL capabilities, while vector databases represent complex relationships dynamically

Commonly used methods for measuring similarity between vector embeddings
1) Cosine similarity: ranges from -1 to 1. The closer the cosine similarity of two vectors is to 1, the greater the similarity
2) Euclidean distance: ranges from zero to infinity. A smaller value means greater similarity
3) Dot product: ranges from -infinity to infinity. A positive value indicates similarity between vectors, while a negative value indicates dissimilarity

Types of vector databases
1) Dedicated vector databases: Chroma, Qdrant, Pinecone, Milvus, etc.
2) Vector-capable databases: MongoDB, Redis, Neo4j, Elasticsearch, PostgreSQL

#Vector #Database #embeddings
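The three similarity measures listed above are easy to compute directly. A minimal sketch in plain Python:

```python
import math

def cosine_similarity(a, b):
    # Ranges from -1 to 1; closer to 1 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Ranges from 0 to infinity; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    # Positive means similar direction; negative means dissimilar.
    return sum(x * y for x, y in zip(a, b))

v1, v2 = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine_similarity(v1, v2))   # parallel vectors: very close to 1.0
print(euclidean_distance(v1, v2))
print(dot_product(v1, v2))         # 2 + 8 + 18 = 28.0
```

Note that the three measures are not interchangeable: cosine similarity ignores vector magnitude, which is why the two parallel vectors above score as identical even though their lengths differ.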
-
🔊 Llamaindex Query Pipelines: Quickstart Guide to the Declarative API

Query Pipelines is a new declarative API for orchestrating simple-to-advanced workflows within LlamaIndex to query your data. Other frameworks have built similar approaches: easier ways to create LLM workflows over your data, such as RAG systems, querying unstructured data, or structured data extraction.

🛠 It's based on the QueryPipeline abstraction. You can load in a wide variety of modules (from LLMs to prompts to retrievers to query engines to other pipelines), connect them into a sequential chain or DAG, and run it end-to-end.

🚀 You can compose both sequential chains and directed acyclic graphs (DAGs) of arbitrary complexity in a more concise style. Instead of building them imperatively with LlamaIndex modules, Query Pipelines gives you a more efficient way to do it, with fewer lines of code.

📰 In this Medium article, I cover some use cases, implementing them with Query Pipelines. The examples are clear and intended to be as simple as possible. In the comments, you can find the link to the ⌨ notebook containing the code to build Query Pipelines for:
✅ Simple Chain: Prompt Query + LLM
✅ Query Rewriting Workflow with Retrieval
✅ Simple RAG pipeline
✅ RAG pipeline with Query Rewriting
✅ RAG pipeline with Sentence Window Retrieval
✅ RAG pipeline with Auto Merging Retrieval

Some of them have been extracted from the LlamaIndex documentation, but others have been designed for this post. I hope you find it useful and try out this new capability!! 🌈 Follow me for future updates and to support more content like this. https://lnkd.in/dk6svsdt

#llm #generativeai #llamaindex #nlp #rag
Llamaindex Query Pipelines: Quickstart Guide to the Declarative Query API
pub.towardsai.net
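The core idea of the declarative style (declare modules, connect them into a chain, run end-to-end) can be sketched in a few lines of plain Python. This toy `Pipeline` class is purely illustrative and is not the real LlamaIndex `QueryPipeline` API; see the article above for the actual usage.

```python
class Pipeline:
    """Toy declarative chain: each module is a callable; output feeds the next."""

    def __init__(self, chain):
        self.chain = chain

    def run(self, value):
        # Run the declared modules in sequence, end-to-end.
        for module in self.chain:
            value = module(value)
        return value

# Hypothetical modules standing in for a prompt template and an LLM call.
prompt = lambda topic: f"Write a haiku about {topic}."
fake_llm = lambda prompt_text: f"[LLM response to: {prompt_text}]"

pipe = Pipeline(chain=[prompt, fake_llm])
print(pipe.run("autumn"))
```

The appeal of the declarative form is that the wiring lives in one place (the `chain` argument) instead of being scattered across imperative glue code, and the same idea generalizes from chains to DAGs.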
-
Just built an AI agent using RAG for super accurate data retrieval! #AI #DataScience #TechInnovation - 📂 Cloned the repository from GitHub to kickstart the project. - 🛠️ Set up the MySQL database and Flask API server with Docker Compose. - 📄 Indexed PDFs and stored vectors in ChromaDB using Jupyter Notebook. - 🔍 The agent decides between executing API tools or performing semantic searches based on the prompt. - 🧠 Leveraged OpenAI’s GPT-4 for intelligent function mapping and response generation. - 🌐 Integrated APIs to fetch top-selling products, categories, sales trends, and revenue by category. - 📊 Utilized a vector database for efficient and accurate context retrieval. #tech #AI #DataScience - 🐳 Docker Compose simplifies the setup of database and API server. - 🧩 ChromaDB ensures fast and accurate retrieval of context from indexed PDFs. - 📈 The AI agent dynamically chooses the best method for data retrieval, improving efficiency. - 🛠️ Function calls enhance structured retrieval processes, tailoring responses to application needs. - 📊 APIs provide comprehensive sales data, enriching the AI agent's capabilities. - 🖥️ The Jupyter Notebook environment streamlines the indexing and validation process. - 🧠 Combining RAG with GPT-4 boosts the agent's response accuracy and relevance. https://lnkd.in/gVU7JDUr
How To Build an AI Agent That Uses RAG To Increase Accuracy
https://thenewstack.io
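The routing step described above, where the agent decides between executing an API tool and performing a semantic search, can be sketched as follows. The tool names and the keyword-based router are simplified stand-ins for the GPT-4 function-calling logic used in the article.

```python
def fetch_top_selling_products(prompt):
    # Stand-in for a real API call that returns structured sales data.
    return {"tool": "sales_api", "result": ["widget-a", "widget-b"]}

def semantic_search(prompt):
    # Stand-in for a vector-database (e.g. ChromaDB) similarity query.
    return {"tool": "vector_db", "result": "relevant PDF passages"}

def route(prompt):
    """Pick a tool for the prompt.

    A real agent would let the LLM choose via function calling; this sketch
    routes on keywords purely for illustration.
    """
    if "top-selling" in prompt or "revenue" in prompt:
        return fetch_top_selling_products(prompt)
    return semantic_search(prompt)

print(route("What are the top-selling products this month?")["tool"])  # sales_api
print(route("Summarize the warranty policy in the manual")["tool"])    # vector_db
```

The design point is the same as in the post: structured questions go to structured APIs, open-ended questions go to the vector store, and the LLM only has to generate the final answer from whichever context comes back.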
-
On March 7, MLCommons announced the release of Croissant, a metadata format to help standardize ML datasets. The release includes format documentation, an open-source library, and a visual editor, supported by industry leaders like Hugging Face, Google Dataset Search, Kaggle, and OpenML. See: https://lnkd.in/gsgVmUmB

Currently, there is no standardized method for organizing ML datasets, making them hard to find, understand, and use. Reusing existing datasets for training ML models can be challenging, primarily due to the substantial time required to comprehend the data, its structure, and the appropriate subsets to use as features. This issue, stemming from the diverse range of data representations and the unique, ad hoc organization within datasets that span text, structured data, images, audio, and video, severely hampers progress in ML, hurts productivity across the ML development cycle, and hinders the creation of advanced tools for dataset management.

While general-purpose metadata formats like schema.org and DCAT exist for structuring data on the web, they often do not meet the specific needs of ML data, such as extracting and combining data from various sources, including metadata for responsible data use, and delineating training, testing, and validation sets. Croissant aims to overcome these challenges by offering a metadata format that helps standardize ML datasets and enhances the discoverability and usability of datasets across tools and platforms.

Croissant can be seen as a standardized framework that organizes and describes datasets in a clear, uniform manner, streamlining the process of locating and applying datasets for ML projects. It does this by enhancing how datasets can be found online, adding specialized vocabulary to schema.org.
This addition, known as the Croissant vocabulary, improves how datasets are indexed and discovered through search engines such as Google Dataset Search, making it easier for users to find and access the data they need for machine learning and AI projects. Croissant introduces metadata to datasets in a way that doesn't change the original data. This metadata provides a standardized description of the dataset's contents, including key attributes and properties, so datasets can be more uniformly understood and utilized across different platforms and applications.

To enhance accessibility and interoperability, dataset creators are encouraged to adopt Croissant for descriptions, hosts to offer Croissant files for download, and tool developers to support Croissant in their data analysis and labeling applications, making datasets more discoverable and user-friendly.

Google Blog post by Omar Benjelloun, Software Engineer, Google Research, and Peter Mattson, Software Engineer, Google Core ML and President, MLCommons Association: https://lnkd.in/g-J98NCQ

#croissant #metadata
New Croissant Metadata Format helps Standardize ML Datasets - MLCommons
https://mlcommons.org
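Since Croissant builds on schema.org, its descriptions are JSON-LD documents. The snippet below is an illustrative, hand-rolled sketch of what such metadata can look like; the property names and values are hypothetical simplifications, and the MLCommons format documentation linked above defines the authoritative vocabulary.

```python
import json

# Illustrative Croissant-style dataset description: JSON-LD built on the
# schema.org vocabulary. Field names and values here are hypothetical.
metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-dataset",
    "description": "A small illustrative dataset description.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "FileObject", "name": "data.csv", "encodingFormat": "text/csv"}
    ],
}

# The description is plain JSON, so any tool can parse and index it.
serialized = json.dumps(metadata, indent=2)
print(serialized.splitlines()[1])
```

Because the description lives alongside the data rather than inside it, adding Croissant metadata never modifies the original dataset files, which is the non-invasive property the post emphasizes.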
-
🚀 𝐑𝐞𝐯𝐨𝐥𝐮𝐭𝐢𝐨𝐧𝐢𝐳𝐞 𝐘𝐨𝐮𝐫 𝐃𝐚𝐭𝐚 𝐌𝐚𝐧𝐚𝐠𝐞𝐦𝐞𝐧𝐭 𝐰𝐢𝐭𝐡 𝐕𝐞𝐜𝐭𝐨𝐫 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞𝐬 & 𝐂𝐡𝐫𝐨𝐦𝐚 𝐃𝐁! 🚀

In the era of Big Data and AI, traditional databases just don't cut it anymore. Enter 𝐕𝐞𝐜𝐭𝐨𝐫 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞𝐬, the game-changer that's taking data management to the next level! 🌟

🔍 𝐖𝐡𝐚𝐭 𝐢𝐬 𝐚 𝐕𝐞𝐜𝐭𝐨𝐫 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞?
A Vector Database is designed to handle high-dimensional vectors, making it perfect for tasks like similarity search, recommendation systems, and real-time AI applications. It's a database optimized for storing and querying vectorized data, such as embeddings generated from deep learning models. Think of it as a database designed for the AI-first world! 🌐

🌈 𝐖𝐡𝐲 𝐂𝐡𝐫𝐨𝐦𝐚 𝐃𝐁?
Meet Chroma 𝐃𝐁, the cutting-edge, open-source Vector Database that's making waves in the AI community! Here's why Chroma DB is your go-to solution:
𝟏. 🔥 𝐋𝐢𝐠𝐡𝐭𝐧𝐢𝐧𝐠-𝐅𝐚𝐬𝐭 𝐒𝐞𝐚𝐫𝐜𝐡: Perform millions of similarity searches per second, even in datasets with billions of entries.
𝟐. 💡 𝐈𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐭 𝐈𝐧𝐝𝐞𝐱𝐢𝐧𝐠: Chroma DB uses advanced algorithms like Hierarchical Navigable Small World (HNSW) graphs to ensure your data is always accessible, accurate, and up-to-date.
𝟑. 📈 𝐒𝐜𝐚𝐥𝐚𝐛𝐢𝐥𝐢𝐭𝐲: Built to scale with your needs, Chroma DB can handle everything from small projects to enterprise-level applications without breaking a sweat.
𝟒. 🛠️ 𝐒𝐞𝐚𝐦𝐥𝐞𝐬𝐬 𝐈𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐢𝐨𝐧: Chroma DB integrates effortlessly with your existing tech stack, including Python, Java, and more. Plus, it's open-source, so you can customize it to fit your exact needs!
𝟓. 🌐 𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐭𝐲-𝐃𝐫𝐢𝐯𝐞𝐧: With a vibrant and growing community, Chroma DB is continuously evolving. Contribute, learn, and grow with the best minds in the industry!

✨ 𝐀𝐦𝐚𝐳𝐢𝐧𝐠 𝐅𝐚𝐜𝐭𝐬 𝐀𝐛𝐨𝐮𝐭 𝐕𝐞𝐜𝐭𝐨𝐫 𝐃𝐚𝐭𝐚𝐛𝐚𝐬𝐞𝐬:
- 𝐀𝐈-𝐏𝐨𝐰𝐞𝐫𝐞𝐝 𝐒𝐞𝐚𝐫𝐜𝐡: Vector DBs like Chroma DB enable AI-powered search engines that understand context and semantics, not just keywords.
- 𝐁𝐞𝐲𝐨𝐧𝐝 𝐓𝐞𝐱𝐭: Perfect for handling multimedia data, such as images, audio, and video; anything that can be vectorized can be managed efficiently.
- 𝐑𝐞𝐚𝐥-𝐓𝐢𝐦𝐞 𝐑𝐞𝐜𝐨𝐦𝐦𝐞𝐧𝐝𝐚𝐭𝐢𝐨𝐧𝐬: Vector DBs are the backbone of recommendation systems used by tech giants to deliver personalized content in real-time. - 𝐅𝐮𝐭𝐮𝐫𝐞-𝐏𝐫𝐨𝐨𝐟: With the rapid advancements in AI, vector databases are set to become a cornerstone of modern data infrastructures. #artificialintelligence #machinelearning #deeplearning #genAI #database #chromadb #softwareengineer #vectordatabase
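Under the hood, the core operation a vector database optimizes is k-nearest-neighbor search. Production systems use approximate indexes such as HNSW to avoid scanning every vector; the brute-force version below is a useful mental model, not Chroma's actual implementation.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def knn(query, store, k=2):
    """Brute-force k-nearest-neighbor search by cosine similarity.

    This is an O(n) scan over every stored vector; approximate indexes
    like HNSW exist precisely to avoid this full scan at scale.
    """
    scored = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Tiny toy store: document id -> 3-dimensional embedding.
store = {
    "doc-cats":  [0.9, 0.1, 0.0],
    "doc-dogs":  [0.8, 0.2, 0.1],
    "doc-stock": [0.0, 0.1, 0.9],
}
print(knn([1.0, 0.0, 0.0], store))  # the two animal docs rank first
```

Everything a dedicated vector database adds (persistence, metadata filtering, approximate indexing, scaling) sits on top of this one primitive: find the stored vectors closest to a query vector.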
-
Big news in a week of big news, and a large step forward for the industry. The quality of the input (data) correlates strongly with the quality of the output (in this case, a response generated by an LLM).
-
🔥 Vector databases might be the wrong tool for your AI applications. Here's why: Managing vector embeddings across multiple databases is a nightmare of sync issues, stale data, and unnecessary complexity. Timescale just released an interesting take: treat embeddings like database indexes through "vectorizers" that automatically stay in sync with your source data. With just one SQL command, you can create and maintain your embeddings - no more juggling multiple databases or writing complex sync logic. Check out how pgai Vectorizer is reimagining vector search in this fascinating deep dive: https://lnkd.in/gk4Aky9g #AI #PostgreSQL #VectorDatabases #DataEngineering #MLOps
Vector Databases Are the Wrong Abstraction
timescale.com
-
Transforming a simple Chat with PDF demo into a production-grade RAG application that dynamically reads from ever-growing, real-time indexes is a massive engineering leap! It involves scaling the compute engine as document volumes grow and ensuring real-time updates. In this blog post, Prince Krampah shows how to build a RAG application on Wikipedia pages using:
- Tensorlake Indexify as a reliable ingestion engine to extract embeddings and named entities (NER) from web pages
- LlamaIndex for querying the indexes populated by Indexify
- Mistral AI's LLM for response synthesis
- LanceDB for storing indexes
🚀 Hello everyone! I'm excited to share my latest article on data frameworks for Large Language Models (LLMs) called Indexify from Tensorlake. This time, we're diving into some hands-on projects that showcase the power of Indexify and how it can enhance your LLM applications. Diptanu Gon Choudhury Tensorlake

🔍 What Is Indexify? Indexify is an open-source data framework featuring a real-time extraction engine and ready-made extraction adapters. It offers reliable out-of-the-box extraction for various unstructured data types, including documents, presentations, videos, and audio.

🛠 Project Overview: In this article, we'll cover building an online data ingestion pipeline that can:
- Load data from Wikipedia: import articles from Wikipedia
- Structured extraction: chunk Wikipedia documents into smaller sections
- Create embeddings for each chunk using Sentence Transformers
- Store embeddings in LanceDB
- Perform Named Entity Recognition using an LLM
- Build a Retrieval-Augmented Generation (RAG) application with OpenAI
- Develop a Streamlit RAG application using Llama-index and Mistral
- Interact with a visual UI of the extraction pipeline

🛠️ This comprehensive guide shows how Indexify seamlessly integrates into your existing LLM and orchestration stack, including tools like Llama-index and LangChain. Ready to dive in? Let's get started! 🚀

#DataFrameworks #LLM #AI #MachineLearning #Indexify #DataPipeline #OpenSource #RAG #LlamaIndex #LangChain #DataExtraction
Data Framework For LLMs Indexify | RAG Application Project
medium.com
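The "chunk Wikipedia documents into smaller sections" step in such a pipeline reduces to a simple sliding-window split. This is a minimal plain-Python sketch; the chunk size and overlap values are arbitrary illustrations, and a real pipeline would delegate this to an Indexify extractor.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighboring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

article = ("word " * 200).strip()  # stand-in for a Wikipedia article
chunks = chunk_text(article, chunk_size=100, overlap=20)
print(len(chunks), len(chunks[0]))
```

Each chunk would then be embedded (e.g. with Sentence Transformers) and written to the vector store, so retrieval later operates on passages rather than whole articles.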