Mark Kovarski’s Post


Responsible AI | Co-Founder | CTO | Enterprise | Automation

𝐉𝐮𝐥𝐲 18, 2024 - 𝐍𝐞𝐰 𝐀𝐈 𝐌𝐨𝐝𝐞𝐥 𝐫𝐞𝐥𝐞𝐚𝐬𝐞𝐬

A flurry of models arrived today... The idea is to make large models smaller and use synthetic data to make them as good as large models in terms of knowledge retention and acquisition. The future of LLMs is smaller and more focused, not solely larger and more AGI-like. Knowledge will become infinitely free...

⭐ 𝐃𝐞𝐞𝐩𝐒𝐞𝐞𝐤-𝐕2-0628
No. 1 open-source model on the LMSYS Chatbot Arena Leaderboard. Detailed Arena ranking: Overall No. 11, Hard Prompts No. 3, Coding No. 3, Longer Query No. 4, Math No. 7.
🔗 Release: https://lnkd.in/eSUEs_hx

⭐ 𝐆𝐏𝐓-4𝐨 𝐌𝐢𝐧𝐢
The new GPT-3.5 Turbo: 60% cheaper and outperforming it on every benchmark. It supports text and vision today, with "video and audio inputs and outputs" planned for the future, and it has a 128k context window.
🔗 Release: https://lnkd.in/evyCrqDr

⭐ 𝐌𝐢𝐬𝐭𝐫𝐚𝐥-𝐍𝐞𝐌𝐨 12𝐁
Trained in collaboration with NVIDIA. Outperforms Gemma2 9B and Llama3 8B. 128K context. Multilingual in 100+ languages: excels in European, Asian & Indian languages. Quantization-aware training at FP8. Apache 2.0 license.
🔗 Release: https://lnkd.in/e_weYSv2
🔗 Release (NVIDIA): https://lnkd.in/e6b6y7Zx

⭐ 𝐒𝐧𝐨𝐰𝐟𝐥𝐚𝐤𝐞 𝐀𝐫𝐜𝐭𝐢𝐜 𝐄𝐦𝐛𝐞𝐝 𝐌 𝐯1.5
Super pragmatic text embedding for English-language search! Same small size and Apache 2.0 license as v1.0. Adds excellent vector compression: 98% retrieval quality at 4% of the uncompressed vector size (a rough compression sketch follows at the end of this post).
🔗 Release: https://lnkd.in/e_7JrM9c

⭐ 𝐆𝐨𝐥𝐝𝐅𝐢𝐧𝐜𝐡
Combines the best parts of linear attention (via RWKV) and traditional Transformers to create something better than either one on its own. Hybrids are hot these days... Handles long contexts well and is strong at recall tasks (a toy hybrid sketch follows below).
🔗 Abs: https://lnkd.in/ehKsdDk8

⭐ 𝐀𝐑𝐂𝐄𝐄 𝐀𝐈 𝐌𝐨𝐝𝐞𝐥
Nova is a merge of Qwen2-72B-Instruct with a custom model tuned on a generalist dataset mixture (a generic merging sketch follows below). Approaches GPT-4 (May 2023) performance levels. Key capabilities: reasoning, creative writing, coding, general language understanding.
🔗 Release: https://lnkd.in/eW5peje4

⭐ 𝐌𝐢𝐱𝐞𝐝𝐛𝐫𝐞𝐚𝐝 𝐃𝐞𝐞𝐩𝐬𝐞𝐭-𝐦𝐱𝐛𝐚𝐢-𝐞𝐦𝐛𝐞𝐝-𝐝𝐞-𝐥𝐚𝐫𝐠𝐞-𝐯1
Open-source German/English embedding model. Fine-tuned on 30+ million pairs of high-quality German & English data. Optimized for retrieval tasks.
🔗 Release: https://lnkd.in/eANkiAeJ

Pretraining-scale synthetic data is the next frontier. In just 16 months, the price has dropped by 88% from GPT-3.5 Turbo-0301 to GPT-4o mini (the arithmetic is sketched below). Intelligence is becoming more affordable, making AI accessible to everyone. Cheaper tokens for everyone!
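
On the Arctic Embed compression claim: here is a rough numpy sketch of how dimension truncation plus 4-bit scalar quantization gets a vector down to roughly 4% of its float32 size. The 768→256 dimensions and the int4 bit-width are assumptions for illustration; Snowflake's actual compression recipe may differ.

```python
import numpy as np

# Illustration only: truncate a 768-dim float32 embedding to 256 dims,
# then scalar-quantize each remaining value to 4 bits.
rng = np.random.default_rng(0)
full = rng.normal(size=768).astype(np.float32)   # 768 * 4 bytes = 3072 bytes
trunc = full[:256]                               # keep the first 256 dims

# Uniform scalar quantization to 16 levels (4 bits per value).
lo, hi = trunc.min(), trunc.max()
codes = np.round((trunc - lo) / (hi - lo) * 15).astype(np.uint8)  # values 0..15
packed_bytes = codes.size // 2    # storage if two 4-bit codes are packed per byte

print(f"original:   {full.nbytes} bytes")             # 3072
print(f"compressed: {packed_bytes} bytes")            # 128
print(f"ratio:      {packed_bytes / full.nbytes:.1%}")  # ~4.2%

# Dequantize for approximate similarity search.
approx = codes / 15 * (hi - lo) + lo
```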
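
On the hybrid idea behind GoldFinch: the sketch below is only a toy illustration of mixing a linear-complexity token mixer with full softmax attention in one stack. A depthwise causal convolution stands in for the RWKV-style layers; this is not GoldFinch's actual architecture, and every dimension and layer count here is made up.

```python
import torch
import torch.nn as nn

class HybridBlockStack(nn.Module):
    """Toy hybrid: linear-cost token mixers in the lower layers,
    full softmax attention in the upper layers. Conceptual only."""

    def __init__(self, dim: int = 256, n_linear: int = 4, n_attn: int = 2, heads: int = 4):
        super().__init__()
        # Stand-in for an RWKV-style mixer: a depthwise causal conv is linear in sequence length.
        self.linear_layers = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2, groups=dim)
            for _ in range(n_linear)
        )
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(n_attn)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        seq = x.size(1)
        for conv in self.linear_layers:
            # Causal depthwise conv: trim the lookahead padding.
            h = conv(x.transpose(1, 2))[:, :, :seq].transpose(1, 2)
            x = x + h
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        for attn in self.attn_layers:
            h, _ = attn(x, x, x, attn_mask=mask, need_weights=False)
            x = x + h
        return x

x = torch.randn(2, 128, 256)
print(HybridBlockStack()(x).shape)   # torch.Size([2, 128, 256])
```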
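
On the Nova merge: model merging is, at its simplest, arithmetic over weight tensors. Below is a minimal PyTorch sketch of a plain linear (weight-averaging) merge of two checkpoints with identical architectures; Arcee's actual recipe is not described in the post, and the file names in the comments are placeholders.

```python
import torch

def linear_merge(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Weighted average of two state dicts with identical tensor shapes.

    This is the simplest possible merge; real recipes (SLERP, TIES, DARE, ...)
    are more involved, and Nova's exact method is not shown here.
    """
    merged = {}
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return merged

# Hypothetical usage with two architecture-compatible checkpoints (paths are placeholders):
# base = torch.load("qwen2-72b-instruct.pt", map_location="cpu")
# tuned = torch.load("custom-generalist-tune.pt", map_location="cpu")
# merged = linear_merge(base, tuned, alpha=0.5)
# torch.save(merged, "nova-style-merge.pt")
```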
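
And a back-of-the-envelope check on the 88% figure, using commonly cited list prices per million tokens as assumptions (gpt-3.5-turbo-0301 at $2.00/1M at its March 2023 launch, GPT-4o mini at $0.15/1M input and $0.60/1M output) and an assumed 4:1 input-to-output token mix:

```python
# Back-of-the-envelope check on the ~88% price drop.
# All figures are assumptions based on public list prices, not official statements.
gpt35_turbo_0301 = 2.00    # USD per 1M tokens at March 2023 launch
gpt4o_mini_input = 0.15    # USD per 1M input tokens
gpt4o_mini_output = 0.60   # USD per 1M output tokens

# Assume a 4:1 input-to-output token mix for a blended per-token price.
blended = (4 * gpt4o_mini_input + 1 * gpt4o_mini_output) / 5

drop = 1 - blended / gpt35_turbo_0301
print(f"blended GPT-4o mini price: ${blended:.2f}/1M tokens")   # $0.24
print(f"price drop vs. gpt-3.5-turbo-0301: {drop:.0%}")         # 88%
```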
