That's right! It's a huge week for small language models (SLMs). A few new SLMs on my radar:

1) Mistral NeMo
Highlights:
- Introduced by Mistral + NVIDIA
- Apache 2.0 license
- Outperforms Gemma 2 9B and Llama 3 8B
- Multilingual capabilities
- Efficient tokenizer (Tekken)

2) GPT-4o mini
Highlight: "15 cents per million input tokens, 60 cents per million output tokens, MMLU of 82%, and fast."

3) SmolLM
Highlight: "SmolLM models: 135M, 360M, and 1.7B parameters; Smollm-Corpus curated from Cosmopedia v2, FineWeb-Edu, and Stack-Edu-Python."

4) Mathstral and Codestral Mamba
Highlights:
- Mathstral achieves 56.6% on MATH and 63.47% on MMLU.
- Codestral Mamba was tested on in-context retrieval up to 256k tokens and proves quite efficient thanks to the Mamba architecture.

5) H2O Danube3
Highlight: "After final tuning, they show strong performance on academic, chat, and fine-tuning benchmarks. H2O-Danube3 is efficient enough to run on modern smartphones, allowing local and fast processing on mobile devices."

I'll be doing a YT video on this, including capabilities and interesting ways to apply SLMs. Stay tuned! https://lnkd.in/e2WS8ksJ

Any others?
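Since part of the appeal of SLMs is that they run locally, here is a minimal sketch of poking at one of the smaller checkpoints with the Hugging Face transformers pipeline API. The repo id "HuggingFaceTB/SmolLM-135M" is my assumption of the published SmolLM-135M checkpoint name; swap in whichever of the models above your hardware can handle.

```python
# Minimal local test of a small language model with transformers.
# Assumption: "HuggingFaceTB/SmolLM-135M" is the published SmolLM-135M repo id.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM-135M",  # ~135M params, comfortable on CPU
)

prompt = "Small language models are useful because"
output = generator(prompt, max_new_tokens=40, do_sample=False)
print(output[0]["generated_text"])
```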
The DCLM models from Apple were published on HF - https://huggingface.co/apple/DCLM-7B
I guess my main interest is how number one compares with number two, technically speaking.
Interesting!
I'm pretty sure that Mistral NeMo beat *Llama 3 8B*, not the 2nd iteration of Llama.