In this episode, we discuss Hymba: A Hybrid-head Architecture for Small Language Models by Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov. The paper introduces Hymba, a new family of small language models that combines transformer attention with state space models (SSMs) for better efficiency and performance. Each block uses a hybrid of attention heads, which provide high-resolution recall, and SSM heads, which efficiently summarize context, together with optimizations such as learnable meta tokens, cross-layer KV cache sharing, and partial sliding-window attention to shrink the cache. Experiments show that Hymba-1.5B-Base outperforms other models under 2B parameters, achieving higher accuracy, a smaller cache footprint, and higher throughput.
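For a concrete picture of the hybrid-head idea, the sketch below shows a block in which attention heads and an SSM branch read the same input in parallel and their normalized outputs are mixed with learnable scales. It is a minimal illustration, not the paper's implementation: the SimpleSSM module is a toy stand-in for Mamba-style heads, and all names and hyperparameters are invented.

import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy linear-recurrence branch: a stand-in for Mamba-style SSM heads."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.zeros(dim))    # learnable per-channel decay
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (batch, seq, dim)
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay)                  # keep the recurrence stable
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        outs = []
        for t in range(x.size(1)):                     # running summary of the context
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))

class HybridHeadBlock(nn.Module):
    """Parallel attention + SSM branches fused with per-channel learnable scales."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = SimpleSSM(dim)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)
        self.beta_attn = nn.Parameter(torch.ones(dim))
        self.beta_ssm = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)   # high-resolution recall
        ssm_out = self.ssm(x)                                 # efficient context summary
        return x + self.beta_attn * self.norm_attn(attn_out) \
                 + self.beta_ssm * self.norm_ssm(ssm_out)

x = torch.randn(2, 16, 64)
print(HybridHeadBlock(64)(x).shape)    # torch.Size([2, 16, 64])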
More Relevant Posts
-
Workshop on Large Language Models (LLMs) I'm taking part in a workshop on Large Language Models (LLMs). It is designed to explore the intricacies of LLMs, including their architecture, their applications, and the latest advancements in the field, and to provide a comprehensive understanding of how these models work and their impact on various industries.
-
"This course gives you a synopsis of the 𝐞𝐧𝐜𝐨𝐝𝐞𝐫-𝐝𝐞𝐜𝐨𝐝𝐞𝐫 architecture, which is a powerful and prevalent machine learning architecture for sequence-to-sequence tasks such as machine translation, text summarization, and question answering. "
-
Gain hands-on experience with the vision transformer (ViT) by following along with François Porcher's clear, step-by-step tutorial on implementing the architecture from scratch. A minimal patch-embedding sketch follows the link below.
How to Train a Vision Transformer (ViT) from Scratch
towardsdatascience.com
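As a taste of what "from scratch" involves, here is a minimal patch-embedding sketch, the first building block of a ViT: split the image into patches, project each one, prepend a class token, and add position embeddings. It is illustrative only and not taken from the linked tutorial.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Image -> sequence of patch tokens with a prepended [CLS] token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                              # x: (batch, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (batch, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                    # torch.Size([2, 197, 192])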
-
YOCO - You Only Cache Once - is an interesting and fascinating alternative to the standard Transformer architecture for LLMs. As you know, the Transformers' language modeling approach is extremely expensive, forcing the companies behind Large Language Models (LLMs) to spend billions, with a b, running these models. The YOCO architecture takes another, more interesting approach that, fascinatingly, leads to up to two orders of magnitude (roughly 100x) reductions in memory requirements and latency while being competitive, performance-wise, with current models, which sounds like a miracle. I think there is something to this new architecture, and it holds promise. Read more about it at: https://lnkd.in/g7yu64C8 https://lnkd.in/gBWFWtFk https://lnkd.in/g8E93Vxc #YOCO #TransformerArchitecture #LLMs #LanguageModels #AIResearch #MachineLearning #ArtificialIntelligence #DeepLearning #InnovativeTechnology #MemoryOptimization #LatencyReduction #PerformanceEngineering #TechInnovation #FutureOfAI #AIModeling #TechTrends #AIArchitecture #NeuralNetworks #ComputationalEfficiency
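As a rough illustration of the "cache once" idea (my reading of the paper, not its actual code): a stack of self-decoder layers builds the token representations, a single key/value cache is projected from them, and every subsequent cross-decoder layer attends to that one shared cache instead of maintaining its own. The toy sketch below uses vanilla PyTorch layers and made-up sizes; the real YOCO additionally relies on efficient self-attention variants and careful streaming of the cache, which this version omits.

import torch
import torch.nn as nn

class CacheOnceLM(nn.Module):
    """Toy decoder: self-decoder layers, then cross-decoder layers over one shared KV."""
    def __init__(self, dim=128, n_heads=4, n_self=2, n_cross=2):
        super().__init__()
        self.self_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_self)])
        self.kv_proj = nn.Linear(dim, dim)              # produces the single shared cache
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True)
             for _ in range(n_cross)])

    def forward(self, x):                               # x: (batch, seq, dim)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        for layer in self.self_layers:                  # build the representation
            x = layer(x, src_mask=causal)
        kv = self.kv_proj(x)                            # cached once, reused below
        for attn in self.cross_attn:                    # every layer reuses the same cache
            out, _ = attn(x, kv, kv, attn_mask=causal)
            x = x + out
        return x

print(CacheOnceLM()(torch.randn(2, 16, 128)).shape)     # torch.Size([2, 16, 128])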
-
Excited to share a new efficient small language model architecture with parallel Mamba and Attention fusion - Hymba. Details: https://lnkd.in/gFKfdG4h We study the tradeoff between Mamba and Attention: how they can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers. The team delivered an end-to-end solution featuring a novel architecture, data selection, a five-stage training setup, and both Base and Instruct models. The release comes with an open license. A standout result: the Hymba-1.5B Base model outperforms LLaMA 3.2-3B despite being trained on 7× fewer tokens, while achieving a 12× cache reduction. Model weights are coming soon (hopefully tomorrow). Stay tuned for more details on the architecture tomorrow.
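To make the cross-layer KV sharing mentioned above concrete, here is a toy sketch (not the Hymba implementation; the layer count, group size, and module names are invented for illustration) in which consecutive attention layers in a group reuse the same key/value projections, so only one KV cache per group needs to be kept.

import torch
import torch.nn as nn

class SharedKVAttentionStack(nn.Module):
    """Attention layers grouped so each group shares one K/V projection (one cache)."""
    def __init__(self, dim=128, n_heads=4, n_layers=4, group_size=2):
        super().__init__()
        self.group_size = group_size
        n_groups = n_layers // group_size
        self.kv_projs = nn.ModuleList([nn.Linear(dim, 2 * dim) for _ in range(n_groups)])
        self.q_projs = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)])

    def forward(self, x):                               # x: (batch, seq, dim)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        for i, attn in enumerate(self.attns):
            if i % self.group_size == 0:                # compute K/V once per group
                k, v = self.kv_projs[i // self.group_size](x).chunk(2, dim=-1)
            q = self.q_projs[i](x)                      # queries stay per-layer
            out, _ = attn(q, k, v, attn_mask=causal)
            x = x + out
        return x

print(SharedKVAttentionStack()(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 128])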
-
Our new hybrid model, Hymba, is on arXiv!
-
🚀 Excited to announce our new paper, VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding.
✨ Key Contributions:
- Hybrid Model Architecture: VideoGPT+ combines the strengths of image and video encoders, ensuring detailed spatial and robust temporal understanding (a toy sketch of this dual-encoder fusion follows below).
- Novel Dataset: Introduced a 112K video-instruction set, leveraging a semi-automatic annotation pipeline to boost model performance.
- Comprehensive Benchmark: VCGBench-Diverse evaluates video conversation LMMs across 18 video categories with 4,354 Q&A pairs, testing dense captioning, spatial and temporal comprehension, and complex reasoning.
Check out our full paper for more insights and results!
Paper: https://lnkd.in/dtrYgTPw
Code: https://lnkd.in/dYU-JkRD
Team: Muhammad Maaz, Salman Khan and Fahad Khan
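As a rough, hypothetical sketch of the dual-encoder fusion described in the first bullet (the real VideoGPT+ uses pretrained encoders, segment-wise frame sampling, and its own projector design; the dimensions and names below are made up): image-encoder features and video-encoder features are each projected into the language model's embedding space and concatenated into one visual token sequence.

import torch
import torch.nn as nn

class DualEncoderProjector(nn.Module):
    """Project image (spatial) and video (temporal) features into the LLM token space."""
    def __init__(self, img_dim=1024, vid_dim=768, llm_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, llm_dim)    # spatial tokens -> LLM space
        self.vid_proj = nn.Linear(vid_dim, llm_dim)    # temporal tokens -> LLM space

    def forward(self, img_feats, vid_feats):
        # img_feats: (batch, n_img_tokens, img_dim); vid_feats: (batch, n_vid_tokens, vid_dim)
        return torch.cat([self.img_proj(img_feats), self.vid_proj(vid_feats)], dim=1)

fusion = DualEncoderProjector()
tokens = fusion(torch.randn(2, 256, 1024), torch.randn(2, 128, 768))
print(tokens.shape)   # torch.Size([2, 384, 4096]) -> prepended to the text tokens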
-
This approach (the Hymba hybrid architecture shared above) is pivotal for developing specialized Small Language Models tailored to each vertical domain.
-
🚀 Optimizing Next-Gen Sequence Generation Models 🧠💡 The next generation of sequence generation models hinges on optimizing the compute-intensive multi-headed attention blocks, a critical lever for efficiency and scalability in Generative AI. Enter Hymba by NVIDIA, which leverages selective state-space models alongside attention, an approach that is quickly positioning it as a leader among compute-efficient Generative AI models. This is a game-changer for unlocking faster, more cost-effective, and high-performing AI solutions. Excited to see how these innovations redefine what's possible in AI! 🌟 #GenerativeAI #AIInnovation #DeepLearning #ComputeEfficiency
-
Is a revolution coming in the deep learning architecture world? Is this an early signal? Look at https://lnkd.in/geWuTJkA . It beats Mamba easily. Although it is at an early stage, it is a potential candidate to change the transformer world. One side note: this paper is the result of more than a year of effort from the team. Salute to the authors 👏🙏. #deeplearning #RNN #transformer #llm
I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models. We’ve designed a new architecture that unlocks linear-complexity models with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context. Arxiv to learn more: https://lnkd.in/gSCczEkF
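For intuition only, here is a generic example of what "linear complexity with expressive memory" can look like: a layer that keeps a fixed-size matrix memory updated once per token (a fast-weight / linear-attention style rule), so cost grows linearly with context length. This is not the paper's architecture, just a minimal illustration of the general idea.

import torch
import torch.nn as nn

class MatrixMemoryLayer(nn.Module):
    """Fixed-size matrix memory updated per token: O(seq) total, O(1) state per step."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (batch, seq, dim)
        b, seq, d = x.shape
        memory = torch.zeros(b, d, d, device=x.device)  # fixed-size state, not a KV cache
        q, k, v = self.q(x), self.k(x), self.v(x)
        outs = []
        for t in range(seq):
            # write: add the outer product of the current key and value to memory
            memory = memory + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
            # read: query the memory for this token
            outs.append(torch.bmm(q[:, t].unsqueeze(1), memory).squeeze(1))
        return torch.stack(outs, dim=1)

print(MatrixMemoryLayer()(torch.randn(2, 32, 64)).shape)   # torch.Size([2, 32, 64])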