Very interesting details in the Gemma 2 paper now up on arXiv, including, for example, ablation results on some of the new techniques introduced in this version of the model. https://lnkd.in/giX7Uwmj
Xavier (Xavi) Amatriain’s Post
More Relevant Posts
-
In this episode, we discuss More Agents Is All You Need by Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye. The study demonstrates that the effectiveness of large language models (LLMs) improves when more instances of the model (agents) are used in a simple sampling-and-voting technique. This technique can be combined with other advanced methods to further improve LLM performance, especially for more challenging tasks. Extensive experimentation across various benchmarks confirms these results, and the researchers have made their code accessible to the public.
arxiv preprint - More Agents Is All You Need
podbean.com
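A minimal sketch of the sampling-and-voting procedure described above, assuming a generic `llm_sample(prompt) -> str` callable (a placeholder, not an API from the paper); for open-ended tasks the paper votes by answer similarity rather than exact string matching.

```python
from collections import Counter

def sample_and_vote(llm_sample, prompt, n_agents=10):
    """Query the same model n_agents times and return the most common answer.

    llm_sample is any callable that returns one sampled answer string for the
    prompt (e.g. an API call with temperature > 0).
    """
    answers = [llm_sample(prompt) for _ in range(n_agents)]
    # Majority vote over the sampled answers; exact-match voting is the
    # simplest case, suitable for tasks with short canonical answers.
    return Counter(answers).most_common(1)[0][0]
```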
-
In this episode, we discuss System 2 Attention (is something you might need too) by Jason Weston, Sainbayar Sukhbaatar. The paper introduces System 2 Attention (S2A), an approach that improves Transformer-based Large Language Models by regenerating input contexts to focus on relevant information before processing, thereby enhancing the generation of the next token. S2A was created to address the problem of standard soft attention mechanisms that often integrate distracting information into outputs. In testing, S2A demonstrated superior performance by producing more factual, objective, and less biased responses on tasks such as question answering, math word problems, and longform content generation.
arxiv preprint - System 2 Attention (is something you might need too)
podbean.com
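A rough sketch of the two-pass idea, assuming a generic `generate(prompt) -> str` completion callable; the prompt wording is illustrative, not the paper's prompts verbatim.

```python
def system2_attention(generate, context, question):
    """Two-pass prompting in the spirit of System 2 Attention (S2A)."""
    # Pass 1: regenerate the context, keeping only material relevant to the
    # question and dropping opinions and distractors.
    rewrite_prompt = (
        "Extract the parts of the following text that are relevant to the "
        "question, removing opinions and irrelevant details.\n\n"
        f"Text: {context}\n\nQuestion: {question}\n\nRelevant text:"
    )
    filtered_context = generate(rewrite_prompt)
    # Pass 2: answer using only the regenerated context.
    answer_prompt = (
        f"Context: {filtered_context}\n\nQuestion: {question}\n\nAnswer:"
    )
    return generate(answer_prompt)
```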
-
In this episode, we discuss Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention by Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal. The paper presents a novel method for enabling Transformer-based Large Language Models to process extremely long inputs while keeping memory and computational requirements fixed. The technique introduced, called Infini-attention, blends a new form of memory-augmented attention with local and linear long-term attention within a single Transformer layer. The effectiveness of this method is demonstrated through impressive performance on long-context challenges, including a one million length sequence task and a half-million word book summarization, while maintaining efficient streaming capabilities and a minimal increase in memory parameters.
arxiv preprint - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
podbean.com
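To make "memory-augmented attention blended with local attention in a single layer" concrete, here is a heavily simplified single-head sketch of one segment step: a linear-attention compressive memory, local causal softmax attention, and a learned gate mixing the two. The feature map, shapes, and gating follow my reading of the description and are not a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def elu1(x):
    # sigma(x) = ELU(x) + 1, a standard linear-attention feature map
    return F.elu(x) + 1.0

def infini_attention_segment(q, k, v, memory, z, beta):
    """One segment of a simplified Infini-attention-style head.

    q, k, v: (seq, d) projections for the current segment.
    memory:  (d, d_v) compressive memory carried over from earlier segments.
    z:       (d,) normalization term for the memory.
    beta:    scalar gate parameter mixing memory and local attention.
    """
    # Retrieve from the long-term compressive memory (linear-attention read).
    sq = elu1(q)
    mem_out = (sq @ memory) / (sq @ z).clamp_min(1e-6).unsqueeze(-1)
    # Standard causal softmax attention within the local segment.
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    causal_mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    local_out = scores.masked_fill(causal_mask, float("-inf")).softmax(-1) @ v
    # Gate between memory retrieval and local attention, then update the
    # memory with the current segment's keys and values.
    gate = torch.sigmoid(torch.as_tensor(beta))
    out = gate * mem_out + (1 - gate) * local_out
    sk = elu1(k)
    memory = memory + sk.T @ v
    z = z + sk.sum(0)
    return out, memory, z
```

Because the memory has a fixed shape no matter how many segments have been processed, memory use stays constant as the input grows, which is the point of the method.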
-
https://t.co/iRQ9ofcW7q Ensembles of LLM agents are effective, and more (and more diverse) agents are better. This tracks with what we know of ensembles generally: adding an uncorrelated weak estimator to an ensemble improves the ensemble. Next advances down this line will include random agents and dynamically composed agents, possibly with a differentiable expression of "viewpoint" with regard to the target space. This will enable the equivalent of random forests and gradient boosted trees with agents as estimators.
More Agents Is All You Need
arxiv.org
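The "adding an uncorrelated weak estimator improves the ensemble" intuition is easy to check numerically: for a binary task with independent voters that are each right 60% of the time, majority-vote accuracy climbs steadily with the number of voters (independence is the idealized assumption doing the work here).

```python
from math import comb

def majority_vote_accuracy(p_single, n_voters):
    """Probability that a majority of n independent voters, each correct with
    probability p_single, is correct (odd n, binary task)."""
    k_needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_single**k * (1 - p_single) ** (n_voters - k)
        for k in range(k_needed, n_voters + 1)
    )

for n in (1, 5, 15, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
# Accuracy rises from 0.6 toward 1.0 as more independent weak voters are added.
```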
-
Are Transformers still dominant? Yes, but competition is heating up. State Space Models (SSMs), like Mamba, are closing the gap. A new framework, State Space Duality (SSD), unifies SSMs and attention models, leading to faster, competitive architectures like Mamba-2. With recent LSTM advancements, SSMs are strong challengers. Check out the original post for more details!
Is the Transformer's reign coming to an end? Not yet. But its competitors are closing the gap. State Space Models (SSMs) have been making waves this year, with architectures like Mamba matching or even outperforming Transformers at small to medium scale. But are these two model families really that different? A new paper suggests otherwise. The researchers propose a framework called State Space Duality (SSD) that unifies SSMs and variants of attention through various decompositions of structured semiseparable matrices. Phew, what a sentence. By leveraging these insights, they designed Mamba-2, a new architecture that refines Mamba's selective SSM core layer to be 2-8X faster while maintaining competitive performance with Transformers on language modeling tasks. Together with the LSTM revival (mLSTM & xLSTM) that I covered recently, SSMs are shaping up to become serious competitors. 2024 will continue to be heated, don't you think? ↓ Liked this post? Join my newsletter with 25k+ readers that breaks down all you need to know about the latest LLM research: llmwatch.com 💡
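The "duality" is easiest to see on a toy scalar-state SSM: the linear-time recurrence and a lower-triangular (semiseparable) matrix applied to the whole sequence, much like a masked attention matrix, produce identical outputs. A toy sketch of that equivalence (my own example, not the paper's code):

```python
import numpy as np

# Toy scalar-state SSM: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t
T = 6
rng = np.random.default_rng(0)
a, b, c, x = (rng.uniform(0.1, 0.9, T) for _ in range(4))

# 1) Recurrent (linear-time) form.
h, y_rec = 0.0, []
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_rec.append(c[t] * h)

# 2) Equivalent matrix ("attention-like") form: y = M @ x with
#    M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, a lower-triangular
#    1-semiseparable matrix, and 0 above the diagonal.
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1 : t + 1]) * b[s]

assert np.allclose(y_rec, M @ x)
```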
-
Learn more about MonoFormer, One Transformer for Both Diffusion and Autoregression. Also learn about LLM hallucination by seeing what o1-mini has to say about MonoFormer. arXiv link to the paper: https://lnkd.in/gTd-bWwi Hallucination Station: https://lnkd.in/gk2iskT4 Twitter (X) thread on MonoFormer: https://lnkd.in/g2K2RD-N
Thread by @_akhaliq on Thread Reader App
threadreaderapp.com
-
In this episode, we discuss Memory Mosaics by Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou. Memory Mosaics are collective networks designed for prediction tasks, utilizing associative memories in a collaborative manner. These networks offer a simpler and more transparent alternative to transformers, maintaining comparable abilities in compositional learning and learning in context. The effectiveness of Memory Mosaics is established through medium-scale language modeling experiments, outperforming or matching the performance of transformers.
arxiv preprint - Memory Mosaics
podbean.com
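As a concrete, deliberately simplified picture of the associative memories involved: each unit stores (key, value) pairs as the sequence is read and answers a query with a kernel-weighted average of the stored values. The class below is my illustration of that idea, not the paper's implementation.

```python
import torch

class AssociativeMemory:
    """Store (key, value) pairs and retrieve by kernel-weighted averaging."""

    def __init__(self, bandwidth=1.0):
        self.keys, self.values, self.bandwidth = [], [], bandwidth

    def store(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def retrieve(self, query):
        keys = torch.stack(self.keys)      # (N, d)
        values = torch.stack(self.values)  # (N, d_v)
        # Gaussian-kernel weights over stored keys: nearby keys dominate.
        dists = ((keys - query) ** 2).sum(-1)
        weights = torch.softmax(-dists / (2 * self.bandwidth**2), dim=0)
        return weights @ values
```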
-
Check out our new paper - Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines. What if we could visualize language models’ computation process - with images? The Diffusion Lens passes intermediate representations of the text encoder directly to the diffusion model, to use as guidance for the image generation process. As a result, the Diffusion Lens allows us to take snapshots of the text encoding process, layer by layer. Using the Diffusion Lens, we analyzed two open-source text-to-image models, Stable Diffusion and DeepFloyd, revealing insights about their conceptual combination and their memory (knowledge) retrieval abilities. This was a great collaboration with Michael Toker, Hadas Orgad, Mor Ventura, and Yonatan Belinkov. Paper: https://lnkd.in/d6CXH6r5 Demo: https://lnkd.in/djn7su5a
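For readers who want to try something in this spirit with off-the-shelf tools, here is a rough sketch using Hugging Face diffusers: take hidden states from intermediate text-encoder layers and hand them to the pipeline via `prompt_embeds`. The checkpoint, the chosen layers, and the final-layer-norm step are my assumptions; the linked demo is the authoritative version.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a corgi wearing a crown"
tokens = pipe.tokenizer(
    prompt, padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
).to("cuda")

with torch.no_grad():
    enc = pipe.text_encoder(tokens.input_ids, output_hidden_states=True)
    for layer in (4, 8, 12):  # snapshots of the text-encoding process
        hidden = enc.hidden_states[layer]
        # Applying the encoder's final layer norm keeps the embeddings on the
        # scale the diffusion model was trained to expect.
        prompt_embeds = pipe.text_encoder.text_model.final_layer_norm(hidden)
        image = pipe(prompt_embeds=prompt_embeds).images[0]
        image.save(f"diffusion_lens_layer_{layer}.png")
```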
-
In this episode, we discuss LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning by Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, Xia Hu. The paper presents SelfExtend, a novel method for extending the context window of Large Language Models (LLMs) to better handle long input sequences without the need for fine-tuning. SelfExtend incorporates bi-level attention mechanisms to manage dependencies between both distant and adjacent tokens, allowing LLMs to operate beyond their original training constraints. The method has been tested comprehensively, showing its effectiveness, and the code is shared for public use, addressing the key challenge of LLMs' fixed sequence length limitations during inference.
arxiv preprint - LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
podbean.com
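The bi-level attention can be sketched as a remapping of relative positions: exact distances inside a local neighbor window, floor-divided ("grouped") distances beyond it, shifted so the two regimes meet at the window boundary. The helper below illustrates that mapping only (parameter names and defaults are mine, not the paper's).

```python
import torch

def self_extend_positions(seq_len, group_size=8, neighbor_window=64):
    """Relative-position matrix for a simplified view of SelfExtend.

    Entries within the neighbor window keep their exact relative distance;
    more distant entries are compressed by floor division so they stay inside
    the range of positions seen during pretraining.
    """
    q_pos = torch.arange(seq_len).unsqueeze(1)  # (L, 1) query positions
    k_pos = torch.arange(seq_len).unsqueeze(0)  # (1, L) key positions
    rel = q_pos - k_pos                         # standard relative distances
    # Grouped distances, shifted so they continue where the window ends.
    grouped = rel // group_size + (neighbor_window - neighbor_window // group_size)
    return torch.where(rel < neighbor_window, rel, grouped)
```

Only the lower-triangular (causal) entries matter in practice; the rest are removed by the usual causal attention mask.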