KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression

In recent years, large language models (LLMs) built on the Transformer architecture have shown remarkable abilities across a wide range of tasks. These capabilities, however, usually come with a significant increase in model size, resulting in substantial GPU memory costs during inference. The KV cache is a standard technique in LLM inference: it stores the keys and values computed during attention so they can be reused in later decoding steps instead of being recomputed, which speeds up generation but adds to the memory footprint. Read the full article: https://lnkd.in/e8qQTS3H Paper: https://lnkd.in/eDEf3tsK
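For intuition, here is a minimal sketch of layer-wise KV sharing. KVSharer's actual strategy for choosing which layers share is described in the paper; `share_map` and `proj` are hypothetical names, and the single-head attention here is deliberately simplified.

```python
import torch

def attention_with_shared_kv(x, layer_idx, proj, kv_cache, share_map):
    # x: (seq_len, d_model); proj: this layer's q/k/v projection matrices.
    # share_map maps a layer index to the earlier "donor" layer whose
    # K/V cache it reuses, so shared layers skip storing their own K/V.
    q = x @ proj["q"]
    donor = share_map.get(layer_idx, layer_idx)
    if donor == layer_idx:                      # compute and cache this layer's K/V
        k, v = x @ proj["k"], x @ proj["v"]
        kv_cache[layer_idx] = (k, v)
    else:                                       # reuse the donor layer's cached K/V
        k, v = kv_cache[donor]                  # assumes the donor ran earlier in the pass
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Every layer that appears as a value in `share_map` keeps its cache; every layer that maps to a donor contributes no KV storage of its own, which is where the layer-wise compression comes from.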
Marktechpost Media Inc.’s Post
More Relevant Posts
-
Exciting strides in AI efficiency! 🚀 The introduction of MInference demonstrates the potential of dynamic sparse attention to significantly reduce latency in long-context LLMs. At Boltzmann, we're inspired by these advancements and committed to integrating such innovations to make AI more accessible and practical for all. 🤝 ⚡ https://lnkd.in/egC2MD2y
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
hqjiang.com
-
🚀 𝗥𝗲𝗰𝗲𝗻𝘁 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝘀 𝗶𝗻 𝗔𝗜 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 🚀

𝗚𝗿𝗼𝗸𝗙𝗼𝗿𝗺𝗲𝗿: Graph Fourier Kolmogorov-Arnold Transformers
A novel Graph Transformer network that goes beyond conventional self-attention mechanisms.
1. Tackles limitations of existing GTs in modeling complex node label patterns.
2. Incorporates learnable activation functions in the graph spectrum.
3. Demonstrates superior performance on multiple node and graph classification datasets.
Paper: https://lnkd.in/gDKKk-2x

𝗦𝗧𝗔𝗥 𝗔𝗧𝗧𝗘𝗡𝗧𝗜𝗢𝗡: Efficient LLM Inference Over Long Sequences
Recent Large Language Models (LLMs) can handle contexts extending to millions of tokens.
1. A new attention mechanism designed to address the computational challenges of long-context inference while preserving accuracy.
2. Opens up new applications, including repository-level code analysis, multi-document summarization, and large corpus retrieval.
Paper: https://lnkd.in/gtj29ghz

𝗚𝗹𝗶𝘁𝗰𝗵 𝗧𝗼𝗸𝗲𝗻𝘀 𝗶𝗻 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀: Categorization Taxonomy and Effective Detection
1. Addresses glitch tokens that lead to hallucinations and erratic behavior in LLMs.
2. Introduces GlitchHunter, a novel clustering-based detection method.
3. Constructs a Token Embedding Graph (TEG) and applies clustering for efficient detection (see the sketch after this post).
4. Achieves high precision and recall across multiple LLMs.
Paper: https://lnkd.in/gTVr8jgK

#MachineLearning #AIResearch #GraphTransformers #LLM #DeepLearning
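For intuition only, a rough sketch of the clustering idea behind glitch-token detection. GlitchHunter's actual Token Embedding Graph construction and detection pipeline are described in the paper; the cluster count and "smallest clusters are suspicious" heuristic here are placeholders for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_glitch_candidates(embeddings, n_clusters=8):
    # embeddings: (vocab_size, dim) token embedding matrix.
    # Cluster the embeddings and flag tokens in the smallest clusters,
    # which sit far from the bulk of the vocabulary, for closer inspection.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    sizes = np.bincount(labels, minlength=n_clusters)
    smallest = np.argsort(sizes)[:2]                 # the two smallest clusters
    suspicious = np.flatnonzero(np.isin(labels, smallest))
    return labels, suspicious                        # indices of candidate tokens
```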
2411.17116v1
arxiv.org
-
A couple of weeks back, several research papers surfaced proposing improved RNNs as possible enhancements to, or replacements for, the Transformer architecture. Another has now surfaced from Google: RecurrentGemma, which is based on the Griffin architecture and its sibling Hawk.

Hawk is a recurrent neural network (RNN) built on gated linear recurrences and designed to excel at long sequences. It surpasses existing RNN-style models such as Mamba on a range of tasks while remaining efficient to train and easy to scale.

Griffin combines gated linear recurrences with local attention. This mix lets it match the performance of Llama-2, a Transformer model, despite being trained on far less data. Even better, Griffin can work with sequences much longer than those it was trained on, so it handles really long pieces of text without breaking a sweat.

Both Hawk and Griffin use hardware efficiently during training and are faster than comparable models at inference time. The authors scaled Griffin up to a whopping 14 billion parameters and describe how to train these models efficiently across many devices.

In short, Hawk and Griffin work great on long sequences, are efficient to train, and scale well to large model sizes. https://lnkd.in/dbUBeeYs #generativeai #llm
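To make "gated linear recurrence" concrete, here is a minimal sketch of the building block; the actual RG-LRU layer in Hawk/Griffin adds further gating and normalization, and the weight names below are purely illustrative.

```python
import torch

def gated_linear_recurrence(x, W_a, W_g):
    # x: (seq_len, dim); W_a, W_g: (dim, dim).
    # Simplified recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * g_t * x_t,
    # where a_t and g_t are input-dependent gates in (0, 1).
    h = torch.zeros(x.shape[-1])
    outputs = []
    for x_t in x:
        a_t = torch.sigmoid(W_a @ x_t)   # recurrence gate: how much history to keep
        g_t = torch.sigmoid(W_g @ x_t)   # input gate: how much new input to admit
        h = a_t * h + (1 - a_t) * g_t * x_t
        outputs.append(h)
    return torch.stack(outputs)
```

Because the recurrence is linear in the hidden state, the state stays a fixed size no matter how long the sequence is, which is why these models scale to very long contexts.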
recurrentgemma-report.pdf
storage.googleapis.com
-
💥💥💥 Pushing the Limits of Large Language Model Quantization via the Linearity Theorem
Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, Dan Alistarh

Abstract
Quantizing large language models has become a standard way to reduce their memory and computational costs. Typically, existing methods focus on breaking down the problem into individual layer-wise sub-problems, and minimizing per-layer error, measured via various metrics. Yet, this approach currently lacks theoretical justification and the metrics employed may be sub-optimal. In this paper, we present a "linearity theorem" establishing a direct relationship between the layer-wise ℓ2 reconstruction error and the model perplexity increase due to quantization. This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, which outperforms all prior data-free approaches such as the extremely popular NF4 quantized format, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels which match a given compression constraint in the medium-bitwidth regime, obtained by reduction to dynamic programming. On the practical side, we demonstrate improved accuracy-compression trade-offs on Llama-3.1 and 3.2-family models, as well as on Qwen-family models. Further, we show that our method can be efficiently supported in terms of GPU kernels at various batch sizes, advancing both data-free and non-uniform quantization for LLMs.

👉 https://lnkd.in/d3ik9bhg #machinelearning
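As a rough illustration of the "Hadamard rotation + grid" recipe the abstract mentions: this is a sketch only, not the authors' HIGGS implementation, and the `grid` argument here is a placeholder rather than an MSE-optimal grid.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_and_quantize(W, grid):
    # W: (m, n) weight matrix with n a power of two; grid: 1-D array of levels.
    # Rotating with an orthonormal Hadamard matrix makes the entries look more
    # Gaussian, so a fixed scalar grid quantizes them with lower MSE; the
    # rotation is undone afterwards since H is orthogonal.
    n = W.shape[1]
    H = hadamard(n) / np.sqrt(n)
    W_rot = W @ H
    levels = np.asarray(grid)
    W_q = levels[np.abs(W_rot[..., None] - levels).argmin(axis=-1)]  # snap to nearest level
    return W_q @ H.T
```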
-
Large Language Models (LLMs) are a type of artificial intelligence designed to process and generate human language. Here's how they work:

Architecture: LLMs use a Transformer-based architecture, whose main components are:
1. Embedding layer: Maps input tokens to numerical representations (embeddings).
2. Self-Attention Mechanism: Lets the model weigh the relevance of other tokens when processing each token.
3. Decoder: Generates output text token by token. Most modern LLMs (GPT-style) are decoder-only; encoder-decoder models such as T5 add an encoder that processes the input first.

Training: LLMs are trained on massive text datasets with self-supervised objectives:
1. Next-Token Prediction (causal language modeling): The standard objective for decoder-only LLMs.
2. Masked Language Modeling: Random input tokens are masked and predicted; used by encoder models such as BERT.
3. Next Sentence Prediction: The model predicts whether two sentences are adjacent; also a BERT-style objective rather than a typical LLM one.

#LLM
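To make the self-attention step concrete, here is a minimal single-head sketch with a causal mask, as used in decoder-style models. Real implementations are multi-headed, batched, and heavily optimized; this is illustrative only.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # scaled dot-product scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                                 # each token attends only to the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over positions
    return weights @ V
```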
-
Depth-wise convolutions are often a significant bottleneck in deep learning vision inference as they aren't mapped to GPU tensor cores and run inefficiently with existing frameworks/runtimes. The PolyBlocks compiler, with its wholly automatic and portable code generation, provides significant improvements on such convolutions over Torch eager and Torch inductor (torch.compile) -- over 2.2x as fast. The sizes benchmarked here are from popular vision models (EfficientNet, ConvNeXt, MNASNet). Find more about PolyBlocks here: https://lnkd.in/gbBb7MeM Please email us at [email protected] if there are specific models you'd like us to try with PolyBlocks.
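For context, a depth-wise convolution convolves each channel independently (groups equal to channels). The snippet below is only an illustrative eager-vs-torch.compile micro-benchmark with placeholder shapes, assuming a CUDA GPU and PyTorch 2.x; it is not the PolyBlocks benchmark, and PolyBlocks itself is a separate compiler not shown here.

```python
import time
import torch

# Placeholder shape, not the exact EfficientNet/ConvNeXt/MNASNet sizes from the post.
x = torch.randn(32, 96, 112, 112, device="cuda", dtype=torch.float16)
conv = torch.nn.Conv2d(96, 96, kernel_size=3, padding=1, groups=96,
                       device="cuda", dtype=torch.float16)

def bench(fn, iters=50):
    for _ in range(10):              # warm-up
        fn(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print("eager   :", bench(conv))
print("compiled:", bench(torch.compile(conv)))
```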
-
Yesterday, Meta released the Llama 3.1 herd of models, with the 405B variant on par with GPT-4 on a lot of evals. They also released a whopping 92-page (!) research paper detailing the architecture and training recipe. I've read through it and highlighted some of the interesting aspects in this blog.

TL;DR
- three variants: 8B, 70B and 405B
- trained on a dataset of 15T tokens (compared to 1.8T for Llama 2)
- multilingual, code and tool-use capabilities
- vision and speech adapters for multimodality

https://lnkd.in/gDK_724T
Llama3: The 405-Billion Parameter Behemoth
aimaximus.substack.com
-
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
https://lnkd.in/ggbeKD-J

The paper proposes a new technique called GaLore (Gradient Low-Rank Projection) for training large language models (LLMs) in a memory-efficient manner. The key idea is to reduce the memory footprint of the gradients during the backward pass, which is a significant memory bottleneck in LLM training. GaLore works by projecting the gradients onto a lower-rank subspace using a low-rank approximation technique. This approximation reduces the memory required to store the gradients, enabling training of larger models with limited GPU memory.

The main steps of GaLore are:
1. During the forward pass, activations are computed and checkpointed (stored) as usual.
2. During the backward pass, instead of directly computing and storing the full-rank gradients, GaLore computes a low-rank approximation of the gradients.
3. The low-rank approximation is obtained by projecting the gradients onto a lower-dimensional subspace using a randomized singular value decomposition (SVD) technique.
4. This low-rank approximation of the gradients is then used for the parameter updates, instead of the full-rank gradients.

The authors argue that this low-rank projection introduces minimal accuracy loss while significantly reducing the memory footprint of the gradients, enabling training of larger LLMs with limited GPU memory. The paper presents theoretical analysis and empirical evaluations of GaLore, demonstrating its memory efficiency and negligible accuracy degradation compared to traditional full-rank gradient updates. The authors also discuss extensions of GaLore, such as adaptive rank selection and gradient recomputation, to further improve its performance and memory efficiency.

Overall, GaLore is proposed as a memory-efficient technique for training large language models, addressing the memory bottleneck caused by storing large gradients during the backward pass. By projecting gradients onto a low-rank subspace, it enables training of larger models on hardware with limited memory resources.
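A minimal sketch of the gradient low-rank projection idea, under simplifying assumptions: this is not the official GaLore code, which keeps Adam statistics in the low-rank space and refreshes the projection only periodically rather than at every step.

```python
import torch

def low_rank_projected_step(param, lr=1e-3, rank=4):
    # param.grad: (m, n) gradient of a weight matrix.
    # 1) Build a rank-r projector from the gradient's SVD.
    # 2) Form the compressed (r, n) gradient, so optimizer state lives in r x n.
    # 3) Apply a plain SGD-style update and project it back to the full shape.
    grad = param.grad
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                 # (m, r) projection matrix
    low_rank_grad = P.T @ grad      # (r, n) compressed gradient
    param.data -= lr * (P @ low_rank_grad)
```

The memory saving comes from the (r, n) compressed gradient and any optimizer state attached to it being much smaller than the full (m, n) gradient when r is small.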
-
📢 DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads 🚀

In this work, the authors tackle the significant memory and computational challenges of deploying long-context large language models (LLMs). The proposed framework, DuoAttention, introduces a distinction between Retrieval Heads and Streaming Heads to optimize the use of the KV cache, reducing memory usage by up to 2.55× and improving decoding speed by up to 2.18×, all while maintaining high accuracy.

Technical Highlights:
1. Retrieval Heads vs. Streaming Heads: DuoAttention divides attention heads into two categories. Retrieval Heads are crucial for processing long contexts, whereas Streaming Heads focus on recent tokens, allowing for a more efficient use of memory resources. This division reduces the computational load without compromising model accuracy.
2. KV Cache Optimization: The framework uses selective caching strategies, applying a full KV cache to Retrieval Heads while maintaining a constant-length cache for Streaming Heads (see the sketch after this post). This results in significant reductions in both memory usage and latency during long-context inference.
3. Scalable Long-Context Inference: With DuoAttention, models like Llama-3-8B can handle up to 3.3 million tokens on a single A100 GPU. This is made possible through careful optimization of memory allocation and a lightweight algorithm for identifying the critical heads.
4. Performance Improvements: DuoAttention achieves up to 2.55× memory reduction for multi-head attention (MHA) models and up to 1.67× for grouped-query attention (GQA) models. It also speeds up decoding by up to 2.18× for MHA models and up to 1.50× for GQA models, providing significant benefits for real-world applications of large language models.
5. Compatibility with Quantization: The method is fully compatible with quantization techniques, such as 8-bit weight and 4-bit KV cache quantization. This compatibility further boosts efficiency, enabling the model to handle larger contexts with reduced hardware requirements.

With DuoAttention, the team has achieved long-context capabilities for models like Llama-3-8B, enabling handling of up to 3.3 million tokens on a single A100 GPU. This advancement marks a significant step towards more efficient, scalable LLM deployments!

🔍 Paper: https://lnkd.in/dDFHimSA

#LLMs #MachineLearning #EfficientInference
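To illustrate the retrieval/streaming split described above, here is a sketch under assumed shapes; it is not the authors' implementation, and the `is_retrieval_head` mask plus the sink/recent window sizes are hypothetical inputs.

```python
import torch

def duo_style_kv_select(k_cache, v_cache, is_retrieval_head, sink=4, recent=256):
    # k_cache, v_cache: (num_heads, seq_len, head_dim); assumes seq_len > sink + recent.
    # Retrieval heads keep the full KV cache; streaming heads keep only a few
    # initial "sink" tokens plus the most recent tokens, so their cache size
    # stays constant as the context grows.
    seq_len = k_cache.shape[1]
    kept_k, kept_v = [], []
    for h in range(k_cache.shape[0]):
        if is_retrieval_head[h]:
            idx = torch.arange(seq_len)                        # full cache
        else:
            idx = torch.cat([torch.arange(sink),               # attention-sink tokens
                             torch.arange(seq_len - recent, seq_len)])  # recent window
        kept_k.append(k_cache[h, idx])
        kept_v.append(v_cache[h, idx])
    return kept_k, kept_v
```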