🚀 𝗥𝗲𝗰𝗲𝗻𝘁 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝘀 𝗶𝗻 𝗔𝗜 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 🚀

𝗚𝗿𝗼𝗸𝗙𝗼𝗿𝗺𝗲𝗿: Graph Fourier Kolmogorov-Arnold Transformers
A novel Graph Transformer network that goes beyond conventional self-attention mechanisms.
1. Tackles the limitations of existing GTs in modeling complex node label patterns.
2. Incorporates learnable activation functions in the graph spectrum.
3. Demonstrates superior performance on multiple node and graph classification datasets.
Paper: https://lnkd.in/gDKKk-2x

𝗦𝗧𝗔𝗥 𝗔𝗧𝗧𝗘𝗡𝗧𝗜𝗢𝗡: Efficient LLM Inference Over Long Sequences
Recent Large Language Models (LLMs) handle contexts extending to millions of tokens, which strains inference.
1. A new attention mechanism that addresses the computational cost of long-context inference while preserving accuracy.
2. Opens up new applications, including repository-level code analysis, multi-document summarization, and large-corpus retrieval.
Paper: https://lnkd.in/gtj29ghz

𝗚𝗹𝗶𝘁𝗰𝗵 𝗧𝗼𝗸𝗲𝗻𝘀 𝗶𝗻 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀: Categorization Taxonomy and Effective Detection
1. Addresses glitch tokens that lead to hallucinations and erratic behavior in LLMs.
2. Introduces GlitchHunter, a novel clustering-based detection method.
3. Constructs a Token Embedding Graph (TEG) and clusters it for efficient detection.
4. Reports high precision and recall across multiple LLMs.
Paper: https://lnkd.in/gTVr8jgK

#MachineLearning #AIResearch #GraphTransformers #LLM #DeepLearning
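To make the glitch-token idea concrete, here is a toy sketch of the Token Embedding Graph intuition: connect tokens whose embeddings are nearly identical, then treat small, tightly connected clusters as glitch-token candidates. This is my own illustration, not the paper's GlitchHunter implementation; the threshold and cluster-size cutoff are arbitrary choices, and real candidates would still need to be verified against the model.

```python
import numpy as np

def glitch_token_candidates(embeddings, sim_threshold=0.9, max_cluster_size=50):
    """Toy illustration of the Token Embedding Graph (TEG) idea: connect tokens whose
    embeddings are nearly identical, then flag small, tightly knit clusters as
    glitch-token candidates. Thresholds are arbitrary; this is not GlitchHunter itself."""
    # Normalize rows so dot products become cosine similarities.
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                                        # (V, V) similarities
    adj = (sim > sim_threshold) & ~np.eye(len(sim), dtype=bool)    # TEG adjacency matrix

    # Connected components via DFS -> clusters of near-duplicate embeddings.
    visited, clusters = set(), []
    for start in range(len(adj)):
        if start in visited:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            comp.append(node)
            stack.extend(np.flatnonzero(adj[node]).tolist())
        clusters.append(comp)

    # Under-trained "glitch" tokens tend to collapse into small, very similar groups.
    return [c for c in clusters if 1 < len(c) <= max_cluster_size]

# Usage: pass a real embedding matrix, e.g. a Hugging Face model's
# get_input_embeddings().weight.detach().cpu().numpy(); random data is just a placeholder.
vocab = np.random.randn(1000, 64).astype(np.float32)
print(glitch_token_candidates(vocab))
```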
Shubham Saket’s Post
More Relevant Posts
-
KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression
Large language models (LLMs) built on the Transformer architecture have shown remarkable abilities across a wide range of tasks. However, these capabilities usually come with a significant increase in model size, resulting in substantial GPU memory costs during inference.
The KV cache is a popular technique in LLM inference: it stores the keys and values already computed in the attention layers so they can be reused in later decoding steps instead of being recomputed, which speeds up generation.
Read the full article: https://lnkd.in/e8qQTS3H
Paper: https://lnkd.in/eDEf3tsK
KVSharer: A Plug-and-Play Machine Learning Method that Shares the KV Cache between Layers to Achieve Layer-Wise Compression
https://www.marktechpost.com
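For readers newer to the topic, here is a minimal numpy sketch of what a KV cache does during autoregressive decoding. The names and shapes are purely illustrative, and the closing comment is only a loose simplification of the layer-sharing idea, not KVSharer's actual layer-selection strategy.

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention over everything cached so far."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 16
cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}   # one cache per layer in a real model

for step in range(5):                          # toy autoregressive decoding loop
    q = np.random.randn(d)                     # query for the newly generated token
    k, v = np.random.randn(d), np.random.randn(d)
    # Append only this step's key/value; earlier ones are reused, not recomputed.
    cache["K"] = np.vstack([cache["K"], k])
    cache["V"] = np.vstack([cache["V"], v])
    out = attend(q, cache["K"], cache["V"])

# Layer-wise sharing, very loosely: if several layers point at the same `cache` dict,
# only one copy of K/V is stored for that group of layers.
```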
-
Not all 🌍 layers are created equal; some are just 🚫 plain useless... This is what is 📚 discussed in "The Unreasonable Ineffectiveness of the Deeper Layers" 📉
Researchers from Meta, Cisco, Massachusetts Institute of Technology, and Sequoia Capital (feels like 💥 avengers assembling) empirically study a simple layer-pruning 🌿 strategy for popular families of open-weight pretrained Large Language Models (LLMs), finding minimal 📉 degradation of performance on different question-answering benchmarks until a large fraction (up to half) of the layers are removed 🚮. Interestingly, the results also suggest that shallow layers are disproportionately 🎯 important for model performance, allowing deeper layers to be deleted with minimal impact ⚖️.
👉👉Results:
👉Llama 70B: limited drop in accuracy even after 40% of layers removed 📉
👉Llama 13B: limited drop in accuracy even after 50% of layers removed 📉
👉Other models: limited drop in accuracy after 20-30% of layers removed 📉
To prune these models, they identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, they perform a small amount of finetuning 🔧. In particular, they use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of their experiments can be performed on a single A100 GPU 💻.
From a practical perspective, these results suggest that layer-pruning methods can complement other PEFT strategies to further reduce the computational resources of finetuning on the one hand 🤲, and can improve the memory and latency of inference on the other hand 🤚. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge 🧠.
Link: 👉 https://lnkd.in/ggWEV5q3
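A minimal sketch of the block-selection step, assuming you have captured per-layer hidden states on a small calibration batch. The paper measures a (angular) distance between representations; this toy version just uses cosine similarity between a block's input and output to find the block whose removal changes the representation the least. The "healing" QLoRA finetune mentioned above would follow the actual deletion.

```python
import numpy as np

def best_block_to_prune(hidden_states, n_prune):
    """hidden_states: list of (tokens, d) activations, one entry per layer boundary,
    collected on a calibration batch. Returns the start of the n_prune-layer block
    whose input and output are most similar, i.e. the block that changes the
    representation the least and is therefore the safest to drop."""
    best_start, best_sim = 0, -1.0
    for start in range(len(hidden_states) - n_prune):
        a = hidden_states[start].ravel()
        b = hidden_states[start + n_prune].ravel()
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start, best_sim

# Toy usage with fake activations for a 12-layer model (13 boundaries incl. embeddings).
states = [np.random.randn(8, 32) for _ in range(13)]
start, sim = best_block_to_prune(states, n_prune=4)
print(f"drop the 4 layers between boundary {start} and boundary {start + 4} (cos sim {sim:.3f})")
```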
-
Introducing Star Attention: A Breakthrough in Long-Sequence Efficiency for LLMs 🌟
Large language models (LLMs) are pushing boundaries with long-context capabilities, yet challenges remain: processing millions of tokens often means higher costs and slower inference. This is where Star Attention comes in, a novel block-sparse attention mechanism from NVIDIA designed to speed up inference over long sequences.
It uses a two-phase approach:
- Phase 1: Local context blocks are processed efficiently with anchor blocks to approximate global attention.
- Phase 2: Global query encoding ensures all relevant tokens are attended to for accurate token generation.
Key highlights:
- Efficient Inference: Star Attention reduces memory requirements and inference time by up to 11x while preserving 95-100% accuracy.
- Scalability: Compatible with Transformer-based LLMs, scaling seamlessly to 1M tokens without additional fine-tuning.
- Practical Gains: Enables repository-level code analysis, multi-document summarization, and more.
Impact: Star Attention balances computational efficiency with accuracy, setting a new standard for long-context processing. Future research will refine anchor block mechanisms for even greater scalability.
Read more here: https://lnkd.in/eM7pRetC
#AI #DeepLearning #LLMs #MachineLearning #StarAttention #NVIDIA #Efficiency #Innovation #GenAI #AttentionMechanisms
Star Attention: Efficient LLM Inference over Long Sequences
arxiv.org
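To illustrate the phase-1 idea, here is a rough numpy sketch of blockwise attention in which every block also attends to a shared anchor block of leading tokens. It omits causal masking, multi-head structure, and the phase-2 global query step; it is only my reading of the mechanism, not NVIDIA's implementation, and all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phase1_blockwise(Q, K, V, block=128, anchor=128):
    """Each context block attends only to itself plus a shared 'anchor' block of
    leading tokens instead of the full sequence: block-sparse, so cost grows with
    the number of blocks rather than quadratically with sequence length."""
    d = Q.shape[-1]
    K_anchor, V_anchor = K[:anchor], V[:anchor]
    outputs = []
    for start in range(0, len(Q), block):
        q = Q[start:start + block]
        if start == 0:                                   # first block is its own anchor
            k, v = K[:block], V[:block]
        else:
            k = np.concatenate([K_anchor, K[start:start + block]])
            v = np.concatenate([V_anchor, V[start:start + block]])
        outputs.append(softmax(q @ k.T / np.sqrt(d)) @ v)
    return np.concatenate(outputs)

# Toy usage: 1024 context tokens, a single 64-dim head.
T, d = 1024, 64
Q, K, V = (np.random.randn(T, d) for _ in range(3))
print(phase1_blockwise(Q, K, V).shape)   # (1024, 64)
```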
-
Scaling AI Models with LASP
The paper introduces Linear Attention Sequence Parallelism (LASP), an optimized strategy for handling long sequences in linear attention-based language models. LASP uses efficient point-to-point communication and kernel fusion to improve parallelism efficiency and scalability. Extensive experiments show that LASP scales sequence lengths up to 4096K tokens and trains significantly faster than existing sequence-parallelism methods.
#LASP #LinearAttention #SequenceParallelism #AIModelScaling #EfficiencyImprovement #Scalability #KernelFusion #ParallelismEfficiency #AIResearch #TechInnovation
Linear Attention Sequence Parallelism
arxiv.org
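LASP builds on the fact that causal linear attention carries its entire history in a fixed-size running state, which is what makes the sequence dimension easy to split across devices. The sketch below shows only that single-device recurrence (my illustration; LASP's actual point-to-point communication scheme and fused kernels are not shown, and the feature map is a common placeholder choice).

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention: replace softmax(QK^T)V with a feature map phi so the
    running sums S = sum(phi(k) v^T) and z = sum(phi(k)) summarize all history in a
    fixed-size state per head. That fixed-size state is what sequence parallelism
    can hand off between devices."""
    d = Q.shape[-1]
    S = np.zeros((d, d))          # running sum of outer(phi(k), v)
    z = np.zeros(d)               # running sum of phi(k) for normalization
    out = np.empty_like(V)
    for t in range(len(Q)):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

# Toy usage: 256 tokens, 32-dim head.
Q, K, V = (np.random.randn(256, 32) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (256, 32)
```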
-
Hey AI enthusiasts! 👋 Are you ready to dive into the latest advancements in AI model quantization? 🤖 I've curated a list of cutting-edge research papers that push the boundaries of quantization techniques, including state-of-the-art methods and groundbreaking 2-bit quantization approaches. Let's explore together!
1. State-of-the-Art AI Model Quantization Papers:
Paper 1: https://lnkd.in/dCaCEWdc
Paper 2: https://lnkd.in/dn9RgNG9
Paper 3: https://lnkd.in/dh5XVejH
Paper 4: https://lnkd.in/dTGT8C-Y
Paper 5: https://lnkd.in/dfswG9c4
These papers delve into advanced quantization methods that optimize model size and computational efficiency without compromising performance. They're a must-read for anyone interested in scaling AI models for deployment on resource-constrained devices.
2. Innovative 2-Bit Quantization Papers:
Paper 1: https://lnkd.in/d_hg452Q
Paper 2: https://lnkd.in/d992ihhJ
Paper 3: https://lnkd.in/dtTUg9bW
2-bit quantization is a hot topic in the AI community, promising significant reductions in model size and inference latency while maintaining competitive accuracy. These papers introduce novel techniques to tackle the challenges of low-bit quantization, opening new avenues for efficient AI deployment.
Excited to learn more? Stay tuned for my upcoming Medium post, where I'll provide an in-depth overview of all these quantization methods, breaking down the key concepts and implications for real-world applications. Don't miss out on the opportunity to stay at the forefront of AI optimization strategies! 🔍💡
Keep innovating,
Arvind
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
arxiv.org
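As a baseline for comparison with the papers above, here is a plain round-to-nearest symmetric quantizer; the linked methods add the pieces that make low-bit settings actually work (outlier handling, grouping, calibration, and so on), so treat this only as an illustration of the trade-off those papers attack. Running it shows the 2-bit reconstruction error is far larger than the int8 error, which is exactly the gap the 2-bit papers try to close.

```python
import numpy as np

def quantize_symmetric(W, bits=8):
    """Round-to-nearest symmetric quantization per output channel (row).
    A baseline sketch only; real low-bit methods add outlier handling,
    grouping, and calibration on top of ideas like this."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 127 for int8, 1 for 2-bit
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    W_q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.randn(16, 64).astype(np.float32)
W8, s8 = quantize_symmetric(W, bits=8)
W2, s2 = quantize_symmetric(W, bits=2)
print("int8 reconstruction error:", np.abs(W - dequantize(W8, s8)).mean())
print("2-bit reconstruction error:", np.abs(W - dequantize(W2, s2)).mean())
```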
-
One of the best things about this curriculum is that we will build a project and deploy it end to end. Understanding the complete lifecycle is a real value add.
𝐊𝐚𝐫𝐩𝐚𝐭𝐡𝐲’𝐬 𝐋𝐋𝐌101𝐧 𝐂𝐨𝐮𝐫𝐬𝐞 𝐒𝐲𝐥𝐥𝐚𝐛𝐮𝐬
Karpathy recently announced the LLM101n course. Here is the syllabus:
- Chapter 01 Bigram Language Model (language modeling)
- Chapter 02 Micrograd (machine learning, backpropagation)
- Chapter 03 N-gram model (multi-layer perceptron, matmul, gelu)
- Chapter 04 Attention (attention, softmax, positional encoder)
- Chapter 05 Transformer (transformer, residual, layernorm, GPT-2)
- Chapter 06 Tokenization (minBPE, byte pair encoding)
- Chapter 07 Optimization (initialization, optimization, AdamW)
- Chapter 08 Need for Speed I: Device (device, CPU, GPU, ...)
- Chapter 09 Need for Speed II: Precision (mixed precision training, fp16, bf16, fp8, ...)
- Chapter 10 Need for Speed III: Distributed (distributed optimization, DDP, ZeRO)
- Chapter 11 Datasets (datasets, data loading, synthetic data generation)
- Chapter 12 Inference I: kv-cache (kv-cache)
- Chapter 13 Inference II: Quantization (quantization)
- Chapter 14 Finetuning I: SFT (supervised finetuning SFT, PEFT, LoRA, chat)
- Chapter 15 Finetuning II: RL (reinforcement learning, RLHF, PPO, DPO)
- Chapter 16 Deployment (API, web app)
- Chapter 17 Multimodal (VQVAE, diffusion transformer)
LLM101n course details (in the comments)
#onlinecourse #karpathy #llms #generativeai #nlproc #deeplearning #transformers #genai
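If you want a taste of where the course starts, here is a from-scratch, Chapter-01-style character bigram model in a few lines of numpy. This is my own toy example, not Karpathy's course code.

```python
import numpy as np

def train_bigram(text):
    """Count-based bigram model: P(next char | current char), with +1 smoothing."""
    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}
    counts = np.ones((len(chars), len(chars)))
    for a, b in zip(text, text[1:]):
        counts[stoi[a], stoi[b]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return chars, stoi, probs

def sample(chars, stoi, probs, start, n=20, seed=0):
    """Generate n characters by repeatedly sampling from the bigram distribution."""
    rng = np.random.default_rng(seed)
    out, i = [start], stoi[start]
    for _ in range(n):
        i = rng.choice(len(chars), p=probs[i])
        out.append(chars[i])
    return "".join(out)

chars, stoi, probs = train_bigram("hello world, hello there")
print(sample(chars, stoi, probs, start="h"))
```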
-
Exciting strides in AI efficiency! 🚀 The introduction of MInference demonstrates the potential of dynamic sparse attention to significantly reduce latency in long-context LLMs. At Boltzmann, we're inspired by these advancements and committed to integrating such innovations to make AI more accessible and practical for all. 🤝 ⚡ https://lnkd.in/egC2MD2y
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
hqjiang.com
-
🗣️ Reminder of Extended Deadline! You still have the opportunity to share your research on AI compilers with the CODAI community!
CODAI Workshop, co-hosted at HiPEAC 2025
🗓 January 20th, 2025
📍 Barcelona, Spain
⏳ Abstract submission: November 1st, 2024
⏳ Paper submission: November 8th, 2024
⏳ Acceptance notification: December 7th, 2024
⏳ Camera-ready deadline: January 4th, 2025
The CODAI workshop is a great opportunity to share and learn about the latest developments in compilers for AI/ML in academia and industry.
🎯 The topics of interest include, but are not limited to:
- Compilers for AI: partitioning, mapping, retargeting, intermediate representations, (domain-specific) languages for heterogeneous systems and architectures, µC, DSP, etc.
- Compiler optimization techniques focused on enhancing performance: e.g. exploiting the memory hierarchy, offloading, asynchronous execution
- Compiler optimization techniques focused on efficiency and speed: compression, pruning, quantization techniques, hardware/compiler-aware neural architecture search, sparsity, auto-tuning, etc.
- Performance estimation for compiler technology: virtual prototyping, surrogate modeling, profiling, etc.
- Code generation and hardware backends for embedded AI accelerators; work targeting RISC-V-based accelerator platforms and beyond-von-Neumann AI accelerators is especially appreciated.
- Applications: compiler case studies for vision processing, sensor signal processing, novel brain-inspired algorithms for Edge AI, on-device learning, and distributed multimodal edge systems.
Looking forward to your submissions! https://lnkd.in/euEp-i8B
#edgeai #embeddedai #embeddedsystems #ai #compiler #accelerators #neuralnetworks #machinelearning #artificialintelligence #workshop #conference #airesearch #tinyml
-
PS: The content is yet to come, but it is worth following the repo as the material gets updated. Happy learning. #Learning
𝐊𝐚𝐫𝐩𝐚𝐭𝐡𝐲’𝐬 𝐋𝐋𝐌101𝐧 𝐂𝐨𝐮𝐫𝐬𝐞 𝐒𝐲𝐥𝐥𝐚𝐛𝐮𝐬
Karpathy recently announced the LLM101n course. Here is the syllabus:
- Chapter 01 Bigram Language Model (language modeling)
- Chapter 02 Micrograd (machine learning, backpropagation)
- Chapter 03 N-gram model (multi-layer perceptron, matmul, gelu)
- Chapter 04 Attention (attention, softmax, positional encoder)
- Chapter 05 Transformer (transformer, residual, layernorm, GPT-2)
- Chapter 06 Tokenization (minBPE, byte pair encoding)
- Chapter 07 Optimization (initialization, optimization, AdamW)
- Chapter 08 Need for Speed I: Device (device, CPU, GPU, ...)
- Chapter 09 Need for Speed II: Precision (mixed precision training, fp16, bf16, fp8, ...)
- Chapter 10 Need for Speed III: Distributed (distributed optimization, DDP, ZeRO)
- Chapter 11 Datasets (datasets, data loading, synthetic data generation)
- Chapter 12 Inference I: kv-cache (kv-cache)
- Chapter 13 Inference II: Quantization (quantization)
- Chapter 14 Finetuning I: SFT (supervised finetuning SFT, PEFT, LoRA, chat)
- Chapter 15 Finetuning II: RL (reinforcement learning, RLHF, PPO, DPO)
- Chapter 16 Deployment (API, web app)
- Chapter 17 Multimodal (VQVAE, diffusion transformer)
LLM101n course details (in the comments)
#onlinecourse #karpathy #llms #generativeai #nlproc #deeplearning #transformers #genai