Curtis Burkhalter, Ph.D.’s Post

Super interesting study about removing attention layers in LLMs with little performance degradation. I think as models become more and more commodified, it becomes less about having the biggest model and more about computational efficiency. #LLM #GenAI

Sebastian Raschka, PhD

Machine learning and AI researcher • author of the "Build a Large Language Model From Scratch" book (amzn.to/4fqvn0D) • research engineer at Lightning AI • ex-statistics professor at University of Wisconsin-Madison

"What Matters In Transformers?" is an interesting paper (https://lnkd.in/g_Zqwf9M) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance. The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks: - Removing entire transformer blocks leads to significant performance degradation. - Removing MLP layers results in significant performance degradation. - Removing attention layers causes almost no performance degradation! In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar. The attention layers were not removed randomly but based on a cosine-based similarity score: If the input and output are very similar, the layer is redundant and can be removed. This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance. Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.
