Markell Richards’ Post

Markell Richards

Software Architect at Groups360 | Veteran | MLOps & Kubernetes | AWS Certified

Insightful! Looking forward to building some neural networks next quarter and exploring Sebastian Raschka, PhD's new book "Build a Large Language Model From Scratch." I own a copy of each of Sebastian's books. I highly recommend him to anyone wanting to start their journey in the ML & AI space. #Transformers #AI #LLMs #DeepLearning

Sebastian Raschka, PhD

Machine learning and AI researcher • author of the "Build a Large Language Model From Scratch" book (amzn.to/4fqvn0D) • research engineer at Lightning AI • ex-statistics professor at University of Wisconsin-Madison

"What Matters In Transformers?" is an interesting paper (https://lnkd.in/g_Zqwf9M) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance. The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks: - Removing entire transformer blocks leads to significant performance degradation. - Removing MLP layers results in significant performance degradation. - Removing attention layers causes almost no performance degradation! In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar. The attention layers were not removed randomly but based on a cosine-based similarity score: If the input and output are very similar, the layer is redundant and can be removed. This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance. Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.

Sebastian Raschka, PhD

1mo

Markell Richards Thanks for getting a copy! Happy reading and coding!

Philipp Nell

Solution Architect for Data and AI, Technology Scout - Views are my own

1mo

It's an excellent book, enjoy!

Ekta Bhojwani

Computer Vision and ML Engineer @ IntelliSee | Masters | Deep Learning | MLOps | TensorRT | Higher Performance ML Benchmarking

1mo

Gladly second that! One of the best things that happened to me was learning about Sebastian Raschka and his contributions in the ML and AI space.
