Matt Kappes’ Post
The Era of 1-bit LLMs: “Every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.”
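The quantization step behind that claim is simple enough to sketch. Below is a minimal NumPy illustration of absmean-style ternary quantization: scale a weight matrix by its mean absolute value, then round and clip to {-1, 0, +1}. The function name and the post-hoc usage are mine; in BitNet b1.58 this happens inside quantization-aware training (with straight-through gradients), not as a one-shot conversion of an already-trained FP16 model.

```python
import numpy as np

def absmean_ternary_quantize(W: np.ndarray, eps: float = 1e-6):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    Absmean recipe: scale by the mean absolute weight, then round and
    clip to the ternary set. Returns the ternary matrix and the scale
    needed to dequantize (W is approximately scale * W_ternary).
    """
    gamma = np.mean(np.abs(W)) + eps                  # per-tensor absmean scale
    W_ternary = np.clip(np.round(W / gamma), -1, 1)   # values in {-1, 0, +1}
    return W_ternary.astype(np.int8), gamma

# Toy usage: quantize a random FP32 matrix and inspect the reconstruction error.
W = np.random.randn(256, 256).astype(np.float32)
W_q, scale = absmean_ternary_quantize(W)
W_hat = scale * W_q
print("unique values:", np.unique(W_q))               # [-1  0  1]
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```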
More Relevant Posts
-
"1-bit LLM variant in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption." Whitepaper: https://lnkd.in/dTAaqvjG Github: https://lnkd.in/dKFNV-DX Twitter: https://lnkd.in/djMjGYYF
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arxiv.org
-
The era of 1-bit LLMs. Key takeaways:
- Every single parameter (or weight) of the LLM is ternary {-1, 0, 1}
- Matches the full-precision Transformer LLM with the same model size
- More cost-effective in terms of latency, memory, throughput, and energy consumption
#artificialintelligence #largelanguagemodel
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arxiv.org
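The "1.58 bits" in the title is just the information content of a three-valued weight, log2(3) ≈ 1.585 bits. A quick check, plus one simple packing scheme (my own illustration, not the paper's storage format), is below.

```python
import math

# Information content of a ternary weight {-1, 0, 1}: log2(3) ≈ 1.585 bits.
print(f"{math.log2(3):.3f} bits per ternary weight")

# One simple packing: 3**5 = 243 <= 256, so 5 ternary weights fit in one byte,
# i.e. 8 / 5 = 1.6 bits per weight in practice, close to the 1.58-bit bound.
print(3 ** 5, "states fit in one byte ->", 8 / 5, "bits per weight")
```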
-
Extreme quantization in Q2 2024: BitNet b1.58 matches LLaMA's accuracy using 3x less memory and running 2.5x faster. Researchers converted every Float16 weight into a ternary value (-1, 0, or 1), which requires far less memory, while keeping the total number of model parameters constant.
2402.17764.pdf
arxiv.org
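To put the "3x less memory" figure in context, here is the back-of-the-envelope weight-storage arithmetic for a 3B-parameter model. The end-to-end savings reported for inference are smaller than the raw per-weight ratio, presumably because activations, the KV cache, and runtime buffers are not stored ternary; the numbers below are illustrative, not measurements.

```python
# Illustrative weight-storage comparison for a 3B-parameter model.
params = 3e9

fp16_gb    = params * 16 / 8 / 1e9    # 16 bits per weight          -> ~6.0 GB
ternary_gb = params * 1.6 / 8 / 1e9   # ~1.6 bits per packed weight -> ~0.6 GB

print(f"FP16 weights:    {fp16_gb:.1f} GB")
print(f"Ternary weights: {ternary_gb:.2f} GB")
```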
-
Neat-o: https://lnkd.in/g8PMDjwR. It shows that BitNet b1.58 starts to match the full-precision LLaMA LLM at the 3B model size in terms of perplexity, while being 2.71 times faster and using 3.55 times less GPU memory. In particular, BitNet b1.58 with a 3.9B model size is 2.4 times faster, consumes 3.32 times less memory, and performs significantly better than LLaMA LLM 3B.
2402.17764.pdf
arxiv.org
-
Very excited to share some of the research I've been doing on optimizing LLM inference using Block Floating Point 16 (BFP16), an effective 9-bit format and a viable alternative to FP8. Activation quantization is normally quite difficult, but we've shown that BFP16 is effective for quantizing both weights and activations without sacrificing model quality, providing a significant inference speed-up. Check out the blog post for more details! #llm #llama3 #quantization
Can we reduce the cost and latency of an LLM without impacting quality? This study into BFP16 quantization for both weights and activations draws some very interesting conclusions: https://lnkd.in/dQwcPi_9 #machinelearning #llama3
Optimizing Llama3: Leveraging Blockfloat16 for Weights and Activations
https://myrtle.ai
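For readers unfamiliar with block floating point: a block of values shares one exponent, and each value keeps only a small integer mantissa, so an 8-bit mantissa plus an 8-bit exponent shared across a block of 8 works out to roughly 9 bits per value. The sketch below is a toy version of that idea; the block size, mantissa width, and rounding choices are my assumptions, not necessarily what the linked study or the hardware BFP16 format uses.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, block_size: int = 8, mantissa_bits: int = 8):
    """Quantize a 1-D array to a toy block-floating-point format.

    Each block of `block_size` values shares one power-of-two exponent
    (chosen from the block's max magnitude); each value keeps only a
    signed `mantissa_bits`-bit integer mantissa. Returns the dequantized
    array so the rounding error can be inspected directly.
    """
    x = x.astype(np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    out = np.empty_like(blocks)
    qmax = 2 ** (mantissa_bits - 1) - 1              # e.g. 127 for an 8-bit mantissa
    for i, blk in enumerate(blocks):
        max_abs = np.max(np.abs(blk))
        if max_abs == 0:
            out[i] = 0
            continue
        # Shared exponent: smallest power of two so the largest mantissa fits in qmax.
        exp = np.ceil(np.log2(max_abs / qmax))
        scale = 2.0 ** exp
        mant = np.clip(np.round(blk / scale), -qmax, qmax)   # integer mantissas
        out[i] = mant * scale
    return out.reshape(-1)[: len(x)]

# Usage: quantize some activations and look at the relative error.
acts = np.random.randn(1024) * 3.0
acts_q = bfp_quantize(acts)
print("relative error:", np.linalg.norm(acts - acts_q) / np.linalg.norm(acts))
```

Because the scale is a power of two shared per block, the heavy arithmetic can run on small integer mantissas with a cheap per-block shift, which is broadly where block formats get their inference speed-up.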
-
A really nice result for Llama3 quantization using Block Float formats. Substantial improvements in compute performance with a simple numerical conversion.
Can we reduce the cost and latency of an LLM without impacting quality? This study into BFP16 quantization for both weights and activations draws some very interesting conclusions: https://lnkd.in/dQwcPi_9 #machinelearning #llama3
Optimizing Llama3: Leveraging Blockfloat16 for Weights and Activations
https://myrtle.ai
-
LLM in a flash... Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
‘LLM in a Flash: Efficient Large Language Model Inference With Limited Memory’ (PDF)
daringfireball.net
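The two optimization targets in that quote (move fewer bytes, read bigger contiguous chunks) fall straight out of a simple cost model: each flash read pays a fixed latency plus a bandwidth-limited transfer term, so many small reads are dominated by latency. The sketch below only illustrates the shape of such a model; the function name and constants are mine, not taken from the paper.

```python
def flash_read_cost_ms(total_bytes: float, chunk_bytes: float,
                       per_read_latency_ms: float = 0.1,
                       bandwidth_gb_s: float = 2.0) -> float:
    """Toy cost model for streaming `total_bytes` of weights from flash.

    Each read pays a fixed latency plus a bandwidth-limited transfer term,
    so fewer, larger contiguous chunks amortize the latency overhead.
    All constants are illustrative, not measurements from the paper.
    """
    num_reads = max(1, total_bytes / chunk_bytes)
    transfer_ms = total_bytes / (bandwidth_gb_s * 1e9) * 1e3
    return num_reads * per_read_latency_ms + transfer_ms

gb = 1e9
for chunk_kb in (4, 64, 1024):
    cost = flash_read_cost_ms(total_bytes=2 * gb, chunk_bytes=chunk_kb * 1024)
    print(f"{chunk_kb:5d} KiB chunks -> {cost:8.1f} ms")
```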
-
I have completed algorithms including searching (binary, linear, interpolation), sorting, and traversal (DFS, BFS, in-order, pre-order, post-order).
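For anyone who wants a concrete reference for two of those, here is a minimal Python version of binary search and breadth-first search; the data and structures are my own, purely for illustration.

```python
from collections import deque

def binary_search(arr, target):
    """Return the index of `target` in sorted `arr`, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def bfs(graph, start):
    """Breadth-first traversal of an adjacency-list graph; returns visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

print(binary_search([1, 3, 5, 7, 9], 7))                          # 3
print(bfs({"A": ["B", "C"], "B": ["D"], "C": [], "D": []}, "A"))  # ['A', 'B', 'C', 'D']
```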
-
Good presentation on transformer encoders vs. decoders: https://lnkd.in/eXf7q6s4
Stanford CS25: V4 I Hyung Won Chung of OpenAI
https://www.youtube.com/