One weird trick for reducing LLM parameter size.

"Distilling Step-by-step!" is an excellent paper from a few weeks back. Here's a breakdown of what we learned from this research:


We've known for a while that we can transfer tasks from larger to smaller models using knowledge distillation techniques. The general idea is to use a large foundational model (GPT-4, PaLM 2 Unicorn, LLaMA 70B) to create high-quality training pairs for a domain-specific task. An example might be training a smaller model for Q/A reasoning tasks in a vertical like law, medicine, or even narrower domains. The prior art is to generate plain [input, output] training pairs and train on those alone.
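
To make that concrete, here's a minimal sketch of the prior-art recipe in Python. The query_teacher() wrapper is hypothetical, a stand-in for whichever LLM API you actually call, and the prompt wording is illustrative:

```python
def query_teacher(prompt: str) -> str:
    """Hypothetical wrapper around a large foundation model's API."""
    raise NotImplementedError("wire this up to your LLM provider")

def build_distillation_pairs(questions: list[str]) -> list[dict]:
    """Prior art: label each domain-specific input with the teacher's
    answer only, producing plain [input, output] training pairs."""
    pairs = []
    for question in questions:
        answer = query_teacher(f"Answer the following question concisely:\n{question}")
        pairs.append({"input": question, "output": answer})
    return pairs
```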

Not only can foundational models generate correct answers, but with the right prompts, they can also generate "rationale traces." This is like asking a test taker to show their work on an exam. As prompt engineers have discovered, asking a foundational model to explain its reasoning often leads to better (more accurate, truthful, and explainable) generations.
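
In practice this is just a chain-of-thought-style prompt. Here's a sketch, reusing the hypothetical query_teacher() from above; the template and the "Answer:" split are illustrative choices, not the paper's exact prompt:

```python
COT_PROMPT = (
    "Q: {question}\n"
    "A: Let's think step by step, and give the final answer "
    "on a last line starting with 'Answer:'."
)

def query_with_rationale(question: str) -> tuple[str, str]:
    """Ask the teacher to show its work, then split the rationale
    trace from the final answer."""
    completion = query_teacher(COT_PROMPT.format(question=question))
    rationale, _, answer = completion.rpartition("Answer:")
    return rationale.strip(), answer.strip()
```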


The one weird trick is to use those same show-your-work outputs as extra supervision when training the student model. The paper frames distillation as a multi-task problem: the training pairs become [input, output] and [input, rationale], so the student learns both to answer and to explain.
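
Here's a minimal sketch of assembling that multi-task training set. The [label] and [rationale] task prefixes follow the paper; the field names are my own:

```python
def expand_with_rationales(examples: list[dict]) -> list[dict]:
    """Each raw example {"input", "output", "rationale"} becomes two
    training pairs: one supervised by the answer, one by the rationale."""
    pairs = []
    for ex in examples:
        pairs.append({"input": "[label] " + ex["input"],
                      "target": ex["output"]})
        pairs.append({"input": "[rationale] " + ex["input"],
                      "target": ex["rationale"]})
    return pairs
```

The student is trained on both kinds of pairs at once, with the overall loss being the label loss plus a weighted rationale loss. At deployment time only the [label] prefix is used, so the student never pays the latency cost of generating rationales.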

Not only does this improve the task accuracy of the student model, but it also reduces the amount of training data needed!

"Today's LLMs are capable of explaining their predictions by generating high-quality reasoning steps (Wei et al., 2022; Kojima et al., 2022). These reasoning steps are used to augment input prompts to LLMs, improving their few-shot or zeroshot performance".
...
we leverage generated rationales as informative supervision to train smaller taskspecific models, i.e. models that can be deployed without incurring large computation or memory costs."
