One weird trick for reducing LLM parameter size.

"Distilling Step-by-step!" is an excellent paper from a few weeks back. Here's a breakdown of what we learned from this research:


We've known for a while that we can transfer tasks from larger to smaller models using knowledge distillation techniques. The general idea is to use a large foundational model (GPT-4, PaLM 2 Unicorn, LLaMA 70B) to create high-quality training pairs for a domain-specific task. An example might be training a smaller model for Q/A reasoning tasks in a vertical like law, medicine, or even narrower domains. The prior art is to generate plain [input, output] training pairs and train on those alone.
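
To make that concrete, here's a minimal sketch of the prior-art recipe in Python. The query_teacher() wrapper is hypothetical, a stand-in for whichever LLM API you actually call, and the prompt wording is illustrative:

```python
def query_teacher(prompt: str) -> str:
    """Hypothetical wrapper around a large foundation model's API."""
    raise NotImplementedError("wire this up to your LLM provider")

def build_distillation_pairs(questions: list[str]) -> list[dict]:
    """Prior art: label each domain-specific input with the teacher's
    answer only, producing plain [input, output] training pairs."""
    pairs = []
    for question in questions:
        answer = query_teacher(f"Answer the following question concisely:\n{question}")
        pairs.append({"input": question, "output": answer})
    return pairs
```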

Not only can foundational models generate correct answers, but with the right prompts, they can also generate "rationale traces." This is like asking a test taker to show their work on an exam. As prompt engineers have discovered, asking a foundational model to explain its reasoning often leads to better (more accurate, truthful, and explainable) generations.
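
In practice this is just a chain-of-thought-style prompt. Here's a sketch, reusing the hypothetical query_teacher() from above; the template and the "Answer:" split are illustrative choices, not the paper's exact prompt:

```python
COT_PROMPT = (
    "Q: {question}\n"
    "A: Let's think step by step, and give the final answer "
    "on a last line starting with 'Answer:'."
)

def query_with_rationale(question: str) -> tuple[str, str]:
    """Ask the teacher to show its work, then split the rationale
    trace from the final answer."""
    completion = query_teacher(COT_PROMPT.format(question=question))
    rationale, _, answer = completion.rpartition("Answer:")
    return rationale.strip(), answer.strip()
```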


The one weird trick is to use those same show-your-work outputs as extra supervision when training the student model. The paper frames distillation as a multi-task problem: the training pairs become [input, output] and [input, rationale], so the student learns both to answer and to explain.
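
Here's a minimal sketch of assembling that multi-task training set. The [label] and [rationale] task prefixes follow the paper; the field names are my own:

```python
def expand_with_rationales(examples: list[dict]) -> list[dict]:
    """Each raw example {"input", "output", "rationale"} becomes two
    training pairs: one supervised by the answer, one by the rationale."""
    pairs = []
    for ex in examples:
        pairs.append({"input": "[label] " + ex["input"],
                      "target": ex["output"]})
        pairs.append({"input": "[rationale] " + ex["input"],
                      "target": ex["rationale"]})
    return pairs
```

The student is trained on both kinds of pairs at once, with the overall loss being the label loss plus a weighted rationale loss. At deployment time only the [label] prefix is used, so the student never pays the latency cost of generating rationales.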

Not only does this improve the task accuracy of the student model, but it also reduces the amount of training data needed!

"Today's LLMs are capable of explaining their predictions by generating high-quality reasoning steps (Wei et al., 2022; Kojima et al., 2022). These reasoning steps are used to augment input prompts to LLMs, improving their few-shot or zeroshot performance".
...
we leverage generated rationales as informative supervision to train smaller taskspecific models, i.e. models that can be deployed without incurring large computation or memory costs."
