Théo Boyer’s Post

Deep learning engineer

Llama 3 unveils a fundamental computing tradeoff that few are seeing ⬇️

🦙 Llama 3 just came out, and to the surprise of many, the L3-8B version can match L2-70B on several benchmarks. Several people have also argued that the Chinchilla paper (https://lnkd.in/dXaRqyea) was wrong, because Llama 3's scaling is very far from Chinchilla-optimal.

🤔 In my opinion, this stems from a misunderstanding of what the Chinchilla paper is about, and more generally of the effects of scale in deep learning. The bitter lesson (https://lnkd.in/d3sVYYgp) has become common knowledge in the field thanks to the GPT family, which popularized it. We hear everywhere that "bigger models are better." Yes, it's true, but in what sense? Many people have a mental model along the lines of "bigger models are more capable," meaning they can reach levels of performance that smaller models can't.

❌ This is not true for LLMs, mainly because we train on (most) tokens only once, so overfitting is not possible. The *real* advantage of bigger models over smaller ones is that they converge *faster*: you can reach a given level of NLL (modeling performance) in fewer training iterations.

❔ A natural question then arises: should I train a smaller model for more iterations, or a bigger model for fewer? That's where the Chinchilla paper provides an answer. The question it addresses is: "Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?" (implicitly, "to achieve the best possible NLL").

🧐 Coming back to Llama 3, the Llama team is obviously aware of this work, and there is a reason they didn't follow the recipe. My guess is that they factored inference cost into the balance (which is completely out of scope for the Chinchilla paper).
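To make the Chinchilla question concrete, here is a minimal sketch of the compute-optimal allocation. It assumes the standard estimate of ~6 FLOPs per parameter per training token, and the popular "~20 tokens per parameter" rule of thumb for the Chinchilla optimum (both are approximations, not exact results from the paper):

```python
import math

def chinchilla_optimal(flops_budget, tokens_per_param=20.0):
    """Roughly compute-optimal (params, tokens) for a fixed FLOPs budget.

    Assumptions: training cost C ~= 6 * N * D, and the optimum sits near
    D ~= 20 * N. Substituting: C = 6 * N * (20 * N) = 120 * N**2,
    hence N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

For a budget of ~5.76e23 FLOPs this gives roughly 69B parameters and 1.4T tokens, close to the Chinchilla-70B setup. Llama 3 sits nowhere near this point on purpose.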
When you "overtrain" an 8B model, you end up with a worse NLL-to-training-compute ratio, but a better NLL-to-inference-compute ratio. ✅ From a business perspective, that's a very powerful tradeoff, because you can choose to buy performance either at a fixed cost (training) or at a variable one (inference). Hope this post helped you! #Llama3 #Meta #LLM #AI
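The fixed-vs-variable cost tradeoff can be sketched with the usual FLOPs estimates (~6 FLOPs/param/token for training, ~2 FLOPs/param/token for inference). The token counts below are illustrative, not official figures:

```python
def train_flops(n_params, n_tokens):
    # ~6 FLOPs per parameter per training token (forward + backward pass)
    return 6.0 * n_params * n_tokens

def inference_flops_per_token(n_params):
    # ~2 FLOPs per parameter per generated token (forward pass only)
    return 2.0 * n_params

# An "overtrained" 8B run vs a Chinchilla-style 70B run:
small_train = train_flops(8e9, 15e12)    # ~7.2e23 FLOPs
big_train   = train_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs

# Comparable training budgets, but very different serving costs:
serving_ratio = inference_flops_per_token(70e9) / inference_flops_per_token(8e9)
```

Under these assumptions the two training runs cost roughly the same, yet every token served by the 70B model is ~8.75x more expensive, which is exactly where the overtrained small model wins back its extra training compute at scale.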


I don't fully grasp all the technical subtleties of large versus small models, but generally speaking, in the AI era, we shouldn't even be asking this question, which implies an opposition or competition between the two kinds of models. We should understand by now that all kinds of models matter and, above all, complement each other. The best choice will depend on each situation, each project, and each need. The future belongs to a diversity and multiplicity of AIs. Looking for the best model for every situation is a logic from another era, I find! lol And a model that offers a middle ground really has its place too; Meta broadens the diversity of possible applications by choosing this path. They have understood certain needs and answer them perfectly. It's precisely this logic of answering application needs that should guide engineers.
