𝗟𝗹𝗮𝗺𝗮 𝟯 𝘂𝗻𝘃𝗲𝗶𝗹𝘀 𝗮 𝗳𝘂𝗻𝗱𝗮𝗺𝗲𝗻𝘁𝗮𝗹 𝗰𝗼𝗺𝗽𝘂𝘁𝗶𝗻𝗴 𝘁𝗿𝗮𝗱𝗲𝗼𝗳𝗳 𝘁𝗵𝗮𝘁 𝗳𝗲𝘄 𝗮𝗿𝗲 𝘀𝗲𝗲𝗶𝗻𝗴⬇️

🦙 Llama 3 just came out and, to the surprise of many, the L3-8B version can match the L2-70B one on several benchmarks. Several people also argued that the Chinchilla paper (https://lnkd.in/dXaRqyea) was wrong because the Llama 3 scaling is very far from Chinchilla-optimal.

🤔 In my opinion, this comes from a misunderstanding of what the Chinchilla paper is about and, more generally, of the effects of scale in deep learning. The bitter lesson (https://lnkd.in/d3sVYYgp) has become common knowledge in the field thanks to the GPT family that popularized it. We hear everywhere that "bigger models are better". Yes, it's true, but in what sense? Many people have a mental model along the lines of "bigger models are more capable", meaning they can reach levels of performance that smaller models can't.

❌ This is not true for LLMs, mainly because we train only once on most tokens, so overfitting is not possible. The 𝘳𝘦𝘢𝘭 advantage of bigger models over smaller ones is that they converge *faster*: you can reach a given level of NLL (modeling performance) with fewer training iterations.

❔ A natural question then arises: should I train a smaller model with more iterations, or a bigger model with fewer? That's where the Chinchilla paper brings an answer. The question they answer is: "Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?" (implying "to achieve the best possible NLL").

🧐 Coming back to Llama 3, the Llama team is obviously aware of this work, and there is a reason why they didn't follow the recipe. My guess is that they are factoring the inference cost into the balance (which is completely out of the scope of the Chinchilla paper). When one "overtrains" an 8B model, one ends up with a worse "NLL per unit of training compute" ratio, but also with a better "NLL per unit of inference compute" ratio.

✅ Now, from a business perspective, this is a very powerful tradeoff, because you can choose to buy performance either at a fixed cost (training) or at a variable one (inference).

Hope this post helped you! #Llama3 #Meta #LLM #AI
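To make the tradeoff concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the standard approximations of training FLOPs ≈ 6·N·D and inference FLOPs ≈ 2·N per generated token (N = parameter count, D = training tokens); the token counts and serving-traffic figure are illustrative assumptions, not Meta's actual numbers.

```python
# Back-of-the-envelope comparison: "overtrained" small model vs. Chinchilla-style scaling.
# Approximations: training FLOPs ~ 6*N*D, inference FLOPs ~ 2*N per generated token.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def inference_flops(n_params: float, n_generated_tokens: float) -> float:
    return 2.0 * n_params * n_generated_tokens

# Illustrative (hypothetical) setups -- not Meta's real training configs.
small = {"name": "8B overtrained",     "params": 8e9,  "train_tokens": 15e12}
large = {"name": "70B Chinchilla-ish", "params": 70e9, "train_tokens": 1.4e12}  # ~20 tokens/param

served_tokens = 1e12  # assumed lifetime inference traffic (pure assumption)

for m in (small, large):
    train = training_flops(m["params"], m["train_tokens"])
    serve = inference_flops(m["params"], served_tokens)
    print(f'{m["name"]:>20}: train {train:.2e} FLOPs (paid once), '
          f'serve {serve:.2e} FLOPs (scales with traffic)')
```

The point of the sketch is only that the training term is paid once while the inference term grows with traffic, which is why overtraining a small model can win overall even though it is Chinchilla-suboptimal.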
Théo Boyer’s Post
More Relevant Posts
Building LLMs from scratch involves the following steps. The choice of pre-trained LLM, fine-tuning approach, and evaluation metrics depends on the specific requirements of the application. To start from scratch:

1. Collect a large, high-quality dataset for training the model. This typically involves scraping text data from the internet (websites, social media, academic papers, etc.) to create a diverse corpus.
2. Use a machine learning framework like TensorFlow or PyTorch to create the LLM architecture. These frameworks provide pre-built tools and libraries for training large neural networks.
3. Train the LLM in a self-supervised manner to predict the next word in the text, a technique known as autoregressive training (a minimal sketch follows this post).
4. Evaluate the LLM's performance using extrinsic methods like the Language Model Evaluation Harness, which tests the model on tasks like reasoning, problem solving, and common-sense inference.
5. Deploy the trained LLM as a web service using serverless technologies like AWS Lambda or containerization with Docker.
6. Use large clusters of GPUs or TPUs for parallel processing.

For domain-specific LLMs, the key techniques are:

1. Fine-tuning a foundational LLM like GPT or LLaMA on a curated, labeled dataset relevant to the target industry or use case.
2. Using supervised learning techniques to train the model on the domain-specific data, in contrast to the self-supervised pre-training of general LLMs.
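As a companion to step 3 above, here is a minimal, self-contained sketch of autoregressive next-token training in PyTorch. The tiny embedding-plus-linear model and the random token data are stand-in assumptions purely for illustration; a real LLM would use a transformer and a real tokenized corpus.

```python
import torch
import torch.nn as nn

# Toy autoregressive setup: predict token t+1 from token t.
vocab_size, embed_dim, batch, seq_len = 1000, 64, 8, 32

# Stand-in model: embedding + linear head (a real LLM would be a transformer).
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Random "corpus" just to make the example runnable end to end.
tokens = torch.randint(0, vocab_size, (batch, seq_len))

inputs, targets = tokens[:, :-1], tokens[:, 1:]            # shift by one position
logits = model(inputs)                                      # (batch, seq_len-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
print(f"next-token cross-entropy: {loss.item():.3f}")
```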
OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there are only two techniques that scale indefinitely with compute: learning and search. It's time to shift focus to the latter.

1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts in order to perform well on benchmarks like trivia QA. It is possible to factor reasoning out from knowledge, i.e. a small "reasoning core" that knows how to call tools like a browser and a code verifier. Pre-training compute may be decreased.

2. A huge amount of compute is shifted to serving inference instead of pre-/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process resembles a well-studied problem like AlphaGo's Monte Carlo tree search (MCTS).

3. OpenAI must have figured out the inference scaling law a long time ago, which academia is only recently discovering. Two papers came out on arXiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder increases from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.

4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild: how do you decide when to stop searching? What's the reward function? The success criterion? When do you call tools like a code interpreter in the loop? How do you factor in the compute cost of those CPU processes? Their research post didn't share much.

5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples containing both positive and negative rewards. This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo's value network (used to evaluate the quality of each board position) improves as MCTS generates more and more refined training data.

#openai #ai #artificialintelligence #nvidia #samaltman
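To make the repeated-sampling idea in point 3 concrete, here is a minimal best-of-N sketch: draw several candidate solutions from a model and keep one that passes a verifier. The `generate_candidate` and `verifier` functions are placeholders assumed for illustration; they are not OpenAI's or any specific library's API, and the toy number-guessing task only stands in for a real generator plus unit-test-style checker.

```python
import random
from typing import Callable, Optional

def best_of_n(generate_candidate: Callable[[str], str],
              verifier: Callable[[str, str], bool],
              problem: str,
              n_samples: int = 250) -> Optional[str]:
    """Repeated sampling: return the first candidate the verifier accepts."""
    for _ in range(n_samples):
        candidate = generate_candidate(problem)   # one rollout of the "simulator"
        if verifier(problem, candidate):          # e.g. unit tests, a proof checker
            return candidate
    return None  # search budget exhausted

# Toy stand-ins so the sketch runs end to end: "solve" by guessing a number.
secret = 42
toy_generate = lambda problem: str(random.randint(0, 99))
toy_verify = lambda problem, answer: int(answer) == secret

print(best_of_n(toy_generate, toy_verify, "guess the number", n_samples=250))
```

Pass@k-style gains like the 15.9% → 56% jump quoted above come purely from spending more inference compute this way; the model's weights never change.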
Happy to announce that lakeFS Mount is officially released! 🗻🎉 I really believe this is a game changer for AI and deep learning workloads. With lakeFS Mount, you can transparently mount a lakeFS reference as a local directory (yes, even at petabyte scale), while avoiding the common pitfalls typically associated with accessing an object store as a filesystem:

1. Metadata operations (listing directories, getting file info) don't require a server roundtrip: we rely on lakeFS' metadata being immutable and pre-fetch the raw metadata files, so these operations complete locally in microseconds.
2. Content-addressable caching and pre-fetching: because objects are immutable, we can cache them based on their lakeFS identity without having to worry about consistency and invalidation. This also allows pre-filling that cache based on custom rules, giving NVMe access speed for objects likely to be accessed.

This allows best-in-class I/O performance while remaining extremely easy to use: just read from a local directory, no integration or custom SDKs required (a minimal sketch follows this post). All this while ensuring full reproducibility and enjoying the benefits of version control such as branching, rolling back, tagging and more.

Read more about the optimizations done to achieve this on the lakeFS blog: https://lnkd.in/dsmBq7hq
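The "just read from a local directory" claim is the key ergonomic point: once a lakeFS reference is mounted, training or analysis code only needs standard filesystem calls. Here is a minimal sketch using only the Python standard library; the mount path `/mnt/lakefs/my-repo/main` and the `datasets/images` layout are hypothetical examples, not paths the product prescribes.

```python
from pathlib import Path

# Hypothetical mount point for a lakeFS repository/reference.
MOUNT = Path("/mnt/lakefs/my-repo/main")

# Plain filesystem calls -- no lakeFS SDK or object-store client involved.
image_files = sorted(MOUNT.glob("datasets/images/*.jpg"))
print(f"found {len(image_files)} images")

for path in image_files[:3]:
    data = path.read_bytes()   # served from the local cache when the object is hot
    print(path.name, len(data), "bytes")
```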
🎉 Qiskit Machine Learning 0.8.0 is out! Upgrade your environments now. https://lnkd.in/e562xSRS

We're announcing a major release, packed with new features for quantum machine learning workflows on simulators and IBM Quantum hardware. Here's a rundown of what's new in this release.

🚀 Features and improvements
1️⃣ Qiskit Machine Learning now supports version-2 (V2) Qiskit primitives! Make full use of the latest features in the Qiskit ecosystem and, most importantly, run ISA-compatible jobs on quantum hardware.
2️⃣ The new Quantum Bayesian Inference module can sample (or query) a Bayesian network with causal links using a quantum circuit representation. This method is based on work by Hao Low, Yoder, and Chuang (https://lnkd.in/eP5DeDRC). Thanks to Peter Roeseler for the implementation and the contribution.
3️⃣ Key components of Qiskit Algorithms are now part of Qiskit Machine Learning, consolidating a back-to-back infrastructure with gradients, optimisers and state fidelities.

🪲 Bug fixes and other changes
➡️ Introduced Python 3.12 support and removed Python 3.8 support because of its end of life.
➡️ Qiskit 1.0 or higher is now a hard requirement. To upgrade, refer to the Qiskit migration guide and access the latest features, faster circuit synthesis (via Rust) and a lower memory footprint. More info at https://lnkd.in/ed3iexxS
➡️ Primitives V1 are now deprecated and will be supported until version 0.9, after which they will be removed from Qiskit Machine Learning. Primitives V2 are progressively becoming the default in 0.8.x releases. To upgrade from V1 to V2, find more info at https://lnkd.in/eNF4TPDy

💡 New to quantum machine learning? Check out the tutorials (https://lnkd.in/eExuDgXX) and documentation (https://lnkd.in/eYs6_v73) to see if this library suits your task. Join the Qiskit Community on Slack (https://qisk.it/join-slack, find the #qiskit-machine-learning channel) and share your feedback!

🙌 Thanks to Oscar Wallis and M. Emre Şahin for the hard work on the (infamous) https://lnkd.in/edgXCk55 pull request. Finally, thanks to the new co-authors for the v0.8.0 release: @MoritzWillmann, @espREssOOOHHH, @OkuyanBoga, @oscar-wallis, @FrancescaSchiav, @edoaltamura, and @proeseler; and Anton Dekusar, Steve Wood, and Declan Millar for the advice.

STFC Hartree Centre | Stefano Mensa, PhD MBCS | IBM | IBM Research
#SoftwareUpdate #MachineLearning #QuantumComputing #machinelearning
GitHub - qiskit-community/qiskit-machine-learning: Quantum Machine Learning
github.com
A little bit technical in this post. To learn or not to learn (machine learning).

Machine learning is a lazy approach to solving a problem. Because:
- if you can compute the problem, compute
- if you can't compute, you learn from data and take a guess at the unlearned situations

So machine learning is simply a bunch of guessing methods applied to sets of known, solved problems. Nothing magical. That laziness comes with a cost: lots of data, computational power, time, and human resources. It seems like an easy way to solve any problem, and yes, it can work that way. Back to basics: if you can compute, compute; if you can't compute, you learn and mimic. However bluntly this perspective describes AI and ML, it's true.

Real computing techniques (and sure, they're hard): differential equations. Everything can be described by an equation. How it changes, over time or with respect to one or more other variables, can be described with differentiation (derivatives).

90% of programmers who couldn't solve a large-scale problem will try to use a tensor-like algorithm just because it's new and trending. Guess what? Math was trending in the past too; "#Most #Problems #Of #The #World #Have #Been #Solved". Researchers right now are trying to solve them again in different ways; that is researching and experimenting. Research papers should be treated like an exercise, a reference; not goals. And they are certainly not usable as-is for a practical implementation in a real tech product, especially with machine learning these days.

Notes:
- this post is a byproduct of my own research journey, with the purpose of producing an applicable solution
- the real solution we ended up referring to first appeared in 2006 (application to the specific domain) and in the 18xx's (math model)
- everything else is simply a variation of the 2006 solution; they made dozens of variations in 202x too, and will tomorrow as well. Noise!

Startups, be aware of the "research papers wind".

Bottom line:
- 10% of the programmers who couldn't solve choose to upgrade their skills instead
- Example: there are people who choose to use simple math (y = ax + b) to generate money (Warren Buffett), and there are people who use advanced math (Jim Simons) -> Everything is solvable; methods don't matter.

Who do you want to become? Just make the choice. See that I'm not judging, I'm illustrating. Cheers.
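As a small illustration of the "if you can compute, compute" point: fitting y = ax + b does not need any learning loop, because the least-squares coefficients have a closed-form solution. A minimal sketch with NumPy, using made-up sample data:

```python
import numpy as np

# Made-up observations of a roughly linear relationship y ~ a*x + b.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

# Closed-form least squares: solve [x 1] @ [a, b] = y directly, no training loop.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"a = {a:.3f}, b = {b:.3f}")  # computed directly, not "learned"
```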
A New Strategy for Reducing Latency with Deep Learning in a Fog Computing Environment

The data generated by connected objects is becoming increasingly abundant and often cyclical. Fog computing (FC) has emerged as an attractive solution to bring processing closer to the edge, meet latency requirements, and manage the growing demand for data. However, the network congestion produced by connected devices increases latency and energy consumption. In addition, managing similar processes across fog nodes is difficult: some processes evolve rapidly into complicated, heterogeneous, and dynamic structures.

Reducing latency, bandwidth usage, and energy consumption are issues that can be addressed by neural networks. Indeed, deep learning can offer fast, reliable processing on huge quantities of data, so integrating deep learning into a fog environment is an interesting direction. We therefore propose a new strategy that selects the best fog node within a given zone by leveraging a deep-learning-based LSTM model (BRFC-LSTM) and metrics such as data size, bandwidth, and the number of layers in the node. A minimal sketch of such a scoring model follows this post.

For more information: https://lnkd.in/d2upBA8Z
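The paper's BRFC-LSTM model itself is not reproduced here; the following is only a hedged PyTorch sketch of the general idea described above: feed a short history of per-node metrics (data size, bandwidth, layer count) through an LSTM, score each candidate fog node, and route to the highest-scoring one. Feature choices, dimensions, and the random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FogNodeScorer(nn.Module):
    """Toy LSTM that turns a history of node metrics into a single suitability score."""
    def __init__(self, n_features: int = 3, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, metrics_seq: torch.Tensor) -> torch.Tensor:
        # metrics_seq: (n_nodes, timesteps, n_features), e.g. [data size, bandwidth, layers]
        _, (h_n, _) = self.lstm(metrics_seq)
        return self.head(h_n[-1]).squeeze(-1)      # one score per candidate node

# Illustrative input: 5 candidate nodes, 10 timesteps of 3 normalized metrics.
history = torch.rand(5, 10, 3)
scores = FogNodeScorer()(history)
best_node = int(torch.argmax(scores))
print(f"scores={scores.tolist()} -> route workload to node {best_node}")
```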
(PDF) A New Strategy for Reducing Latency with Deep Learning in Fog Computing Environment
researchgate.net
Monitor your Deep Learning models in production, 𝘂𝘀𝗶𝗻𝗴 𝘁𝗵𝗲𝘀𝗲 𝟭𝟭 𝗸𝗲𝘆 𝗺𝗲𝘁𝗿𝗶𝗰𝘀!

Even before deploying a software system, developers require accurate insights on how it will perform under load, both locally and at scale. I’ve encountered this myself as I’ve tested and deployed a Computer Vision solution, only to see that the regular processing time jumped from 𝘂𝗻𝗱𝗲𝗿 𝟱 𝗵𝗼𝘂𝗿𝘀 to process 24 hours of video, 𝘁𝗼 𝟴 𝗵𝗼𝘂𝗿𝘀 in the pre-production/validation environment.

𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝘁𝗵𝗲 𝘁𝗼𝗽 𝟭𝟭 𝗺𝗲𝘁𝗿𝗶𝗰𝘀 𝘆𝗼𝘂 𝘀𝗵𝗼𝘂𝗹𝗱 𝗸𝗲𝗲𝗽 𝗮𝗻 𝗲𝘆𝗲 𝗼𝗻:
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗿𝗲𝗾𝘂𝗲𝘀𝘁_𝘀𝘂𝗰𝗰𝗲𝘀𝘀 Tracks successful inference requests, to monitor server health and identify bottlenecks.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗿𝗲𝗾𝘂𝗲𝘀𝘁_𝗳𝗮𝗶𝗹𝘂𝗿𝗲 Counts failed inference requests to help quickly troubleshoot issues.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗰𝗼𝘂𝗻𝘁 Measures the total inferences processed, indicating server workload and throughput.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗲𝘅𝗲𝗰_𝗰𝗼𝘂𝗻𝘁 Reveals the demand on specific models, aiding in resource optimization.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗿𝗲𝗾𝘂𝗲𝘀𝘁_𝗱𝘂𝗿𝗮𝘁𝗶𝗼𝗻_𝘂𝘀 Monitors inference request completion time, crucial for meeting latency requirements.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗾𝘂𝗲𝘂𝗲_𝗱𝘂𝗿𝗮𝘁𝗶𝗼𝗻_𝘂𝘀 Identifies bottlenecks by tracking request queue times.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗰𝗼𝗺𝗽𝘂𝘁𝗲_𝗱𝘂𝗿𝗮𝘁𝗶𝗼𝗻_𝘂𝘀 Provides insights into processing efficiency and potential optimizations.
→ 𝗻𝘃_𝗴𝗽𝘂_𝘂𝘁𝗶𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 Shows how effectively GPU resources are utilized, crucial for scaling.
→ 𝗻𝘃_𝗴𝗽𝘂_𝗺𝗲𝗺𝗼𝗿𝘆_𝘁𝗼𝘁𝗮𝗹_𝗯𝘆𝘁𝗲𝘀 and 𝗻𝘃_𝗴𝗽𝘂_𝗺𝗲𝗺𝗼𝗿𝘆_𝘂𝘀𝗲𝗱_𝗯𝘆𝘁𝗲𝘀 Track total and used GPU memory to manage memory resources.
→ 𝗻𝘃_𝗲𝗻𝗲𝗿𝗴𝘆_𝗰𝗼𝗻𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻 Provides stats on GPU energy consumption.
→ 𝗻𝘃_𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲_𝗹𝗼𝗮𝗱_𝗿𝗮𝘁𝗶𝗼 Offers insights into load distribution, helping with efficient resource use and load balancing.

Using these metrics, building a Grafana dashboard is easy and covers the key insights needed to monitor your models' performance over time under different loads.

𝗖𝗮𝘁𝗰𝗵 𝘂𝗽:
→ When using NVIDIA Triton Inference Server as the model-serving framework, setting up a dashboard to monitor your models is quite straightforward.
→ Monitoring is crucial in ML.
→ Triton comes with a Prometheus endpoint out of the box (a minimal scraping sketch follows this post).

If you’ve found it helpful, consider joining our 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴𝗠𝗟 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿 where I’ll publish a 101 tutorial on how to build this monitoring pipeline from the ground up.
↳ 🔗 https://shorturl.at/tuFN0

𝗥𝗶𝗻𝗴 𝘁𝗵𝗲 𝗯𝗲𝗹𝗹 🔔 for more 𝙥𝙧𝙖𝙘𝙩𝙞𝙘𝙖𝙡 𝙜𝙪𝙞𝙙𝙚𝙨 and 𝙗𝙞𝙩𝙨 𝙤𝙛 𝙖𝙙𝙫𝙞𝙘𝙚 in the #ML #ComputerVision #MLOps and #GenerativeAI fields.
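Because Triton exposes these counters on a Prometheus-format text endpoint, a first sanity check doesn't even need Grafana: scrape the endpoint and filter the metrics you care about. The sketch below assumes the metrics endpoint is reachable at `http://localhost:8002/metrics` (adjust host/port to your deployment) and does only plain text parsing with the standard library.

```python
import urllib.request

# Assumed Triton metrics endpoint; change host/port for your deployment.
METRICS_URL = "http://localhost:8002/metrics"

WATCHED_PREFIXES = (
    "nv_inference_request_success",
    "nv_inference_request_failure",
    "nv_inference_queue_duration_us",
    "nv_gpu_utilization",
)

def scrape(url: str = METRICS_URL) -> dict:
    """Return the raw value string for each watched metric line (Prometheus text format)."""
    body = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")
    samples = {}
    for line in body.splitlines():
        if line.startswith(WATCHED_PREFIXES):       # skips '# HELP'/'# TYPE' comments
            name, _, value = line.rpartition(" ")   # "<name>{labels} <value>"
            samples[name] = value
    return samples

if __name__ == "__main__":
    for name, value in scrape().items():
        print(f"{name} = {value}")
```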
CS Students @ College of Charleston: reach out to Nicolas Strojny and Joshua Hoover to learn more about Red Hat Academy, and read more about Red Hat Research below. :) AI/ML research is driving performance and efficiency gains and building better developer tools. "Red Hat Research and its university partners focus strategically on projects with the most promise to shape the future of how we use technology. Each quarter, RHRQ will publish an overview of our research in a specific area, such as edge computing, hybrid cloud, and security. In this issue, we focus on projects related to artificial intelligence and machine learning."
At Red Hat Research, work in #AI and machine learning is driving performance and efficiency gains and building better developer tools. Find out how we leverage our university partnerships to identify projects with the most promise to shape the future of how we use technology and learn about projects and milestones planned for 2024. https://red.ht/4cHqjEH
Focus on artificial intelligence and machine learning - Red Hat Research
https://research.redhat.com
A key area of focus at Red Hat Research is artificial intelligence (AI) and machine learning (ML). This includes building and optimizing systems that can run AI workloads, using AI/ML techniques to optimize systems software, and collaborating with the edge and hybrid cloud teams at Red Hat Research to use AI for their use cases. We collaborate closely with our university partners as well as Red Hat engineering. Click below to read this article by Sanjay Arora, who leads the AI agenda for Red Hat Research and is mainly interested in the application of machine learning to low-level systems. A broad theme we focus on is using AI/ML to learn heuristics and policies that lead to more optimal systems. Optimality here refers to a performance metric, such as tail latency, throughput, or resource consumption, with energy consumption being especially important. #AI #ML #Research
At Red Hat Research, work in #AI and machine learning is driving performance and efficiency gains and building better developer tools. Find out how we leverage our university partnerships to identify projects with the most promise to shape the future of how we use technology and learn about projects and milestones planned for 2024. https://red.ht/4cHqjEH
Focus on artificial intelligence and machine learning - Red Hat Research
https://research.redhat.com
Quantum Machine Learning in Finance 2

The primary goal of Machine Learning (ML) is to identify patterns within data without explicit instructions from a human operator. The only input provided by the external agent is the data itself. This data serves two purposes: discovering the underlying patterns, and testing and refining those patterns. In the financial industry, data is naturally collected, such as customer-related information, and standard ML algorithms are commonly used to analyze it. In this and the upcoming posts, we will discuss several examples and explore how quantum computing can enhance the performance of these algorithms. I will aim to keep the discussion accessible, using only elementary mathematical concepts.

To begin, let's recall the two main types of ML algorithms in use today:

1. SUPERVISED LEARNING (SL)
Supervised Learning is by far the most widely used ML method. It involves providing the machine (a classical computer) with a dataset consisting of data points {x} and their corresponding values {y}. The objective is for the machine to discover patterns between these two datasets. When new data, which we denote by {x'}, is introduced, the algorithm predicts the corresponding values {y'} based on the discovered pattern. The "learning process" occurs when a human supervisor compares the predicted values to the actual values. Using a reward/punishment system, the algorithm refines its pattern over time to improve accuracy.

2. UNSUPERVISED LEARNING (UL)
In Unsupervised Learning, the algorithm is not given two datasets {x} and the corresponding {y}. Instead, it is only provided with a set of data {x}. The goal here is for the algorithm to discover hidden patterns within this dataset. The human agent then evaluates the algorithm's predictions to improve future outputs.

For simplicity, let us focus on SUPERVISED LEARNING, specifically on Classification within Supervised Learning.

CLASSIFICATION
The purpose of a classification algorithm is to categorize the data points in {x} based on their corresponding values in {y}. If there are more than two categories (N > 2), we refer to this as MULTI-CLASS CLASSIFICATION: the data points {x} will belong to N subsets {(x, y₁)}, {(x, y₂)}, ..., {(x, y_N)}. If there are only two categories (N = 2), it is called BINARY CLASSIFICATION. In this case, the two classes are {(x, y₀)} and {(x, y₁)}, which can also be expressed as {(x, 0)} and {(x, 1)} or, equivalently, {(x, -)} and {(x, +)}.

Consider binary classification. If neighboring points in {x} are associated with the same class (either "-" or "+"), then the distance between points seems a reasonable measure for classification. Therefore, if a new point x' is surrounded by points of one of these classes, it is likely to belong to that class (a toy nearest-neighbor sketch follows this post). As we will discuss in the next post, ambiguity arises when the new point is not clearly within a defined group.

#finance #machinelearning #quantumcomputing #ai #quantum
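To ground the binary-classification discussion, here is a tiny distance-based (nearest-neighbor) classifier in plain Python, following exactly the "label a new point by the class of its closest neighbors" intuition above. The sample points and the choice of k = 3 are made-up illustrations, and this is the classical version only; the quantum-enhanced variants are the subject of the upcoming posts.

```python
from collections import Counter

# Toy labelled data {(x, y)}: 1-D feature, binary labels "-" / "+".
training_data = [(0.5, "-"), (1.0, "-"), (1.4, "-"), (4.2, "+"), (4.8, "+"), (5.1, "+")]

def knn_classify(x_new: float, data, k: int = 3) -> str:
    """Label x_new by majority vote among its k nearest neighbors (by distance)."""
    neighbors = sorted(data, key=lambda pair: abs(pair[0] - x_new))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify(1.2, training_data))   # clearly in the "-" cluster
print(knn_classify(4.9, training_data))   # clearly in the "+" cluster
print(knn_classify(2.8, training_data))   # ambiguous region between the two classes
```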
I don't fully grasp all the technical subtleties between having a large model and a small one, but generally speaking, in the age of AI, we shouldn't even be asking this question, which implies an opposition or a competition between the two types of models. We should have understood by now that all types of models matter and, above all, complement each other. The best choice depends on each situation, each project, and each need. The future belongs to the diversity and multiplicity of AIs. Looking for the single best model for every situation is, I find, a logic from another era! lol And a model that offers a middle ground really has its place too; Meta is broadening the diversity and the range of possible applications by choosing this path. They have understood certain needs and are answering them perfectly. It is precisely this logic of answering application needs that should guide engineers.