Compressing LLMs With Quantum-Inspired Software https://lnkd.in/gfTxqiYr

Large language models are inefficient, period. That's apparent at AWS re:Invent this week, where inference is a hot topic and conversations center on how to make the most of LLMs given the cost of training and the energy consumption required.

Multiverse Computing, a company participating in the AWS Generative AI Accelerator, has developed ways to compress LLMs using quantum-inspired software. Based in San Sebastián, Spain, the company accelerates computing with quantum-inspired tensor networks, said founder and CEO Enrique Lizaso Olmos in an interview before AWS re:Invent.

Tensor networks are powerful mathematical structures using "methods that attempt to use a classical computer to simulate the behavior of a quantum computer, thus making the classical machine operate algorithms that benefit from the laws of quantum mechanics that benefit real quantum computers," according to a post by Pedro L e S Lopes on quantum-inspired computing and how it compares to quantum computing.

Consider the cost and energy needed to train models and perform inference. According to the company's own published research, Multiverse's compression techniques reduce the memory size of LLaMA-2 7B by 93% and its parameter count by 70%, while speeding up training by 50% and inference by 25%, with an accuracy drop of only 2% to 3%.

Multiverse, Lizaso said, works with many companies that have already tried LLMs but have found them expensive to deploy. The problem: LLMs need to be more efficient. They scale in parameters, but accuracy only improves linearly, while costs climb with the additional compute. Buying GPUs is costly, and renting them from a cloud services provider can be just as costly or even more so.

Multiverse started working with Bosch, a German engineering and technology company that wanted help with an on-premises AI system to reduce defects, Lizaso said.

"So we applied our tensor networks," Lizaso said. "We developed a completely new set of algorithms for machine learning. Well, it worked quite well. So we applied those same systems to finance and defense and so on. But at some point, and that was in 2023, we asked ourselves, can we just prepare a better system, a compressed system of large language models?"

What's the Future of Compression?

When the age of quantum computing arrives, the compression will be sped up, so almost anything will have some form of embedded intelligence, thanks to quantum computing's ability to analyze vast amounts of data far beyond what is possible with classical computing methods. Unlike a classical computer, which processes information in the binary sense of 1s and 0s, a quantum computer uses a quantum mechanical property called superposition, explained Kimberly Mok in a previous post on The New Stack. It's a bit mind-boggling, but in essence, information gets processed as either or...
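Multiverse's actual method relies on quantum-inspired tensor networks (for example, matrix product operator decompositions of weight matrices), which is more involved than anything that fits in a post. As a rough, hedged illustration of the underlying idea of reporting compression numbers like those above, here is a toy sketch that replaces a dense weight matrix with a truncated low-rank factorization and measures the parameter savings and reconstruction error. The matrix size, rank, and `compress` helper are all illustrative, not the company's algorithm.

```python
# Toy sketch: replace a dense "LLM layer" weight matrix with a truncated SVD
# factorization. Real tensor-network compression decomposes weights into
# matrix product operators; this only illustrates the trade-off being measured
# (parameter reduction vs. reconstruction error).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))  # stand-in for one dense weight matrix

def compress(W: np.ndarray, rank: int):
    """Return low-rank factors (A, B) such that A @ B approximates W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (1024, rank)
    B = Vt[:rank, :]             # shape (rank, 1024)
    return A, B

A, B = compress(W, rank=128)
reduction = 1 - (A.size + B.size) / W.size
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"parameter reduction: {reduction:.1%}, relative reconstruction error: {error:.3f}")
```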
More Relevant Posts
-
Google AI Announces Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters

Researchers from UC Berkeley and Google DeepMind propose an adaptive "compute-optimal" strategy for scaling test-time computation in LLMs. This approach selects the most effective way to use additional computation based on the specific prompt and its difficulty. Using a measure of question difficulty from the base LLM's perspective, the researchers can predict the efficacy of test-time computation and implement this compute-optimal strategy in practice. This adaptive allocation of test-time compute significantly improves scaling performance, surpassing best-of-N baselines while using approximately 4 times less computation for both revision and search methods. The researchers then compare the effectiveness of their improved test-time compute scaling strategy against the alternative of pretraining larger models.

The use of additional test-time computation in LLMs can be viewed through a unified perspective of adaptively modifying the model's predicted distribution at test time. This modification can be achieved through two main approaches: altering the proposal distribution and optimizing the verifier. To improve the proposal distribution, researchers have explored methods such as RL-inspired finetuning (e.g., STaR, ReSTEM) and self-critique techniques. These approaches enable the model to enhance its own outputs at test time by iteratively critiquing and revising its initial responses. Finetuning models on on-policy data with best-of-N guided improvements has shown promise on complex reasoning tasks.

Read our full take on this: https://lnkd.in/gB3Vzxrq
Paper: https://lnkd.in/gbckaUuR
Google DeepMind Charlie Snell Jaehoon Lee
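To make the baseline concrete, here is a minimal sketch of verifier-based best-of-N sampling, the strategy the paper's compute-optimal allocation improves on, plus a toy difficulty-based budget rule in the same spirit. `generate` and `verifier_score` are hypothetical stand-ins for a base LLM sampler and a learned verifier, not the paper's actual components.

```python
# Sketch of best-of-N sampling with a verifier, and a toy per-question budget rule.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the one the verifier rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: verifier_score(prompt, answer))

def adaptive_budget(difficulty: float, max_n: int = 32) -> int:
    """Toy compute-optimal flavor: spend more samples on questions the base model
    estimates as harder (difficulty in [0, 1]), fewer on easy ones."""
    return max(1, int(round(difficulty * max_n)))
```

The paper's adaptive strategy goes further than this sketch, also choosing between sequential revisions and verifier-guided search depending on the estimated difficulty.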
-
Evaluating Large Language Models (LLMs) is essential to ensure they meet the specific needs of diverse use cases while maintaining responsible AI practices. In this blog, we delve into how Amazon Web Services (AWS) Bedrock and FMEval facilitate the evaluation of Anthropic Claude 3 models. These tools provide a comprehensive framework for assessing model performance, helping data scientists and ML engineers to streamline their evaluation processes. The ability to conduct large-scale evaluations with built-in and custom algorithms makes FMEval an invaluable resource for those working with LLMs.
Evaluate Anthropic Claude 3 Models with AWS Bedrock & FMEval Custom Model Runner - Info Services
https://blogs.infoservices.com
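As a rough illustration of what a custom model runner has to do, here is a hedged sketch of the Bedrock call such a runner would wrap for Claude 3 Sonnet, returning the `(completion_text, log_probability)` pair FMEval expects from `ModelRunner.predict`. The model ID and request body follow the Anthropic Messages API on Bedrock; the exact FMEval class wiring is covered in the blog post above and should be checked against the fmeval package docs.

```python
# Sketch of the Bedrock invocation a custom FMEval model runner would wrap.
import json
from typing import Optional, Tuple

import boto3

_client = boto3.client("bedrock-runtime")
_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # Claude 3 Sonnet on Bedrock

def predict(prompt: str) -> Tuple[Optional[str], Optional[float]]:
    """Invoke Claude 3 on Bedrock and return (completion_text, log_probability)."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    response = _client.invoke_model(modelId=_MODEL_ID, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    # InvokeModel does not return token log probabilities, so the second element is None.
    return payload["content"][0]["text"], None
```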
-
AI Compute is the New Oil: Accessible for Everyone with Dedication and Passion

In the early 20th century, oil was the lifeblood of the global economy. Now a new resource is driving innovation and growth: artificial intelligence (AI) compute.

The Rise of AI Compute
Artificial intelligence is no longer a futuristic concept; it is an integral part of our daily lives, and it is reshaping our world. At the heart of these AI systems lies a powerful computational infrastructure: AI compute. AI compute refers to the processing power and resources required to run complex machine learning models and algorithms. This includes specialized hardware like GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), as well as cloud-based platforms.

Democratisation of AI Compute
Today, anyone with an internet connection can access powerful AI compute resources.

Cloud Platforms: Equal Opportunity for All
Cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Oracle Cloud Infrastructure (OCI), IBM Cloud, and Microsoft Azure offer AI compute resources on a pay-as-you-go basis. This model allows individuals, startups, and small businesses to leverage the same computational power as tech giants, without the need for significant upfront investment.

Open-Source Ecosystem: Building Together
The open-source community has played a pivotal role in making AI compute accessible. Libraries and frameworks such as TensorFlow, PyTorch, and Scikit-learn provide the tools needed to develop sophisticated AI models. These resources are freely available, enabling anyone with the willingness to learn and experiment to dive into AI.

Dedication and Passion: The Key Ingredients
Here are some key steps for individuals and organizations looking to harness the power of AI compute:
1. Continuous Learning: AI is a rapidly evolving field. Staying updated with the latest advancements, techniques, and tools is crucial. Online courses, tutorials, and academic papers are valuable resources for continuous learning.
2. Experimentation and Innovation: Experimentation is at the heart of AI development.
3. Collaboration and Community Engagement: The AI community is vast and collaborative, and contributing to open-source projects can accelerate learning and foster innovation.
4. Ethical Considerations: As AI technology advances, it is essential to consider the ethical implications of its use and to ensure that AI systems are fair, transparent, and unbiased.

Case Studies: Success Stories
A. Startups Leveraging Cloud AI: Several startups have successfully leveraged cloud AI compute to disrupt industries, for example in healthcare.
B. Academic Institutions and Research: Academic institutions are utilizing AI compute to advance research.
C. Individual Innovators: Individual innovators and hobbyists are also making remarkable contributions by exploring AI.

The Future of AI Compute
The future of AI compute looks promising; it truly is the new oil.

Conclusion
AI compute is indeed the new oil, powering the engines of modern innovation and transformation.
-
💡 Did you know that the secret sauce behind serving today's massive Large Language Models efficiently lies in their ability to make intelligent "speculations"? Excited to share my latest blog post with Syl Taylor on the AWS Machine Learning Blog, Faster LLMs with speculative decoding and AWS Inferentia2, which explores how speculative decoding can make LLM inference more compute-efficient and cost-effective on AWS Inferentia2, boosting throughput and reducing output latency. Speculative decoding is a game-changer, helping even the largest LLMs serve responses faster. In this blog, we show how to benefit from this powerful feature for accelerating LLM inference on AWS. It's a testament to the continuous innovation happening at AWS to support the evolving needs of our customers in the AI/ML space. Great collaboration with co-author Syl Taylor, contributors Amulya Ballakur, Shruti Koparkar, Pinak Panigrahi, Michele Ricciardi, and kudos to our technical leaders Gadi Hutt, Kamran Khan for their guidance. Definitely more to come in this space! #AWSInferentia #MachineLearning #AIChips #GenerativeAI https://lnkd.in/e8A4nqFM
Faster LLMs with speculative decoding and AWS Inferentia2 | Amazon Web Services
aws.amazon.com
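For readers new to the idea, here is a toy, greedy-verification sketch of the speculative decoding loop: a small draft model cheaply proposes a few tokens, and the large target model checks them, keeping the agreeing prefix. `draft_next` and `target_next` are hypothetical next-token functions; the production Inferentia2 implementation described in the post uses batched verification and probabilistic acceptance to preserve the target model's output distribution, which this simplified check does not capture.

```python
# Toy greedy speculative decoding: draft proposes `lookahead` tokens, target verifies.
from typing import Callable, List

def speculative_decode(prompt_tokens: List[int],
                       draft_next: Callable[[List[int]], int],
                       target_next: Callable[[List[int]], int],
                       max_new_tokens: int = 64,
                       lookahead: int = 4) -> List[int]:
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # 1) The small draft model cheaply speculates a few tokens ahead.
        draft: List[int] = []
        for _ in range(lookahead):
            draft.append(draft_next(tokens + draft))
        # 2) The large target model verifies each speculated position
        #    (in practice a single batched forward pass, which is where the speed-up comes from).
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # Accept the agreeing prefix plus the target model's correction.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)  # all speculated tokens accepted
    return tokens[: len(prompt_tokens) + max_new_tokens]
```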
-
🗺️🧭 𝐖𝐡𝐚𝐭'𝐬 𝐭𝐡𝐞 𝐛𝐞𝐬𝐭 𝐋𝐋𝐌 𝐢𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐦𝐞?

If you are starting with language models in 2024, you can quickly be overwhelmed by the different inference solutions available, so I'm trying to create a simple map of this vast landscape. And guess what? 😉 Most of these solutions are supported by the Haystack LLM framework.
--> Guide: https://lnkd.in/dvyghEys

🔒 𝐏𝐫𝐨𝐩𝐫𝐢𝐞𝐭𝐚𝐫𝐲 𝐌𝐨𝐝𝐞𝐥𝐬
This is a quick way to start exploring and building. You pay per consumed token, and your data leaves your machine.
- Providers: OpenAI, Azure, Google, Anthropic, Cohere, Mistral...
- Amazon Bedrock provides support for several proprietary models under a unified platform.

📦 𝐎𝐩𝐞𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 𝐡𝐨𝐬𝐭𝐞𝐝 𝐛𝐲 𝐩𝐫𝐨𝐯𝐢𝐝𝐞𝐫𝐬
You use models with public weights, but the hosting is done by the providers. Beware of the license: ⚠️ not all open models are suitable for commercial use.

👥 Shared hosted models
An instance of the model is shared with other users. You pay per consumed token.
- HF Inference API for experimentation
- Several cloud providers are OpenAI compatible, so switching to open models is very easy (see the client sketch after this post). Examples: Anyscale, Deep Infra, Fireworks, Lemonfox.ai, OctoAI, Together AI...

👤 Privately hosted models
A private instance of the model is deployed by the provider. You typically pay per hour. More expensive, but with dedicated resources.
- Amazon SageMaker
- HF Inference Endpoints

🏠 𝐎𝐩𝐞𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 𝐎𝐧-𝐏𝐫𝐞𝐦𝐢𝐬𝐞
You host open models on your own machine/infrastructure. Ideal for local experimentation and for production scenarios where data privacy is important and you have GPU resources.

🧪 Local experimentation
- If you have a GPU (Colab): the HF Transformers library. Some quantization options (bitsandbytes, GPTQ, AWQ) are available to reduce consumed resources.
- CPU (+GPU if available): Llama.cpp, Ollama... With these libraries, LLMs can run on standard machines. The GGUF quantized format is used.

⚙️⚡️ Serving LLMs in production
Ample GPU resources are needed. The following solutions use innovative techniques for fast inference and efficient handling of numerous concurrent requests.
- vLLM
- HF TGI (permissive but not fully open source license)

---
💬 This map is based on my personal experience and research. I know it's not comprehensive. Have I missed significant solutions? Let me know in the comments!
#llm #largelanguagemodels #haystack #nlp #genai
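Because so many hosted and self-hosted backends (vLLM, Ollama, Together AI, Anyscale, and others) expose an OpenAI-compatible API, switching between them is often just a `base_url` change. The snippet below is a minimal sketch of that pattern; the URL, API key, and model name are placeholders, not real endpoints.

```python
# Point the standard OpenAI client at any OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # e.g. a local vLLM server; swap for a provider URL
    api_key="EMPTY",                      # many self-hosted servers ignore the key
)

response = client.chat.completions.create(
    model="my-open-model",                # placeholder model name
    messages=[{"role": "user", "content": "Summarize what speculative decoding does."}],
)
print(response.choices[0].message.content)
```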
-
On March 4, 2024, AWS announced that access to the most powerful Anthropic models on Amazon Bedrock begins today, with industry-leading Claude 3 Sonnet now available and the full family of models coming soon. The cutting-edge Claude 3 family of multimodal models, including Opus (coming soon), Sonnet (now available), and Haiku (coming soon), sets new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision, while also improving on accuracy and speed. In addition, the new state-of-the-art models provide even more flexibility, so customers can choose the exact combination of intelligence, speed, and cost their business needs to process and analyze data at scale.

This announcement is a key next step in advancing Amazon Bedrock as the easiest way to build and scale generative AI applications with large language models (LLMs) and other foundation models (FMs). It also deepens Anthropic's collaboration with AWS, continuing to make its LLMs accessible to millions of AWS customers through Amazon Bedrock. We are excited to continue this collaboration by making Anthropic's industry-leading Claude 3 family of models available on Amazon Bedrock. You can learn more about what customers are building with Claude 3 models on Amazon Bedrock in the AWS ML blog. https://lnkd.in/e9v2Az8g
Unlocking Innovation: AWS and Anthropic push the boundaries of generative AI together | Amazon Web Services
aws.amazon.com
-
AWS AI/ML Accelerators: A Deep Dive: Accelerated Computing

Subscribe and watch a detailed video on YouTube: https://lnkd.in/dNDf6rna

AI is advancing at an incredible pace, powering technologies that were once the stuff of science fiction. From personal assistants that understand and respond to natural language commands to self-driving cars that navigate complex environments with precision, AI is reshaping how we live and work. But as remarkable as these advancements are, they come with their own set of challenges.

Modern AI and machine learning (ML) models are becoming increasingly complex, requiring massive computational resources to function effectively. Training these models can take days or even weeks, leading to delays and significantly increasing costs. And the challenges don't stop there. Deploying these models for real-time tasks, like responding to user queries or making predictions from live data, demands immense processing power to ensure lightning-fast responses and smooth performance.

Enter AI/ML accelerators, a game-changer in the field of artificial intelligence. Specifically designed for AI workloads, these purpose-built accelerators provide the raw computational power and efficiency needed to train and deploy sophisticated models more quickly and cost-effectively. By bridging the gap between the growing demands of AI and the limitations of traditional hardware, these accelerators are driving a new era of innovation. In this blog, we'll dive into the world of AWS AI/ML accelerators, exploring their architecture, core components, and how they're enabling organizations to push the boundaries of what AI can achieve.

Types of AI/ML Accelerators
There are three main types of hardware commonly used:
- GPUs are highly parallel processors initially designed to handle the complex calculations required for graphics rendering. However, their parallel processing capabilities have made them highly adaptable for AI/ML workloads, where large amounts of data need to be processed simultaneously.
- FPGAs offer a unique advantage: unlike GPUs or CPUs, the hardware logic within an FPGA can be reprogrammed to tailor the chip to specific AI/ML algorithms. This allows for greater flexibility and potentially higher efficiency for specific tasks.
- ASICs take specialisation to the next level. These chips are custom-designed for specific AI workloads, such as image recognition or natural language processing. AWS Trainium and Inferentia are prime examples of ASICs tailored for AI/ML tasks. Their architecture is optimised to maximise performance and efficiency for their intended purpose.

Annapurna Labs and the Genesis of AWS AI/ML Chips
The story of AWS's AI/ML chips begins with Annapurna Labs. This innovative company, founded… #genai #generativeai #ai
AWS AI/ML Accelerators: A Deep Dive
generativeai.pub
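For a sense of what targeting these ASICs looks like in practice, here is a minimal sketch of compiling a small PyTorch model with the AWS Neuron SDK's torch-neuronx tracing flow. It assumes an Inferentia/Trainium (inf2/trn1) instance with the Neuron SDK installed; exact APIs and versions should be verified against the AWS Neuron documentation rather than taken from this sketch.

```python
# Sketch: ahead-of-time compile a PyTorch model for AWS Inferentia/Trainium.
import torch
import torch_neuronx  # AWS Neuron SDK PyTorch integration (assumed installed)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)

# Trace/compile the model into a Neuron-optimized graph for the accelerator.
neuron_model = torch_neuronx.trace(model, example_input)

# The traced model behaves like a TorchScript module and can be saved and reloaded.
torch.jit.save(neuron_model, "model_neuron.pt")
print(neuron_model(example_input).shape)
```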
-
TensorOpera Unveils Fox Foundation Model: A Unique Step in Small Language Models Enhancing Scalability and Efficiency for Cloud and Edge Computing
https://ift.tt/JFQEjlp

TensorOpera has announced the launch of its groundbreaking small language model, Fox-1, through an official press release. This innovative model represents a significant step forward in small language models (SLMs), setting new benchmarks for scalability and performance in generative AI, particularly for cloud and edge computing applications.

Fox-1-1.6B boasts a 1.6 billion parameter architecture, distinguishing it from other SLMs due to its superior performance and efficiency. The model has been meticulously designed to cater to the needs of developers and enterprises aiming for scalable and efficient AI deployment, and TensorOpera reports that it surpasses similar models from industry giants such as Apple, Google, and Alibaba.

A key feature of Fox-1 is its integration into TensorOpera's AI and FedML platforms. This integration facilitates the deployment, training, and creation of AI applications across various platforms and devices, ranging from high-powered GPUs in the cloud to edge devices like smartphones and AI-enabled PCs. This versatility underscores TensorOpera's commitment to providing a scalable, generative AI platform that enhances ownership and efficiency across diverse computing environments.

SLMs, including Fox-1, offer several advantages over larger language models (LLMs). They are designed to operate with significantly reduced latency and require less computational power, making them ideal for environments with limited resources. This efficiency translates into faster data processing and lower costs, which is critical for deploying AI in various settings, from mobile devices to server-constrained environments.

Fox-1 is particularly noteworthy for its incorporation into composite AI architectures like Mixture of Experts (MoE) and model federation systems. These configurations leverage multiple SLMs working together to create more powerful systems capable of handling complex tasks such as multilingual processing and predictive analytics from various data sources.

Fox-1's architecture is a decoder-only transformer-based model with 1.6 billion parameters, trained on a comprehensive dataset comprising 3 trillion tokens of text and code data. The model's design includes Grouped Query Attention (GQA), enhancing its query processing efficiency and significantly improving inference latency and response times (see the sketch after this post). This advanced architectural design allows Fox-1 to outperform competitors on standard benchmarks, demonstrating its robustness and capability.

Performance evaluations reveal that Fox-1 excels in various benchmarks, including ARC Challenge, HellaSwag, TruthfulQA, MMLU, Winogrande, and GSM8k. It consistently outperforms models like Gemma-2B, Qwen1.5-1.8B, StableLM-2-1.6B, and OpenELM1.1B, showcasing its superior performance despite having fewer para...
https://www.marktechpost.com
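Grouped Query Attention, credited above for Fox-1's inference efficiency, lets many query heads share a smaller set of key/value heads, shrinking the KV cache that dominates decoding memory. The toy sketch below illustrates the mechanism only; the head counts, dimensions, and random projections are illustrative, not Fox-1's actual configuration.

```python
# Toy grouped-query attention: 16 query heads share 4 key/value heads.
import torch
import torch.nn.functional as F

def grouped_query_attention(x: torch.Tensor,
                            n_q_heads: int = 16,
                            n_kv_heads: int = 4,
                            head_dim: int = 64) -> torch.Tensor:
    batch, seq, _ = x.shape
    group = n_q_heads // n_kv_heads  # query heads per shared key/value head

    # In a real model these come from learned nn.Linear projections of x;
    # random tensors stand in here to keep the sketch short.
    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)   # fewer KV heads -> smaller KV cache
    v = torch.randn(batch, n_kv_heads, seq, head_dim)

    # Broadcast each key/value head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)               # (batch, n_q_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v                # (batch, n_q_heads, seq, head_dim)

out = grouped_query_attention(torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 16, 8, 64])
```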
-
AI agents can perceive their environment using large language models to make smart decisions and take action. Check out this tutorial for building a research assistance agent using MongoDB, Fireworks AI, and LangChain. https://lnkd.in/dPi-mKVs
Building an AI Agent With Memory Using MongoDB, Fireworks AI, and LangChain | MongoDB
mongodb.com
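As a taste of the memory piece of such an agent, here is a hedged sketch that persists conversation turns in MongoDB via LangChain's chat-message-history integration and feeds them to a Fireworks AI chat model. The package import paths, connection string, and model identifier are assumptions/placeholders; the linked tutorial covers the full agent, including tools and retrieval.

```python
# Sketch: conversation memory in MongoDB + a Fireworks AI chat model via LangChain.
from langchain_community.chat_message_histories import MongoDBChatMessageHistory
from langchain_fireworks import ChatFireworks  # assumed package for Fireworks AI models

history = MongoDBChatMessageHistory(
    connection_string="mongodb+srv://<user>:<password>@cluster0.example.mongodb.net",  # placeholder
    session_id="research-session-001",
    database_name="agent_memory",
    collection_name="chat_history",
)

llm = ChatFireworks(model="accounts/fireworks/models/firefunction-v1")  # placeholder model ID

question = "Find recent papers on speculative decoding."
history.add_user_message(question)

# Give the model the stored conversation so earlier turns inform the next answer.
reply = llm.invoke(history.messages)
history.add_ai_message(reply.content)
print(reply.content)
```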
-
💡 Do you think that LLMs are not very useful for time series forecasting tasks? Think again! Amazon Science researchers now demonstrate that tokenizing time series data and treating it like a language enables a model whose zero-shot performance matches or exceeds that of purpose-built models. Meet Chronos, a family of pretrained time series models based on language model architectures. Similar to large language models, Chronos is a foundation model which learns from large datasets how to produce general representations useful for a wide range of tasks. The key insight behind Chronos is treating time series data as a language to be modeled by off-the-shelf transformer architectures. In a comprehensive evaluation involving 42 datasets, Chronos significantly outperformed classical statistical methods, as well as specialized deep-learning models, on data held out from its training sets. You can read all about this in the blog post below ⬇ #TimeSeries #TimeSeriesForecasting #LanguageModels #MachineLearning #ArtificialIntelligence #GenerativeAI #Amazon #AWS
Adapting language model architectures for time series forecasting
amazon.science
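The core idea the post describes, treating a real-valued series as a sequence of discrete tokens, can be sketched with simple scaling and quantization. The bin count, value range, and mean-absolute scaling below are simplified stand-ins for what the Chronos paper actually uses, so treat this as an illustration of the tokenization step only, not the model.

```python
# Toy time-series tokenizer in the spirit of Chronos: scale, then bin into integer tokens.
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 1024,
                    low: float = -5.0, high: float = 5.0):
    """Scale a series by its mean absolute value, then quantize into integer tokens."""
    scale = float(np.mean(np.abs(values))) or 1.0
    scaled = values / scale
    edges = np.linspace(low, high, n_bins - 1)   # uniform bin edges
    tokens = np.digitize(scaled, edges)          # integers in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens: np.ndarray, scale: float, n_bins: int = 1024,
               low: float = -5.0, high: float = 5.0):
    """Map tokens back to approximate real values via bin centers."""
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale

series = np.sin(np.linspace(0, 8 * np.pi, 200)) * 10 + 3
tokens, scale = tokenize_series(series)
reconstructed = detokenize(tokens, scale)
print(tokens[:10], float(np.max(np.abs(series - reconstructed))))
```

Once the series is a token sequence, an off-the-shelf transformer can be trained on it exactly as a language model is trained on text, which is the key insight behind Chronos.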