Ziaul Kamal’s Post

Compressing LLMs With Quantum-Inspired Software https://lnkd.in/gfTxqiYr

Large language models are inefficient, period. That's apparent at AWS re:Invent this week. Inference is a hot topic, and conversations center on how to make the most of LLMs given the cost of training and the energy consumption required.

Multiverse Computing, a company participating in the AWS Generative AI Accelerator, has developed ways to compress LLMs using quantum-inspired software. Based in San Sebastián, Spain, the company accelerates computing with quantum-inspired tensor networks, said founder and CEO Enrique Lizaso Olmos in an interview before AWS re:Invent.

Tensor networks are powerful mathematical structures using "methods that attempt to use a classical computer to simulate the behavior of a quantum computer, thus making the classical machine operate algorithms that benefit from the laws of quantum mechanics that benefit real quantum computers," according to a post by Pedro L. e S. Lopes on quantum-inspired computing and how it compares to quantum computing.

Consider the cost and energy needed to train models and perform inference. Multiverse compresses LLMs with techniques that, according to the company's own published research, reduce the memory size of LLaMA-2 7B by 93% and its number of parameters by 70%, while speeding up training by 50% and inference by 25%. The accuracy drop is only 2% to 3%.

Multiverse, Lizaso said, works with many companies that have already tried LLMs but have found them expensive to deploy. The problem: LLMs need to be more efficient. They scale in parameters, but accuracy improves only linearly, and costs rise as more computing is used. Buying GPUs is costly, and renting them from a cloud services provider is just as costly or even more so.

Multiverse started working with Bosch, a German engineering and technology company that wanted help with an on-premises AI system to reduce defects, Lizaso said.

"So we applied our tensor networks," Lizaso said. "We developed a completely new set of algorithms for machine learning. Well, it worked quite well. So we applied those same systems to finance and defense and so on. But at some point, and that was in 2023, we asked ourselves, can we just prepare a better system, a compressed system of large language models?"

What's the Future of Compression?

When we reach the age of quantum computing, the compression will be sped up, so almost anything will have some form of embedded intelligence, thanks to quantum computing's ability to analyze vast amounts of data far beyond what is possible with classical computing methods. Unlike a classical computer, which processes information in the binary sense of 1s and 0s, a quantum computer uses a quantum mechanical property called superposition, explained Kimberly Mok in a previous post on The New Stack. It's a bit mind-boggling, but in essence, information gets processed as either or...
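The article describes the compression only at a high level, and Multiverse's actual algorithms aren't public. Purely as an illustration of the general idea, here is a minimal sketch of quantum-inspired tensor-network compression of a single weight matrix: a tensor-train (matrix product operator) factorization built from truncated SVDs, written with numpy. The layer size, mode shapes, and rank cap below are made-up demo values, not figures taken from Multiverse or the LLaMA results quoted above.

```python
# Minimal sketch (illustrative only, not Multiverse Computing's method): compress a
# dense weight matrix with a tensor-train / matrix-product-operator factorization
# built from truncated SVDs. Requires only numpy.
import numpy as np


def tt_compress(weight, in_dims, out_dims, max_rank):
    """Factor a (prod(in_dims) x prod(out_dims)) matrix into a chain of small cores."""
    d = len(in_dims)
    # Reshape the matrix into a 2d-order tensor and interleave input/output modes,
    # so that core k carries the (in_k, out_k) index pair.
    tensor = weight.reshape(*in_dims, *out_dims)
    tensor = tensor.transpose([ax for k in range(d) for ax in (k, d + k)])

    cores, rank = [], 1
    remaining = tensor.reshape(1, -1)
    for k in range(d - 1):
        mode = in_dims[k] * out_dims[k]
        u, s, vt = np.linalg.svd(remaining.reshape(rank * mode, -1), full_matrices=False)
        new_rank = min(max_rank, len(s))              # truncation is the compression step
        cores.append(u[:, :new_rank].reshape(rank, in_dims[k], out_dims[k], new_rank))
        remaining = np.diag(s[:new_rank]) @ vt[:new_rank]
        rank = new_rank
    cores.append(remaining.reshape(rank, in_dims[-1], out_dims[-1], 1))
    return cores


def tt_reconstruct(cores, in_dims, out_dims):
    """Contract the cores back into a dense matrix (only used to measure the error)."""
    d = len(cores)
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=1)   # contract the shared bond index
    result = result.reshape([ax for k in range(d) for ax in (in_dims[k], out_dims[k])])
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    return result.transpose(perm).reshape(int(np.prod(in_dims)), int(np.prod(out_dims)))


# Toy example: a 1024 x 1024 "weight matrix". A random matrix has no structure to
# exploit, so the error below is pessimistic; trained LLM weights are far more redundant.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
in_dims, out_dims = (4, 4, 8, 8), (4, 4, 8, 8)
cores = tt_compress(W, in_dims, out_dims, max_rank=32)

n_dense = W.size
n_tt = sum(core.size for core in cores)
rel_err = np.linalg.norm(W - tt_reconstruct(cores, in_dims, out_dims)) / np.linalg.norm(W)
print(f"parameters: {n_dense} -> {n_tt} ({100 * (1 - n_tt / n_dense):.1f}% fewer)")
print(f"relative reconstruction error: {rel_err:.3f}")
```

On this toy matrix the cores hold roughly 90% fewer parameters than the dense layer, but the reconstruction error is large because random weights have nothing to exploit; trained LLM weights are far more redundant, and practical tensor-network compression is typically followed by a brief retraining pass, which is how a small accuracy drop like the 2% to 3% cited above becomes achievable.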

