Shrink Your Embeddings: Slashing Costs with MRL and BQL
Let's face it: vector embeddings are fantastic for many tasks, but if you've ever worked with large-scale vector search, you know the pain of watching your storage and compute costs skyrocket. What if I told you there's a way to put your embeddings on a diet without sacrificing much of their performance?
It's almost 2025, and I'm still seeing AI developers using chunky, fat float embeddings like it's 2020. It's like paying for a golden hammer when a plain one would drive the nail just as well: you're burning compute and storage for minimal accuracy gain.
The Fat Float Embedding Dilemma
Vector search is powering everything from recommendation systems to semantic search. And with that comes a tsunami of data and ballooning costs. So why aren't more developers jumping on the embedding compression bandwagon? My guess? They either haven't come across these techniques yet or think they're too complex to implement. And if you're still on the fence, let me show you what you're missing out on.
The power of compact embedding representations
Two techniques made waves in the world of embeddings and vector search in 2024:
Matryoshka Representation Learning (MRL)
Binary Quantization Learning (BQL)
These aren't just fancy acronyms – they're the path to slimmer, more efficient embedding vector representations. Let's break them down.
Matryoshka Representation Learning (MRL)
Think of MRL like those Russian nesting dolls (yes, that's where the name comes from). Instead of one fixed-size embedding, you get a hierarchy. Want the best possible accuracy? Use all the dimensions. Need something lighter for a small percentage drop in accuracy? Just grab the first 100 or so.
Key benefits:
Flexibility in choosing the number of dimensions
Distance-computation cost drops linearly with the number of dimensions kept
Retains most of the accuracy while using only a fraction of the dimensions
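To make the nesting-doll idea concrete, here is a minimal sketch in plain NumPy (illustrative only, not any particular library's API). It assumes the embedding model was trained with MRL, which is what makes the leading dimensions meaningful on their own:

```python
import numpy as np

def truncate_mrl(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL-trained embedding and
    re-normalize so dot-product / cosine similarity still behaves."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a real 1024-dimensional model output.
full = np.random.randn(1024).astype(np.float32)
small = truncate_mrl(full, 256)
print(small.shape)  # (256,) -- a quarter of the storage and compute per comparison
```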
Binary Quantization Learning (BQL)
If MRL is about selectively downsizing dimensions, BQL is about compressing the dimension values. It turns your float vectors into binary—we're talking just 0s and 1s.
Key benefits:
Massive reduction in storage requirements (32x compared to float vectors)
Significantly faster similarity computations using Hamming distance (roughly 20x faster than float dot products)
Enables efficient scaling of vector datasets
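Again, a minimal sketch in plain NumPy to show the mechanics (models aimed at binary quantization are trained so that the sign of each dimension carries most of the signal; the threshold at zero below is the simplest possible choice):

```python
import numpy as np

def binarize(embedding: np.ndarray) -> np.ndarray:
    """Quantize each dimension to a single bit (1 if positive, else 0),
    packed 8 bits per byte."""
    bits = (embedding > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Count the differing bits between two packed binary vectors."""
    return int(np.unpackbits(a ^ b).sum())

doc_vec = binarize(np.random.randn(1024).astype(np.float32))
query_vec = binarize(np.random.randn(1024).astype(np.float32))
print(doc_vec.nbytes)                        # 128 bytes instead of 4096
print(hamming_distance(doc_vec, query_vec))  # cheap XOR + popcount style comparison
```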
Combining MRL and BQL
Combine these two techniques, and you've got a powerhouse of efficiency. They compose naturally: MRL first truncates the vector to fewer dimensions, and BQL then reduces each remaining dimension to a single bit. MRL gives you flexibility in the number of dimensions; BQL gives you flexibility in the precision per dimension.
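Putting the two sketches together, under the same assumptions as above (a 1,024-dimensional MRL-trained model, illustrative NumPy only):

```python
import numpy as np

full = np.random.randn(1024).astype(np.float32)  # stand-in for a real model output

# MRL first: keep the leading 512 dimensions. BQL second: binarize and pack to bits.
truncated = full[:512] / np.linalg.norm(full[:512])
compact = np.packbits((truncated > 0).astype(np.uint8))
print(compact.nbytes)  # 64 bytes, down from 4096 for the full float32 vector
```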
Storage savings: A concrete example
Let's put some numbers to this. Imagine you're storing 1 billion 1024-dimensional vectors:
Full-precision float32: 1,024 dimensions × 4 bytes = 4,096 bytes per vector, roughly 4 TB in total
MRL + BQL: keep, say, 512 dimensions and binarize them to 512 bits = 64 bytes per vector, roughly 64 GB in total
That's right – you're looking at cost savings of up to 64x.
Search performance gains
These compact representations aren't just easier on your storage footprint; they're a boost for similarity search, too:
Hamming distance computations (used for binary vectors) are about 20 times faster than float dot products
This speed-up allows for higher query throughput or reduced CPU costs
In other words, you can cut distance-computation latency by a factor of 20, or get the same throughput with 20x less CPU, and save money either way.
Implementing in Vespa
Now, if you're using Vespa (and if you're not, you might want to consider it), you're in luck. Vespa supports both MRL and BQL in its native Hugging Face embedder. Here's a quick taste of what that looks like:
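What follows is a sketch rather than copy-paste configuration: the schema name, field names and model URLs are placeholders, and it assumes an MRL-trained embedding model exported to ONNX. The destination tensor type does the work here: an int8 cell type tells the embedder to binarize and bit-pack the output, and the number of cells controls how many of the model's dimensions are kept.

```xml
<!-- services.xml: placeholder model URLs, swap in your own MRL-trained model -->
<component id="my-embedder" type="hugging-face-embedder">
    <transformer-model url="https://huggingface.co/my-org/my-mrl-model/resolve/main/model.onnx"/>
    <tokenizer-model url="https://huggingface.co/my-org/my-mrl-model/resolve/main/tokenizer.json"/>
</component>
```

```
# schema (sketch): 8 int8 cells = 64 bits, i.e. a 64-dimensional binary embedding
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    field embedding type tensor<int8>(x[8]) {
        indexing: input text | embed | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
}
```

With distance-metric: hamming on the attribute, nearest-neighbor queries against this field use Hamming distance directly.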
This little snippet creates a 64-dimensional binary embedding, combining MRL and BQL. Read more in this long blog post on MRL and BQL.
Real-world impact
So, what does all this mean in the real world? By shrinking your fat float embeddings, you can:
Reduce storage costs dramatically
Increase query throughput and lower latency
Enable new use cases that were previously too expensive
Implement more complex ranking pipelines within latency constraints
You're not just slashing costs and improving performance; you're also opening up the possibility of embedding far more data.
Conclusion
Look, I get it. Changing your embedding strategy might seem like a hassle. But if you ask me, the cost reduction is worth it. By leveraging techniques like MRL and BQL, you're not just trimming the fat float but unlocking new use cases.
So go ahead and shrink those embeddings. Your systems (and your budget) will thank you. And hey, you might just be the one to show your team how to save a boatload of cash next year.