A Practical Guide to Benchmarking Search Systems

Early in my career, I overheard a senior engineer say that "we should deploy these systems well below the knee point of the hockey stick," and I didn't understand what he meant. Sure, I had benchmarked systems before, and I even thought I was decent at it. My approach was simple: point a benchmarking client at the system, run it with 100 concurrent clients, and get two numbers, QPS and latency. Then report and move on. If the QPS was too low or the latency too high, we'd optimize or throw hardware at it. It took years of experience (and guidance) to understand how to properly benchmark search systems.

Here's a practical guide to benchmarking search engines, without too much ceremony. It's a starter guide, but I repeatedly find myself explaining these basics, so why not write an article about it.

Getting Started

Once you've built your search system with the required functionality and ranking strategy in place, you'll want to benchmark it to understand its performance and hardware requirements. First, find a representative set of queries – understand how many terms users typically use when searching. Then, use an HTTP benchmarking client to generate load. During this process, monitor your servers (a minimal monitoring sketch follows the list), focusing on:

- CPU utilization

- Disk I/O

- Network I/O
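In practice you'll often watch these with vmstat, iostat, or your cloud provider's dashboards. As a minimal sketch, a server-side sampler using Python's third-party psutil package could look like this (the one-second interval and MB/s output are just illustrative choices):

```python
# Minimal server-side monitor: sample CPU, disk, and network counters once
# per second and print the deltas. Requires the third-party psutil package.
import time

import psutil

INTERVAL_S = 1.0

def monitor() -> None:
    prev_disk = psutil.disk_io_counters()
    prev_net = psutil.net_io_counters()
    while True:
        time.sleep(INTERVAL_S)
        cpu = psutil.cpu_percent()  # CPU utilization since the previous call
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        print(f"cpu={cpu:5.1f}%  "
              f"disk_read={(disk.read_bytes - prev_disk.read_bytes) / 1e6 / INTERVAL_S:7.1f} MB/s  "
              f"disk_write={(disk.write_bytes - prev_disk.write_bytes) / 1e6 / INTERVAL_S:7.1f} MB/s  "
              f"net_rx={(net.bytes_recv - prev_net.bytes_recv) / 1e6 / INTERVAL_S:7.1f} MB/s  "
              f"net_tx={(net.bytes_sent - prev_net.bytes_sent) / 1e6 / INTERVAL_S:7.1f} MB/s")
        prev_disk, prev_net = disk, net

if __name__ == "__main__":
    monitor()
```

Run something like this (or whatever tooling you prefer) on the search servers themselves while the benchmarking client generates load.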

The Single-Client Baseline

Start with no concurrency – a single user with the entire system to itself. If latency is already too high for your SLA, you'll need to cut features or add resources. This is also your opportunity to systematically test different features and queries to understand their performance impact. You can compare queries against each other; for example, longer queries that retrieve more documents may be slower. It's also useful to plot per-query latency against dimensions like the total number of hits or query length.
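A rough sketch of such a baseline run – assuming a hypothetical /search endpoint that takes a q parameter and returns the hit count in a totalHits field, so adjust both to your engine's actual API – is to replay queries one at a time and log what you need for those plots:

```python
# Single-client baseline: replay queries sequentially and record latency
# alongside query length and hit count for later plotting.
# Endpoint URL and the "totalHits" response field are assumptions.
import csv
import time

import requests

SEARCH_URL = "http://localhost:8080/search"  # hypothetical endpoint

def run_baseline(query_file: str, out_file: str) -> None:
    with open(query_file) as qf, open(out_file, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["query", "terms", "total_hits", "latency_ms"])
        for line in qf:
            query = line.strip()
            if not query:
                continue
            start = time.perf_counter()
            resp = requests.get(SEARCH_URL, params={"q": query}, timeout=10)
            latency_ms = (time.perf_counter() - start) * 1000
            hits = resp.json().get("totalHits", 0)  # field name is an assumption
            writer.writerow([query, len(query.split()), hits, f"{latency_ms:.1f}"])

if __name__ == "__main__":
    run_baseline("queries.txt", "baseline.csv")
```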

At this stage, examine utilization metrics. If any resource is already at 100%, you'll face challenges pushing higher throughput.

Increasing Concurrency

Next, simulate multiple users by increasing concurrency. Track latency and throughput as you add load. Continue until throughput plateaus and latency climbs sharply – you've found the knee of the hockey stick.
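The shape of such a sweep looks roughly like the sketch below. Dedicated load tools like wrk, k6, or vegeta do this better, but the measurement is the same: for each client count, run for a fixed duration and record throughput plus latency percentiles (the endpoint and query file are placeholders):

```python
# Concurrency sweep: for each client count, fire queries for a fixed
# duration and report throughput and latency percentiles.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SEARCH_URL = "http://localhost:8080/search"  # hypothetical endpoint
DURATION_S = 30

def worker(queries, deadline):
    latencies = []
    i = 0
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        requests.get(SEARCH_URL, params={"q": queries[i % len(queries)]}, timeout=10)
        latencies.append((time.perf_counter() - start) * 1000)
        i += 1
    return latencies

def sweep(queries, client_counts=(1, 2, 4, 8, 16, 32, 64)):
    for clients in client_counts:
        deadline = time.perf_counter() + DURATION_S
        with ThreadPoolExecutor(max_workers=clients) as pool:
            futures = [pool.submit(worker, queries, deadline) for _ in range(clients)]
            latencies = [lat for f in futures for lat in f.result()]
        qps = len(latencies) / DURATION_S
        p50 = statistics.median(latencies)
        p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
        print(f"clients={clients:3d}  qps={qps:8.1f}  p50={p50:7.1f} ms  p95={p95:7.1f} ms")

if __name__ == "__main__":
    with open("queries.txt") as f:
        sweep([q.strip() for q in f if q.strip()])
```

Plot QPS and p95 latency against client count; the knee is where QPS flattens out while p95 keeps climbing.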

This is what that senior engineer meant: know your system's concurrency and throughput limits. Deploy well below the knee, because a small increase in concurrency beyond it can make your system unresponsive.

Identify your bottleneck at this point. Is it I/O, CPU, memory, or network? If none show high utilization, look for software bottlenecks like insufficient threads or synchronization issues. Understanding these constraints helps you plan scaling strategies – whether to add servers, change instance types, or optimize code.

While you could push thousands of concurrent clients to find maximum throughput, the resulting latency numbers would be meaningless – they're dominated by queue time. This was my biggest mistake in those early days, before I understood how to benchmark search systems.
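The underlying reason is Little's law: in steady state, concurrency equals throughput times latency, so once throughput is pinned at its maximum, measured latency grows linearly with the number of clients, and almost all of it is time spent waiting in queue rather than searching. A back-of-the-envelope illustration with made-up numbers:

```python
# Illustrative only, with made-up numbers: past the knee, measured latency
# is dominated by queuing. Little's law: concurrency = throughput * latency,
# so latency is roughly clients / max_qps once the system is saturated.
max_qps = 500          # hypothetical saturated throughput of the system
service_time_ms = 40   # hypothetical per-query work at low load

for clients in (100, 500, 1000):           # all well past the knee here
    latency_ms = clients / max_qps * 1000  # expected average latency
    queue_ms = latency_ms - service_time_ms
    print(f"{clients:5d} clients -> ~{latency_ms:6.0f} ms average, "
          f"~{queue_ms:6.0f} ms of it queueing")
```

So use saturation runs only to find the throughput ceiling, and measure latency at the concurrency you actually expect in production – well below the knee.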

