Profiling PyTorch language models with octoml-profile
Easily identify the best model/hardware combination
The recent launch of PyTorch 2.0 makes it clear that the community is heavily investing in a compiler-powered future for machine learning. The new OctoML Profiler can help any user realize the full potential of these shifts in the ML landscape.
Leveraging the latest PyTorch 2.0 compiler technology, octoml-profile automatically offloads models to cloud devices to generate a ‘profile’ of your application’s models. With these insights, you can make data-driven decisions about how to trade off cost against speed and find the right balance for your application’s requirements.
Here’s an example:
With a candidate list of 5 language models for semantic search on 3 hardware targets (Intel Ice Lake CPU, Graviton3 CPU and NVIDIA T4 GPU), the cost to run a million inferences with batch size 256 can range from $5 to $78. Single-prediction inference latency also varies wildly, from 5ms to 38ms.
In this post we’ll dig into this example and show you how to assess the cost and compute usage of accelerated PyTorch models before they are deployed to the cloud. We’ll start with a typical use case for octoml-profile: finding the ideal combination of model, hardware and acceleration library for your application’s latency and inference cost requirements.
Follow along with the companion Semantic Search tutorial notebook.
What is octoml-profile?
OctoML Profiler is an open source (Apache 2.0 licensed) Python library and cloud service that simplifies the process of benchmarking PyTorch models with real inputs on remote hardware. By providing inference performance and hardware cost insights across model variants, acceleration libraries and hardware, octoml-profile gives you confidence in picking the ideal combination for deploying your model to the cloud.
Here’s how it works:
- First, sign up to generate an API token and make a few small code changes: add an @accelerate decorator to your predict function and wrap your calls in a remote_profile() context manager (see the sketch after this list).
- octoml-profile uses TorchDynamo to extract computation subgraphs from the model.
- The subgraph segments that can be compiled and accelerated are sent to remote GPU and CPU inference servers for execution and measurement.
- Model segments that cannot be compiled, such as pre/post processing or 3rd party APIs, are run locally and timed.
- Finally, the compiled and uncompiled segment run times are summed up for total inference time for each requested hardware backend and acceleration library combination.
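In code, those two changes look roughly like the following minimal sketch, based on the octoml-profile quickstart. The toy model and inputs are placeholders, and the exact imports and default backends may differ in your installed version; the companion notebook has the real setup.

```python
import torch
from octoml_profile import accelerate, remote_profile

# Stand-in model for illustration; substitute your real PyTorch model.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

@accelerate  # change 1: decorate the function that runs inference
def predict(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model(x)

# change 2: wrap your calls in remote_profile(); every call inside the block
# is split into subgraphs, offloaded to the remote backends, and timed.
with remote_profile():
    for _ in range(10):
        predict(torch.randn(1, 128))
```

When the remote_profile() block exits, octoml-profile prints a per-segment report for each requested backend, like the one shown below.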
For example, in the above diagram, the PyTorch model is composed of 5 subgraphs; each subgraph is a logical portion of the model. There are two subgraphs (yellow and blue) that can be compiled and accelerated, so they are sent to the remotely hosted cloud hardware for inference measurement. The pre-processing code (in green), post-processing code (red) and subgraph-3 (black) cannot be compiled, so they are run locally on the developer’s laptop and timed. The compiled subgraphs are run on multiple cloud GPUs and CPUs for profiling across multiple hardware targets.
The results of running the above example through OctoML Profiler with the NVIDIA T4 and Intel Ice Lake backends would look like this:
Profile 1/1 ran 2 times with 10 repeats per call:
  Segment                            Samples  Avg ms  Failures
===============================================================
  0  Uncompiled                            2   3.996
  1  Graph #1
       r6i.large/torch-eager-cpu          20  34.702         0
       g4dn.xlarge/torch-eager-cuda       20   5.088         0
       g4dn.xlarge/torch-inductor-cuda    20   3.221         0
       g4dn.xlarge/onnxrt-cuda            20   1.421         0
  2  Uncompiled                            2   0.164
  3  Graph #2
       r6i.large/torch-eager-cpu          20   0.026         0
       g4dn.xlarge/torch-eager-cuda       20   0.102         0
       g4dn.xlarge/torch-inductor-cuda    20   0.248         0
       g4dn.xlarge/onnxrt-cuda            20   0.053         0
  4  Uncompiled                            2   0.126
---------------------------------------------------------------
Total uncompiled code run time: 4.285 ms
Total times (compiled + uncompiled) and on-demand cost per million inferences per backend:
    r6i.large/torch-eager-cpu (Intel Ice Lake)      39.013 ms   $1.37
    g4dn.xlarge/torch-eager-cuda (Nvidia T4)         9.475 ms   $1.38
    g4dn.xlarge/torch-inductor-cuda (Nvidia T4)      7.754 ms   $1.13
    g4dn.xlarge/onnxrt-cuda (Nvidia T4)              5.759 ms   $0.84
Note that the “uncompiled” subgraphs above are precisely those which usually complicate attempts to profile model performance. This is because many profilers require model serialization, and the uncompiled segments above are challenging to serialize. With octoml-profile, you can profile smoothly without this friction.
Problem: Next Gen Search with Language Models
Follow along with this example in the sample notebook.
Imagine you’re part of a team tasked with upgrading your application’s search feature. Maybe your application exposes large corpora of text, anywhere from medical or legal records to tweets or emails. The existing search feature uses lexical matching, which only looks for literal matches of the query words in the documents. Your team is investigating an upgrade to semantic search to improve accuracy over lexical matching. Semantic search tries to capture the contextual meaning of the terms and decipher the searcher’s intent by looking for relatedness among terms. It starts by embedding all the text in the corpus into a vector space. When a search is run, the query is embedded into the same vector space, and the closest embeddings become the search results (a code sketch of this flow follows the example below).
Here’s an example:
Query: A man is eating pasta.
Result: Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)
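For concreteness, here is a minimal sketch of that embed-then-search flow using the open-source sentence-transformers library; the model name is illustrative, and any of the candidate models below plugs in the same way.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is illustrative; any of the candidate models works the same way.
model = SentenceTransformer("paraphrase-MiniLM-L3-v2")

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "A man is riding a horse.",
    "A man is riding a white horse on an enclosed ground.",
    "A cheetah is running behind its prey.",
]
# Embed the whole corpus into a shared vector space once, up front.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# At query time, embed the query into the same space and take the
# closest corpus embeddings as the search results.
query_embedding = model.encode("A man is eating pasta.", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(f"{corpus[hit['corpus_id']]} (Score: {hit['score']:.4f})")
```

The corpus is embedded once up front; at query time the model only needs to embed the query, and that embedding step is the inference work we profile below.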
The data scientists on your team experiment with several open-source language models and narrow the field to 5 candidate models that all meet the accuracy/quality bar for the application’s needs. By leveraging open-source models rather than closed-source APIs, you gain several advantages: you can customize or fine-tune the model to your specific needs without incurring exorbitant costs, you can deploy to any cloud or hardware type, and your customers’ data stays exclusively within your application.
The short list of capable models identified is:
Challenges with determining model and hardware combinations for application SLAs
The first challenge companies face after short-listing candidate models is determining which model is best for the task based on inference latency and cost to run at scale, and identifying the best hardware backend for each model. This is not an easy task. Pretrained model providers often attempt to provide performance information (the “speed” columns here), but they evaluate only a limited set of accelerators and hardware, and don’t provide accurate time and cost measurements.
Choosing a model always involves a tradeoff between large models that are slow and expensive but return the highest quality results, and small models that are fast and affordable but return lower quality results. Companies that naively choose the largest models will eventually have to backtrack to a smaller, faster model to get production inference costs under control once they gain a sizable number of users.
This is where octoml-profile comes in. octoml-profile makes it easy to iteratively profile the inference performance of combinations of model variants (different models and configurations such as batch size), hardware targets and acceleration libraries. At launch, OctoML Profiler supports multiple NVIDIA GPUs, an Intel Ice Lake CPU and a Graviton3 CPU. Notice that the on-demand cost of these hardware types ranges from about 10 cents to over $1 per hour.
In addition to exploring model-by-hardware combinations, octoml-profile also automatically applies acceleration libraries: TorchInductor, and ONNX Runtime with its CUDA, TensorRT and CPU execution providers.
How to use octoml-profile results to intelligently choose your hardware
Follow along in our companion notebook: add an @accelerate decorator to your inference function and wrap your calls in the remote_profile() context manager. Then we’ll profile all 5 language models on 3 of the available hardware targets: Intel Ice Lake CPU, Graviton3 CPU and NVIDIA T4 GPU.
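Here is a hypothetical sketch of that setup. The loop over candidate SentenceTransformer models and the batch size are straightforward, but the backend identifiers and the RemoteInferenceSession call are assumptions; check the companion notebook for the exact calls.

```python
import torch
from sentence_transformers import SentenceTransformer
from octoml_profile import accelerate, remote_profile, RemoteInferenceSession

# A few of the short-listed models, for illustration.
candidate_models = [
    "paraphrase-MiniLM-L3-v2",
    "all-MiniLM-L12-v2",
    "all-distilroberta-v1",
]

# Hardware targets: Intel Ice Lake, Graviton3 and NVIDIA T4. The backend
# identifiers below are assumed to match the instance names seen in the
# profile output earlier in this post.
backends = ["r6i.large", "r7g.large", "g4dn.xlarge"]

batch = ["A man is eating pasta."] * 256  # high-throughput case: batch size 256

for name in candidate_models:
    model = SentenceTransformer(name)

    @accelerate  # profile the embedding step of each candidate
    def embed(sentences):
        return model.encode(sentences, convert_to_tensor=True)

    # RemoteInferenceSession usage is assumed; see the companion notebook
    # for the exact session and backend setup.
    session = RemoteInferenceSession(backends)
    with remote_profile(session):
        for _ in range(2):
            embed(batch)
```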
When we print the data frame of results, we get 70 rows: the combination of 5 models, 3 hardware targets and up to 6 acceleration libraries per target adds up quickly!
We’ll use the Vega-Altair statistical visualization library to create graphs to help interpret these results.
First we’ll make a bar chart to identify the cheapest backend per model for a high-throughput (batch size 256) use case:
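The notebook builds this chart with a few lines of Altair. A hypothetical sketch of the same idea follows; the column names, rows and backend identifiers are placeholders (costs approximated from the takeaways below), and in practice the data frame comes straight from the profiling results.

```python
import altair as alt
import pandas as pd

# Placeholder rows with assumed column names; in the notebook this data frame
# comes directly from the profiling results.
df = pd.DataFrame(
    {
        "model": [
            "paraphrase-MiniLM-L3-v2",
            "paraphrase-MiniLM-L3-v2",
            "all-MiniLM-L12-v2",
            "all-MiniLM-L12-v2",
        ],
        "backend": [
            "r7g.large/onnxrt-cpu",
            "g4dn.xlarge/onnxrt-cuda",
            "g4dn.xlarge/onnxrt-cuda",
            "g4dn.xlarge/onnxrt-tensorrt",
        ],
        "cost_per_million": [5.0, 9.0, 15.0, 30.0],  # dollars, illustrative
    }
)

# Keep only the cheapest backend for each model, then draw the bar chart.
cheapest = df.loc[df.groupby("model")["cost_per_million"].idxmin()]

chart = (
    alt.Chart(cheapest)
    .mark_bar()
    .encode(
        x=alt.X("cost_per_million:Q", title="$ per million inferences (batch size 256)"),
        y=alt.Y("model:N", sort="-x", title=None),
        color=alt.Color("backend:N"),
    )
)
chart  # renders inline in a notebook; elsewhere, use chart.save("chart.html")
```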
Here are some takeaways from our hardware cost analysis for batch size 256 above:
- The dollar cost per million requests (x-axis) can range dramatically from about $5 for paraphrase-MiniLM-L3-v2 (purple) with r7g.large (Graviton3) to about $77 for all-distilroberta-v1 (green) with r6i.large (Ice Lake CPU).
- paraphrase-MiniLM-L3-v2 (purple) is generally the most affordable model to run across all hardware targets.
- Even on the same hardware, the cost of running a million requests can be 2x more if the wrong acceleration library is picked. Take a look at the all-MiniLM-L12-v2 (red) model’s first 4 bars for g4dn.xlarge (NVIDIA T4 GPU). ONNX Runtime w/ the CUDA EP can deliver a million inferences for about $15, while with TensorRT the costs increase to $30 for the same million inferences.
Next, we’ll create a similar bar chart for the fastest backend per model for 256 batch size:
Here are some takeaways from our fastest backend analysis for batch size 256 above:
- The Graviton3 CPU (r7g.large) is the slowest backend.
- The Intel Ice Lake CPU (r6i.large) is more than twice as fast as the Graviton3.
- The NVIDIA T4 GPU (g4dn.xlarge) is by far the fastest backend for all models.
Let’s dig into just the NVIDIA T4 GPU for all 5 models:
Here we see that the ONNX Runtime CUDA library delivers the fastest results for most models, except for all-distilroberta-v1, where TorchInductor with CUDA performs slightly better.
That was a lot of information! Finally, let’s plot inference time vs. cost to get a better sense of which model/hardware/library combination is the best choice for running production inference at high throughput (batch size 256):
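As before, here is a hypothetical Altair sketch of such a scatter plot, with assumed column names and entirely synthetic placeholder rows; the notebook produces the real version from the profiling data frame.

```python
import altair as alt
import pandas as pd

# Entirely synthetic placeholder rows with assumed column names; replace with
# the real profiling results from the notebook.
df = pd.DataFrame(
    {
        "model": ["model-a", "model-a", "model-b", "model-c"],
        "backend": ["backend-1", "backend-2", "backend-1", "backend-3"],
        "latency_ms": [60.0, 95.0, 180.0, 700.0],
        "cost_per_million": [9.0, 14.0, 22.0, 60.0],
    }
)

# One point per model/hardware/library combination: inference time vs. cost.
scatter = (
    alt.Chart(df)
    .mark_circle(size=120)
    .encode(
        x=alt.X("latency_ms:Q", title="Batch-256 inference time (ms)"),
        y=alt.Y("cost_per_million:Q", title="$ per million inferences"),
        color="model:N",
        tooltip=["model", "backend", "latency_ms", "cost_per_million"],
    )
)
scatter
```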
Let’s zoom into the ideal candidate range:
Now we can clearly identify the purple circle, which corresponds to paraphrase-MiniLM-L3-v2 on the NVIDIA T4 GPU (g4dn) accelerated with ONNX Runtime’s CUDA EP. This combination delivers 60 ms inference for a batch of 256 and costs about $9 for a million requests.
Let’s consider a different use case. Instead of high throughput with a batch size of 256, perhaps we want to optimize for fast single predictions with the lowest possible latency per user request. To do that, we re-run the above analysis with the batch size set to 1:
In this case, the fastest possible latencies range from 5 ms to 7 ms, and the cost per million requests ranges from 5 to 70 cents with the paraphrase-MiniLM-L3-v2 model. Your organization can choose to pay more for lower latency and deploy on the NVIDIA T4 using the ONNX Runtime CUDA EP. Alternatively, if cost matters more and you want the cheapest option possible, deploying on Graviton3 with the ONNX Runtime CPU backend is a good choice.
Profiling models for deployment is a necessary step to identify and reduce unnecessary bloat in cloud costs. The traditional approach to model profiling is a highly manual, uncertain process requiring weeks of engineering time: wrangling different acceleration libraries, hardware dependencies, and cloud provider headaches, not to mention the challenge of serializing models. The compiler-informed, modern approach is octoml-profile. Our tool provides a drastically simplified workflow, requiring only 3 lines of additional code for users to get insights quickly.