Technical Brief: Benchmarking Warehouse Workloads on the Data Lake using Presto

How to run a TPC-H benchmark on Presto
Summary
Presto is an open source distributed SQL query engine that runs on a cluster of
nodes. It is built for high-performance, scalable interactive querying with in-memory
execution.
A full deployment of Presto has a coordinator and multiple workers. Queries are submitted
to the coordinator by a client such as the command line interface (CLI), a BI tool, or a
notebook that supports SQL. The coordinator parses and analyzes the query, then creates
the optimal execution plan using metadata and data-distribution information. That plan is
then distributed to the workers for processing. The advantage of this decoupled storage
model is that Presto can provide a single view of all of your data that has been aggregated
into the data storage tier.
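As a concrete example of submitting a query to the coordinator, the Presto CLI can be used like this (a sketch; the server address and the tpch/sf1 catalog and schema are placeholders to adapt to your cluster):

```shell
# Submit a query to the coordinator; it plans the query and fans work out to the workers.
# localhost:8080 is a placeholder for your coordinator's address.
presto-cli --server localhost:8080 \
  --catalog tpch --schema sf1 \
  --execute "SELECT count(*) FROM lineitem"
```

The same query could equally be submitted from a BI tool or notebook over JDBC; the coordinator handles it identically.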
To facilitate benchmarking, we will use Benchto, an open source tool that is part of
the Presto project and can run the industry-standard TPC-H benchmark.
In this tutorial, we will show you how to use Benchto to benchmark Presto with TPC-H.
TPC-H
According to tpc.org, TPC-H is a decision-support benchmark. It consists of a suite of
business-oriented ad hoc queries and concurrent data modifications. The queries and the
data have been chosen for broad industry-wide relevance. The benchmark illustrates
decision-support systems that examine large volumes of data, execute queries with a high
degree of complexity, and give answers to critical business questions.
TPC-H comes with various data set sizes to test different scaling factors.
SF (GB)   Size
100       Consists of the base row size x 100 (several hundred million elements).
1000      Consists of the base row size x 1000 (several billion elements).
• The benchmarks can be run using predetermined database sizes, referred to as “scale factors”. Each scale
factor corresponds to the raw data size of the data warehouse.
• 6 of the 8 tables grow linearly with the scale factor and are populated with data that is uniformly
distributed.
• 22 complex, long-running query templates and 2 data-refresh processes (insert and delete) are run in
parallel to test concurrency.
• The number of concurrent processes increases with the scale factor – for example, for the 100 GB
benchmark you run 5 concurrent processes.
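To make the linear growth concrete, here is a small sketch. It assumes the commonly cited figure of roughly 6 million lineitem rows at SF=1; lineitem is one of the six linearly growing tables:

```shell
# Approximate lineitem row count for a given TPC-H scale factor.
# Assumes ~6,000,000 lineitem rows at SF=1 and linear growth with SF.
approx_lineitem_rows() {
  echo $(( 6000000 * $1 ))
}

approx_lineitem_rows 100    # several hundred million rows at SF=100
approx_lineitem_rows 1000   # several billion rows at SF=1000
```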
TPC-H Workload
Type   Features                                                  TPC-H Business Questions
A      Medium dimensionality; result is TPC-H scale              Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13,
       factor independent                                        Q14, Q16, Q19, Q22
B      High dimensionality; few results, many empty cells        Q15, Q18
C      High dimensionality; result is a % of the scale factor    Q2, Q9, Q10, Q11, Q17, Q20, Q21
Benchto tool
Benchto includes the following components:
• Benchto service, which acts as a persistent data store for benchmark results.
• Benchto driver, which runs the benchmark queries and reports results and metrics
to the service.
Benchto setup
Step 1:
Install and export JAVA 8 and JDK
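For example, on an Ubuntu host this could look as follows (a sketch; package names and the JAVA_HOME path vary by distribution and JDK vendor):

```shell
# Install a Java 8 JDK and export it for the build tools.
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should report version 1.8.x
```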
Step 2:
Download benchto from github:
https://github.com/prestodb/benchto.git
Step 3:
Build the benchto project:
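Steps 2 and 3 together can be sketched as follows (assuming git and Maven are installed; Benchto is a standard Maven project, and skipping tests speeds up the build):

```shell
# Clone and build the Benchto project.
git clone https://github.com/prestodb/benchto.git
cd benchto
mvn clean install -DskipTests
```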
Step 4:
Download prestodb from github https://github.com/prestodb/presto.git
Step 5:
Build the presto-benchto-benchmark:
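A sketch of the build, assuming the presto repository from Step 4 and using its bundled Maven wrapper (-pl limits the build to the benchmark module, -am also builds its dependencies):

```shell
cd presto
./mvnw clean install -DskipTests -pl presto-benchto-benchmarks -am
```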
Step 6:
Create docker image for benchto-service from benchto directory
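For example (a sketch; the Dockerfile location and the image tag are assumptions to check against your checkout of the benchto repository):

```shell
# Build the benchto-service image from the benchto directory.
cd benchto
docker build -t benchto-service -f benchto-service-docker/Dockerfile .
```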
Step 7:
Run the command below from the benchto/benchto-service-docker directory to bring up all the required
Docker instances for Graphite, Grafana, PostgreSQL, and the Benchto service:
docker-compose -f docker-compose.yml up
Step 8:
Configure Grafana dashboard
In order to view performance metrics in Grafana, you have to set up a data source (Graphite) and
a dashboard. We have created an example dashboard that shows CPU, network, and memory
performance from localhost.
Log in to Grafana (user: admin, password: admin)
Navigate to Data Sources->Add data source and add Graphite data source as follows:
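This step can also be scripted against Grafana's HTTP API (a sketch; the Graphite URL assumes the containers from Step 7 share a network on which the Graphite instance is reachable as graphite):

```shell
# Create a Graphite data source via Grafana's data source API.
curl -u admin:admin -H 'Content-Type: application/json' \
  -d '{"name": "graphite", "type": "graphite", "url": "http://graphite:80", "access": "proxy"}' \
  http://localhost:3000/api/datasources
```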
Step 9:
Generate the schemas for TPC-H using Superset or the Presto CLI.
Replace schema_name and bucket_name with the actual names you want to use.
Presto ships with a catalog named tpch by default. It has multiple schemas such as tiny, sf1,
sf100, and so on up to sf1000; as the number increases, so does the table size. Modify the
schema-creation queries based on how much data you need. This tutorial uses sf100, but you
can change it to suit your requirements.
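The schema generation can be sketched with the Presto CLI as follows (schema_name and bucket_name are placeholders, and the CTAS statement is repeated for each of the eight TPC-H tables):

```shell
# Create an S3-backed Hive schema, then copy TPC-H data into it at SF=100.
presto-cli --server <YOUR PRESTO CLUSTER> \
  --execute "CREATE SCHEMA IF NOT EXISTS hive.schema_name WITH (location = 's3://bucket_name/schema_name/')"
presto-cli --server <YOUR PRESTO CLUSTER> \
  --execute "CREATE TABLE hive.schema_name.lineitem AS SELECT * FROM tpch.sf100.lineitem"
# Repeat the CREATE TABLE ... AS SELECT for: orders, customer, part, partsupp,
# supplier, nation, region
```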
Step 10:
Create application-presto-devenv.yaml in the directory from which you plan to run the test,
typically the prestodb directory where you cloned prestodb/presto-benchto-benchmarks.
cat application-presto-devenv.yaml

benchmark-service:
  url: http://localhost:80

data-sources:
  presto:
    url: jdbc:presto://<YOUR PRESTO CLUSTER>/hive
    driver-class-name: com.facebook.presto.jdbc.PrestoDriver

environment:
  name: PRESTO-DEVENV

presto:
  url: http://<YOUR PRESTO CLUSTER>

benchmark:
  feature:
    presto:
      metrics.collection.enabled: true

macros:
  sleep-4s:
    command: echo "Sleeping for 4s" && sleep 4
Step 11:
Edit prestodb/presto-benchto-benchmarks/src/main/resources/benchmarks/presto/tpch.yaml
to set the schema name you used in Step 9.

cat prestodb/presto-benchto-benchmarks/src/main/resources/benchmarks/presto/tpch.yaml
datasource: presto
query-names: presto/tpch/${query}.sql
runs: 10
prewarm-runs: 2
before-execution: sleep-4s, presto/session_set_cbo_flags.sql
frequency: 10
database: hive
tpch_300: <your schema name>
The keywords in this file define the benchmark run: datasource (the data source named in
application-presto-devenv.yaml), query-names (the query template files to execute), runs
(the number of measured runs), prewarm-runs (warm-up runs excluded from the results),
before-execution (macros or SQL executed before each run), and frequency (how often the
benchmark should be re-run).
Step 12:
Set up the environment. Make sure to replace prestoURL below with your Presto cluster URL.
cat set_env.sh
curl -H 'Content-Type: application/json' -d '{
  "dashboardType": "grafana",
  "dashboardURL": "http://admin:admin@localhost:3000/dashboard/db/presto-devenv",
  "prestoURL": "https://prestohost:8080"
}' http://localhost:80/v1/environment/PRESTO-DEVENV
Step 13:
Run the benchmark test
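The driver invocation can be sketched like this (an assumption based on the presto-benchto-benchmarks layout; verify the exact jar name and flags against that module's README for your version):

```shell
# Run the TPC-H benchmark using the environment profile from Step 10.
java -Xmx1g -Dspring.profiles.active=presto-devenv \
  -jar presto-benchto-benchmarks/target/presto-benchto-benchmarks-*-executable.jar \
  --activeBenchmarks presto/tpch
```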
Step 14:
Check the results at localhost:80. The results page will show the following:
• The test name, which is the yaml file used in Step 11
• The Hive or Glue database name and schema used for this test
If you drill down on the test name, it provides more details on the run, such as duration, output and
input data size for the query, CPU time, and planning time.
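The aggregated rows below pair each metric's mean with its spread and extremes across the runs. A minimal sketch of that aggregation (an illustration, not Benchto's actual code):

```shell
# Aggregate per-run measurements into mean, stddev (absolute and % of mean),
# min, and max, mirroring the "Aggregated measurements" columns.
aggregate() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END {
      mean = sum / NR
      var = sumsq / NR - mean * mean
      if (var < 0) var = 0            # guard against float round-off
      sd = sqrt(var)
      pct = (mean != 0) ? sd / mean * 100 : 0
      printf "mean=%.3f stddev=%.3f (%.2f%%) min=%.3f max=%.3f\n", mean, sd, pct, min, max
    }'
}

# Three identical runs give zero spread, as in the planning-time row below.
aggregate 43.62 43.62 43.62
```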
Aggregated measurements:

Measurement                            Mean ± Std dev                   Min          Max
prestoQuery-processedInputDataSize     904.510 MB ± 0.000 B (0.00 %)    904.510 MB   904.510 MB
prestoQuery-totalPlanningTime          43.620 ms ± 0.000 ms (0.00 %)    43.620 ms    43.620 ms
prestoQuery-totalScheduledTime         2.900 m ± 0.000 ms (0.00 %)      2.900 m      2.900 m
Comparing the Results
Results from different runs can be compared in the UI by selecting the query and choosing two
runs. In the example below, for query q03 there are runs for a partitioned table with Ahana cache
and a partitioned table without Ahana cache. Doing so produces a page with a comparison chart.
Summary
Presto was designed for high-performance analytics. The TPC-H benchmarking results provide guidance
on what to expect when similar queries are run. Many tunables need to be configured to achieve the
best results, which typically requires in-depth knowledge of Presto (and can be quite daunting). In
such cases, a managed service like Ahana Cloud can help, because it takes care of caching,
configuration, and tuning parameters under the hood.
You can get started with a 14-day free trial of Ahana at https://app.ahana.cloud/signup
Learn more about Ahana Cloud for Presto, the only
SaaS for Presto on AWS
Ahana, the only SaaS for Presto, offers a managed service for Presto on
AWS with the vision to simplify open data lake analytics. Presto, the open source
project created by Facebook and used at Uber, Twitter, and thousands more, is the
de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the
easiest Presto SaaS and enables data platform teams to provide high-performance
SQL analytics on their S3 data lakes and other data sources. As a leading member
of the Presto community and the Linux Foundation's Presto Foundation, Ahana is also
focused on fostering growth and evangelizing open source Presto. Founded in 2020,
Ahana is headquartered in San Mateo, CA and operates as an all-remote company.
Investors include GV, Lux Capital, and Leslie Ventures.
©2022 Ahana Cloud, Inc. All rights reserved. The Ahana logo is an unregistered trademark of Ahana Cloud, Inc. All other
trademarks are property of their respective owners.