Benchmarking Warehouse Workloads On The Data Lake Using Presto


WHITE PAPER

Technical Brief: How to run a TPC-H benchmark on Presto

Summary
Presto is an open source distributed SQL query engine that runs on a cluster of nodes. It's built for high-performance, scalable interactive querying with in-memory execution.

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client such as the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses and analyzes each query and creates an optimal execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto can provide a single view of all of your data aggregated in the data storage tier.
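For example, a client can submit a query to the coordinator with the Presto CLI. A minimal sketch, assuming the CLI executable is installed as presto-cli and the coordinator is listening on localhost:8080:

presto-cli --server localhost:8080 --catalog tpch --schema tiny \
  --execute "SELECT count(*) FROM lineitem"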

Benchmarking is a critical component of any Presto evaluation, since it identifies the system resource requirements and usage during various operations, along with the latency of those operations. It's common practice to run a query on a small data set and draw conclusions about performance, but those conclusions may not hold when running at scale. It's also better to look at the overall performance of various operations at scale rather than the result of a single operation.

In order to facilitate benchmarking, we're going to use the open source tool Benchto. Part of the Presto project, it can run the industry-standard TPC-H benchmark.

In this tutorial, we will show you how to use Benchto to benchmark Presto with TPC-H.

TPC-H
According to tpc.org, TPC-H is a decision-support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data have been chosen for broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

TPC-H comes with various data set sizes to test different scaling factors.

SF (GB)   Size

1         Consists of the base row size (several million elements).
10        Consists of the base row size x 10.
100       Consists of the base row size x 100 (several hundred million elements).
1000      Consists of the base row size x 1000 (several billion elements).

The TPC-H specification is a 137-page document, but to summarize:


• The database is a 3rd Normal Form (3NF) schema consisting of 8 tables.

• The benchmarks can be run using predetermined database sizes, referred to as “scale factors”. Each scale
factor corresponds to the raw data size of the data warehouse.

• 6 of the 8 tables grow linearly with the scale factor and are populated with data that is uniformly
distributed.

• 22 complex, long-running query templates and 2 data refresh processes (insert and delete) are run in parallel to test concurrency.

• The number of concurrent processes increases with the scale factor – for example, for the 100 GB
benchmark you run 5 concurrent processes.
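To give a sense of the workload, here is one of the simpler templates, Q6 (forecasting revenue change), with example parameter values substituted for the template placeholders:

-- TPC-H Q6: revenue from discounted shipments over a given year
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '1994-01-01'
  AND l_shipdate < DATE '1994-01-01' + INTERVAL '1' YEAR
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;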

TPC-H Workload

Type   Features                                                TPC-H Business Questions

A      Medium dimensionality; result is TPC-H scale            Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13,
       factor independent                                      Q14, Q16, Q19, Q22

B      High dimensionality; few results, lots of empty cells   Q15, Q18

C      High dimensionality; result is a percentage of the      Q2, Q9, Q10, Q11, Q17, Q20, Q21
       scale factor

Benchto tool
Benchto includes the following components:
• Benchto service, which acts as a persistent data store for the benchmark results.
• Benchto driver, a Java-based application that loads the benchmark descriptor definitions and executes them on the Presto cluster. If cluster monitoring is in place, it also collects resource metrics such as CPU, memory, and network usage of the cluster.
• Cluster resource usage monitoring, which is done using Grafana/Graphite.

Benchto setup
Step 1:
Install JDK 8 and export JAVA_HOME
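For example, on a Linux host with OpenJDK 8 (the install path below is an assumption; adjust it for your system):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH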

Step 2:
Download Benchto from GitHub:
https://github.com/prestodb/benchto.git
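For example:

git clone https://github.com/prestodb/benchto.git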

Step 3:
Build the Benchto project from the benchto directory:

./mvnw clean install

Step 4:
Download Presto from GitHub: https://github.com/prestodb/presto.git
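For example:

git clone https://github.com/prestodb/presto.git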

Step 5:
Build the presto-benchto-benchmarks module:

./mvnw clean package -pl presto-benchto-benchmarks

Step 6:
Create the Docker image for benchto-service from the benchto directory

./mvnw package -pl benchto-service -P docker-images
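You can confirm that the image was built, for example:

docker images | grep benchto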

Step 7:
Run the command below from the benchto/benchto-service-docker directory to bring up the required Docker containers for Graphite, Grafana, PostgreSQL, and the Benchto service:

docker-compose -f docker-compose.yml up
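Once the stack is up, you can verify that the containers are running, for example:

docker ps --format 'table {{.Names}}\t{{.Status}}'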

Step 8:
Configure Grafana dashboard
In order to view performance metrics in Grafana, you have to set up a data source (Graphite) and a dashboard. We have created an example dashboard that shows CPU, network, and memory performance from localhost.
Log into Grafana at http://localhost:3000 (user: admin, password: admin).

Navigate to Data Sources -> Add data source and add the Graphite data source.

Navigate to Dashboard -> Home and import the dashboard from benchto/docs/getting-started/dashboard-grafana.json

You should now see the dashboard. Click the Save dashboard icon.

Step 9:
Generate schemas for TPC-H using Superset or presto-cli.
Replace schema_name and bucket_name below with the actual names you want to use.

-- SQL to create a new schema
CREATE SCHEMA IF NOT EXISTS schema_name WITH (location='s3a://bucket_name');

-- SQL to create new tables from TPC-H
CREATE TABLE schema_name.lineitem WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.lineitem;
CREATE TABLE schema_name.orders WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.orders;
CREATE TABLE schema_name.customer WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.customer;
CREATE TABLE schema_name.nation WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.nation;
CREATE TABLE schema_name.part WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.part;
CREATE TABLE schema_name.partsupp WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.partsupp;
CREATE TABLE schema_name.region WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.region;
CREATE TABLE schema_name.supplier WITH (format='PARQUET') AS SELECT * FROM tpch.sf100.supplier;

Presto has a catalog named tpch by default. It contains multiple schemas such as tiny, sf1, sf100, and so on up to sf1000; as the number increases, so does the table size. The queries above use sf100, but you can change that based on how much data you need.
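As a sanity check, you can compare row counts between your new tables and the source schema (the figure below is approximate; lineitem holds roughly 6 million rows per scale factor, so about 600 million at sf100):

-- Both counts should match (~600 million rows at sf100)
SELECT count(*) FROM schema_name.lineitem;
SELECT count(*) FROM tpch.sf100.lineitem;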

Step 10:
Create application-presto-devenv.yaml in the directory where you plan to run the test, typically the presto directory where you cloned prestodb/presto (the home of presto-benchto-benchmarks).

cat application-presto-devenv.yaml
benchmark-service:
  url: http://localhost:80

data-sources:
  presto:
    url: jdbc:presto://<YOUR PRESTO CLUSTER>/hive
    driver-class-name: com.facebook.presto.jdbc.PrestoDriver

environment:
  name: PRESTO-DEVENV

presto:
  url: http://<YOUR PRESTO CLUSTER>

benchmark:
  feature:
    presto:
      metrics.collection.enabled: true

macros:
  sleep-4s:
    command: echo "Sleeping for 4s" && sleep 4

Step 11:
Edit prestodb/presto-benchto-benchmarks/src/main/resources/benchmarks/presto/tpch.yaml to set the schema name you used in Step 9.

cat prestodb/presto-benchto-benchmarks/src/main/resources/benchmarks/presto/tpch.yaml
datasource: presto
query-names: presto/tpch/${query}.sql
runs: 10
prewarm-runs: 2
before-execution: sleep-4s, presto/session_set_cbo_flags.sql
frequency: 10
database: hive
tpch_300: <your schema name>

Here is a list of keywords that can be used to define the benchmark descriptors in the file above.

Keyword            Required   Default   Comment

datasource         True       none      Name of the data source in the application.yml file
query-names        True       none      Path to the queries
runs               False      3         Number of times each query is executed
prewarm-runs       False      0         Number of prewarm runs of each query before the benchmark
concurrency        False      1         Number of concurrent workers: 1 = sequential benchmark, >1 = concurrency benchmark
before-benchmark   False      none      Names of macros executed before the benchmark
after-benchmark    False      none      Names of macros executed after the benchmark
before-execution   False      none      Names of macros executed before each benchmark execution
after-execution    False      none      Names of macros executed after each benchmark execution
variables          False      false     Set of combinations of variables
quarantine         False      false     Flag that can be used to quarantine a benchmark via the --activeVariables property
frequency          False      none      How frequently the benchmark can be executed, in days: 1 = once per day, 7 = once per week
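For instance, a hypothetical concurrency variant of the descriptor from Step 11 might look like this (the values shown are illustrative, not from the Presto repository):

datasource: presto
query-names: presto/tpch/${query}.sql
runs: 10
prewarm-runs: 2
concurrency: 4        # four workers submit each query in parallel
database: hive
tpch_300: <your schema name>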

Step 12:
Set up the environment. Make sure to replace prestoURL below with your Presto cluster URL.

cat set_env.sh
curl -H 'Content-Type: application/json' -d '{
  "dashboardType": "grafana",
  "dashboardURL": "http://admin:admin@localhost:3000/dashboard/db/presto-devenv",
  "prestoURL": "https://prestohost:8080"
}' http://localhost:80/v1/environment/PRESTO-DEVENV

curl -H 'Content-Type: application/json' -d '{
  "name": "Short tag description",
  "description": "Very long but optional tag description"
}' http://localhost:80/v1/tag/PRESTO-DEVENV

Step 13:
Run the benchmark test

java -Xmx1g -jar presto-benchto-benchmarks/target/presto-benchto-benchmarks-*-executable.jar \
  --sql presto-benchto-benchmarks/src/main/resources/sql \
  --benchmarks presto-benchto-benchmarks/src/main/resources/benchmarks \
  --activeBenchmarks=presto/tpch --profile=presto-devenv \
  --overrides overrides.yaml
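The --overrides flag points to a small YAML file whose values take precedence over those in the benchmark descriptor. A minimal sketch, reusing keys from Step 11 (the values are illustrative):

runs: 5
frequency: 1
tpch_300: <your schema name>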

Step 14:
Check the results at http://localhost:80. The results page will show the following:
• Test name, which is the YAML file used in Step 11
• The Hive or Glue database name and schema used for this test
• Scale factors used for this test
• Join distribution type and strategy
• Time to complete the test

If you drill down on the test name, it provides more details on the run, such as duration, output and input data size for the query, CPU time, and planning time.

Aggregated measurements:

Name                                 Mean         StdDev                Min          Max

duration                             6.142 s      ± 0.000 ms (0.00 %)   6.142 s      6.142 s
prestoQuery-outputDataSize           175.000 B    ± 0.000 B (0.00 %)    175.000 B    175.000 B
prestoQuery-processedInputDataSize   904.510 MB   ± 0.000 B (0.00 %)    904.510 MB   904.510 MB
prestoQuery-rawInputDataSize         904.510 MB   ± 0.000 B (0.00 %)    904.510 MB   904.510 MB
prestoQuery-totalBlockedTime         35.860 m     ± 0.000 ms (0.00 %)   35.860 m     35.860 m
prestoQuery-totalCpuTime             1.830 m      ± 0.000 ms (0.00 %)   1.830 m      1.830 m
prestoQuery-totalPlanningTime        43.620 ms    ± 0.000 ms (0.00 %)   43.620 ms    43.620 ms
prestoQuery-totalScheduledTime       2.900 m      ± 0.000 ms (0.00 %)   2.900 m      2.900 m

Comparing the Results
Results from different runs can be compared in the UI by selecting the query and choosing two runs. In the example below, query q03 has runs for a partitioned table with Ahana cache and a partitioned table without Ahana cache.

Choose the Compare all option in the top right corner. Doing so opens a page with a comparison chart.

Summary
Presto was designed for high-performance analytics. TPC-H benchmarking results provide guidance on what to expect when similar queries are run. There are many tunables that need to be configured to achieve the best results, and this typically requires in-depth knowledge of Presto (which might be quite daunting). In such instances, a managed service like Ahana Cloud can help, because it takes care of the caching, configuration, and tuning parameters under the hood.

You can get started with a 14-day free trial of Ahana at https://app.ahana.cloud/signup

• Learn more about Ahana Cloud for Presto, the only SaaS for Presto on AWS
• What is Presto? More about the open source SQL query engine
• See why security leader Securonix chose Ahana to power its open data lake analytics
• Learn more about Ahana's pay-as-you-go pricing through the AWS marketplace
• Start your 14-day free trial of Ahana

Ahana, the only SaaS for Presto, offers the only managed service for Presto on
AWS with the vision to simplify open data lake analytics. Presto, the open source
project created by Facebook and used at Uber, Twitter and thousands more, is the
de facto standard for fast SQL processing on data lakes. Ahana Cloud delivers the
easiest Presto SaaS and enables data platform teams to provide high performance
SQL analytics on their S3 data lakes and other data sources. As a leading member
of the Presto community and Linux Foundation’s Presto Foundation, Ahana is also
focused on fostering growth and evangelizing open source Presto. Founded in 2020,
Ahana is headquartered in San Mateo, CA and operates as an all-remote company.
Investors include GV, Lux Capital, and Leslie Ventures.

©2022 Ahana Cloud, Inc. All rights reserved. The Ahana logo is an unregistered trademark of Ahana Cloud, Inc. All other trademarks are property of their respective owners.

