Presto: SQL on Everything
Raghav Sethi, Martin Traverso∗, Dain Sundstrom∗, David Phillips∗, Wenlei Xie, Yutian Sun,
Nezih Yigitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte∗, Christopher Berner∗
Facebook, Inc.
∗ Author was affiliated with Facebook, Inc. during the contribution period.
Abstract—Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto's Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.

Index Terms—SQL, query engine, big data, data warehouse

I. INTRODUCTION

The ability to quickly and easily extract insights from large amounts of data is increasingly important to technology-enabled organizations. As it becomes cheaper to collect and store vast amounts of data, it is important that tools to query this data become faster, easier to use, and more flexible. Using a popular query language like SQL can make data analytics accessible to more people within an organization. However, ease of use is compromised when organizations are forced to deploy multiple incompatible SQL-like systems to solve different classes of analytics problems.

Presto is an open-source distributed SQL query engine that has run in production at Facebook since 2013 and is used today by several large companies, including Uber, Netflix, Airbnb, Bloomberg, and LinkedIn. Organizations such as Qubole, Treasure Data, and Starburst Data have commercial offerings based on Presto. The Amazon Athena1 interactive querying service is built on Presto. With over a hundred contributors on GitHub, Presto has a strong open source community.

1 https://aws.amazon.com/athena

Presto is designed to be adaptive, flexible, and extensible. It provides an ANSI SQL interface to query data stored in Hadoop environments, open-source and proprietary RDBMSs, NoSQL systems, and stream processing systems such as Kafka. A 'Generic RPC'2 connector makes adding a SQL interface to proprietary systems as easy as implementing a half dozen RPC endpoints. Presto exposes an open HTTP API, ships with JDBC support, and is compatible with several industry-standard business intelligence (BI) and query authoring tools. The built-in Hive connector can natively read from and write to distributed file systems such as HDFS and Amazon S3, and supports several popular open-source file formats including ORC, Parquet, and Avro.

2 Using Thrift, an interface definition language and RPC protocol used for defining and creating services in multiple languages.

As of late 2018, Presto is responsible for supporting much of the SQL analytic workload at Facebook, including interactive/BI queries and long-running batch extract-transform-load (ETL) jobs. In addition, Presto powers several end-user facing analytics tools, serves high performance dashboards, provides a SQL interface to multiple internal NoSQL systems, and supports Facebook's A/B testing infrastructure. In aggregate, Presto processes hundreds of petabytes of data and quadrillions of rows per day at Facebook.

Presto has several notable characteristics:
• It is an adaptive multi-tenant system capable of concurrently running hundreds of memory-, I/O-, and CPU-intensive queries, and scaling to thousands of worker nodes while efficiently utilizing cluster resources.
• Its extensible, federated design allows administrators to set up clusters that can process data from many different data sources, even within a single query. This reduces the complexity of integrating multiple systems.
• It is flexible, and can be configured to support a vast variety of use cases with very different constraints and performance characteristics.
• It is built for high performance, with several key related features and optimizations, including code generation. Multiple running queries share a single long-lived Java Virtual Machine (JVM) process on worker nodes, which reduces response time but requires integrated scheduling, resource management, and isolation.

The primary contribution of this paper is to describe the design of the Presto engine, discussing the specific optimizations and trade-offs required to achieve the characteristics we described above. The secondary contributions are performance results for some key design decisions and optimizations, and a description of lessons learned while developing and maintaining Presto.
Presto was originally developed to enable interactive querying over the Facebook data warehouse. It evolved over time to support several different use cases, a few of which we describe in Section II. Rather than studying this evolution, we describe both the engine and use cases as they exist today, and call out main features and functionality as they relate to these use cases. The rest of the paper is structured as follows. In Section III, we provide an architectural overview, and then dive into system design in Section IV. We then describe some important performance optimizations in Section V, present performance results in Section VI, and engineering lessons we learned while developing Presto in Section VII. Finally, we outline key related work in Section VIII, and conclude in Section IX. Presto is under active development, and significant new functionality is added frequently. In this paper, we describe Presto as of version 0.211, released in September 2018.

II. USE CASES

At Facebook, we operate numerous Presto clusters (with sizes up to ∼1000 nodes) and support several different use cases. In this section we select four diverse use cases with large deployments and describe their requirements.

A. Interactive Analytics

Facebook operates a massive multi-tenant data warehouse as an internal service, where several business functions and organizational units share a smaller set of managed clusters. Data is stored in a distributed filesystem and metadata is stored in a separate service. These systems have APIs similar to those of HDFS and the Hive metastore service, respectively. We refer to this as the 'Facebook data warehouse', and use a variant of the Presto 'Hive' connector to read from and write to it.

Facebook engineers and data scientists routinely examine small amounts of data (∼50GB-3TB compressed), test hypotheses, and build visualizations or dashboards. Users often rely on query authoring tools, BI tools, or Jupyter notebooks. Individual clusters are required to support 50-100 concurrent running queries with diverse query shapes, and return results within seconds or minutes. Users are highly sensitive to end-to-end wall clock time, and may not have a good intuition of query resource requirements. While performing exploratory analysis, users may not require that the entire result set be returned. Queries are often canceled after initial results are returned, or use LIMIT clauses to restrict the amount of result data the system should produce.

C. A/B Testing

Much of the A/B test infrastructure at Facebook is built on Presto. Users expect test results to be available in hours (rather than days) and that the data be complete and accurate. It is also important for users to be able to perform arbitrary slice and dice on their results at interactive latency (∼5-30s) to gain deeper insights. It is difficult to satisfy this requirement by pre-aggregating data, so results must be computed on the fly. Producing results requires joining multiple large data sets, which include user, device, test, and event attributes. Query shapes are restricted to a small set since queries are programmatically generated.

D. Developer/Advertiser Analytics

Several custom reporting tools for external developers and advertisers are built on Presto. One example deployment of this use case is Facebook Analytics3, which offers advanced analytics tools to developers that build applications which use the Facebook platform. These deployments typically expose a web interface that can generate a restricted set of query shapes. Data volumes are large in aggregate, but queries are highly selective, as users can only access data for their own applications or ads. Most query shapes contain joins, aggregations, or window functions. Data ingestion latency is on the order of minutes. There are very strict query latency requirements (∼50ms-5s) as the tooling is meant to be interactive. Clusters must have 99.999% availability and support hundreds of concurrent queries given the volume of users.

III. ARCHITECTURE OVERVIEW

A Presto cluster consists of a single coordinator node and one or more worker nodes. The coordinator is responsible for admitting, parsing, planning, and optimizing queries as well as query orchestration. Worker nodes are responsible for query processing. Figure 1 shows a simplified view of Presto architecture.
• Since the engine switches between different splits from distinct task pipelines every quantum (Section IV-F1), the JIT would fail to optimize a common loop-based implementation, since the collected profiling information for the tight processing loop would be polluted by other tasks or queries.
• Even within the processing loop for a single task pipeline, the engine is aware of the types involved in each computation and can generate unrolled loops over columns. Eliminating target type variance in the loop body causes the profiler to conclude that call sites are monomorphic, allowing it to inline virtual methods.
• As the bytecode generated for every task is compiled into a separate Java class, each can be profiled independently by the JIT optimizer. In effect, the JIT optimizer further adapts a custom program generated for the query to the data actually processed. This profiling happens independently at each task, which improves performance in environments where each task processes a different partition of the data. Furthermore, the performance profile can change over the lifetime of the task as the data changes (e.g., time-series data or logs), causing the generated code to be updated.

Generated bytecode also benefits from the second-order effects of inlining. The JVM is able to broaden the scope of optimizations, auto-vectorize larger parts of the computation, and can take advantage of frequency-based basic block layout to minimize branches. CPU branch prediction also becomes far more effective [7]. Bytecode generation improves the engine's ability to store intermediate results in registers or caches rather than in memory [16].
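To make this concrete, the sketch below contrasts a generic, interpreted filter with the kind of type-specialized loop that code generation could emit for a predicate such as col0 + col1 > 10 over two BIGINT columns. The class names are hypothetical and the code only illustrates the shape of the idea; it is not Presto's actual generated bytecode.

    // Illustrative sketch: a generic evaluator vs. the kind of specialized loop a
    // code generator could emit for "col0 + col1 > 10" over two BIGINT columns.
    // Class and method names (genericFilter, GeneratedFilter_42) are hypothetical.
    import java.util.function.LongBinaryOperator;

    public class CodegenSketch {
        // Generic path: every row pays for dispatch through a functional interface, and
        // profiling data from many different operators pollutes the same call site.
        static int genericFilter(long[] col0, long[] col1, int[] out, LongBinaryOperator op) {
            int count = 0;
            for (int row = 0; row < col0.length; row++) {
                if (op.applyAsLong(col0[row], col1[row]) > 10) {
                    out[count++] = row;
                }
            }
            return count;
        }

        // Specialized path: one class per generated task pipeline, with concrete types in
        // the loop body, so the JIT sees monomorphic call sites and can unroll and inline.
        static final class GeneratedFilter_42 {
            int filter(long[] col0, long[] col1, int[] out) {
                int count = 0;
                for (int row = 0; row < col0.length; row++) {
                    long sum = col0[row] + col1[row];   // no virtual dispatch, no boxing
                    if (sum > 10) {
                        out[count++] = row;
                    }
                }
                return count;
            }
        }

        public static void main(String[] args) {
            long[] a = {1, 5, 9, 13};
            long[] b = {2, 4, 8, 16};
            int[] selected = new int[a.length];
            int n = new GeneratedFilter_42().filter(a, b, selected);
            System.out.println(n + " rows selected");   // prints "2 rows selected"
        }
    }

Because the specialized loop contains no virtual dispatch, the JIT observes monomorphic call sites and is free to inline, unroll, and vectorize it.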
C. File Format Features

Scan operators invoke the Connector API with leaf split information and receive columnar data in the form of Pages. A page consists of a list of Blocks, each of which is a column with a flat in-memory representation. Using flat memory data structures is important for performance, especially for complex types. Pointer chasing, unboxing, and virtual method calls add significant overhead to tight loops.

Connectors such as Hive and Raptor take advantage of specific file format features where possible [20]. Presto ships with custom readers for file formats that can efficiently skip data sections by using statistics in file headers/footers (e.g., min-max range headers and Bloom filters). The readers can convert certain forms of compressed data directly into blocks, which can be efficiently operated upon by the engine (Section V-E).

Figure 5 shows the layout of a page with compressed encoding schemes for each column. Dictionary-encoded blocks are very effective at compressing low-cardinality sections of data and run-length encoded (RLE) blocks compress repeated data. Several pages may share a dictionary, which greatly improves memory efficiency. A column in an ORC file can use a single dictionary for an entire 'stripe' (up to millions of rows).

Fig. 5. Different block types within a page. (Each page holds a partkey column as a LongBlock, a returnflag column as an RLEBlock, and a shipinstruct column as a DictionaryBlock whose indices point into a dictionary that is shared across pages.)
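As a rough illustration of such a flat layout (using the column values visible in Figure 5), the following simplified classes sketch long, run-length-encoded, and dictionary-encoded blocks grouped into a page. They are illustrative only and do not reflect the actual Presto Block and Page interfaces.

    // Simplified, illustrative columnar layout; these are not the real Presto Block/Page
    // interfaces. The example data mirrors Figure 5 (partkey, returnflag, shipinstruct).
    public class PageSketch {
        interface Block { int rowCount(); }

        // Fixed-width column: values in a flat primitive array, no pointer chasing.
        static final class LongBlock implements Block {
            final long[] values;
            LongBlock(long[] values) { this.values = values; }
            public int rowCount() { return values.length; }
        }

        // Run-length encoding: a single value repeated 'count' times (e.g. returnflag = "F" x 6).
        static final class RleBlock implements Block {
            final String value;
            final int count;
            RleBlock(String value, int count) { this.value = value; this.count = count; }
            public int rowCount() { return count; }
        }

        // Dictionary encoding: per-row indices into a dictionary that several pages may share.
        static final class DictionaryBlock implements Block {
            final int[] indices;
            final String[] dictionary;
            DictionaryBlock(int[] indices, String[] dictionary) {
                this.indices = indices;
                this.dictionary = dictionary;
            }
            public int rowCount() { return indices.length; }
            String getRow(int row) { return dictionary[indices[row]]; }
        }

        // A page is simply a list of blocks, one per column.
        static final class Page {
            final Block[] columns;
            Page(Block... columns) { this.columns = columns; }
        }

        public static void main(String[] args) {
            String[] shipInstruct = {"IN PERSON", "COD", "RETURN", "NONE"};   // shared dictionary
            Page page = new Page(
                    new LongBlock(new long[] {52470, 50600, 18866, 72387, 7429, 44077}),
                    new RleBlock("F", 6),
                    new DictionaryBlock(new int[] {1, 0, 1, 2, 0, 2}, shipInstruct));
            System.out.println(page.columns.length + " columns, "
                    + page.columns[0].rowCount() + " rows");   // 3 columns, 6 rows
        }
    }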
D. Lazy Data Loading

Presto supports lazy materialization of data. This functionality can leverage the columnar, compressed nature of file formats such as ORC, Parquet, and RCFile. Connectors can generate lazy blocks, which read, decompress, and decode data only when cells are actually accessed. Given that a large fraction of CPU time is spent decompressing and decoding, and that it is common for filters to be highly selective, this optimization is highly effective when columns are infrequently accessed. Tests on a sample of production workload from the Batch ETL use case show that lazy loading reduces data fetched by 78%, cells loaded by 22%, and total CPU time by 14%.
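A lazy block can be pictured as a thin wrapper that holds a decoding closure and materializes the column only on first access. The class below is an illustrative stand-in for this behavior, not the real connector-facing type.

    // Illustrative lazy-materialization wrapper; not the actual connector-facing Presto type.
    import java.util.function.Supplier;

    public class LazyBlockSketch {
        static final class LazyLongBlock {
            private final Supplier<long[]> loader;   // reads, decompresses, and decodes the column
            private long[] loaded;                   // stays null until a cell is accessed

            LazyLongBlock(Supplier<long[]> loader) { this.loader = loader; }

            long get(int row) {
                if (loaded == null) {
                    loaded = loader.get();           // decode cost is paid only on first access
                }
                return loaded[row];
            }

            boolean isLoaded() { return loaded != null; }
        }

        public static void main(String[] args) {
            LazyLongBlock block = new LazyLongBlock(() -> {
                System.out.println("decoding column...");
                return new long[] {10, 20, 30};
            });
            System.out.println(block.isLoaded());    // false: nothing decoded yet
            System.out.println(block.get(1));        // triggers the decode, then prints 20
        }
    }

If a selective filter on other columns eliminates every row of a page, such a block is discarded without its decode cost ever being paid.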
E. Operating on Compressed Data

Presto operates on compressed data (i.e., dictionary- and run-length-encoded blocks) sourced from the connector wherever possible. Figure 5 shows how these blocks are structured within a page. When a page processor evaluating a transformation or filter encounters a dictionary block, it processes all of the values in the dictionary (or the single value in a run-length-encoded block). This allows the engine to process the entire dictionary in a fast unconditional loop. In some cases, there are more values present in the dictionary than rows in the block. In this scenario the page processor speculates that the un-referenced values will be used in subsequent blocks. The page processor keeps track of the number of real rows produced and the size of the dictionary, which helps measure the effectiveness of processing the dictionary as compared to processing all the indices. If the number of rows is larger than the size of the dictionary, it is likely more efficient to process the dictionary instead. When the page processor encounters a new dictionary in the sequence of blocks, it uses this heuristic to determine whether to continue speculating.
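The following sketch shows the basic shape of this technique: the predicate is evaluated once per dictionary entry in a tight loop, each row is then answered by an index lookup, and a simple ratio of rows produced to dictionary entries processed stands in for the speculation heuristic. This is an illustrative reconstruction rather than Presto's page processor.

    // Illustrative sketch of filtering a dictionary-encoded column; not Presto's page processor.
    public class DictionaryFilterSketch {
        // Evaluate the predicate once per dictionary entry, then answer each row by lookup.
        static int filterDictionaryBlock(int[] indices, String[] dictionary, int[] selectedRows) {
            boolean[] entryPasses = new boolean[dictionary.length];
            for (int i = 0; i < dictionary.length; i++) {         // fast unconditional loop
                entryPasses[i] = !"NONE".equals(dictionary[i]);    // example predicate
            }
            int count = 0;
            for (int row = 0; row < indices.length; row++) {      // per-row work is one array lookup
                if (entryPasses[indices[row]]) {
                    selectedRows[count++] = row;
                }
            }
            return count;
        }

        // Simplified stand-in for the speculation heuristic: keep processing whole dictionaries
        // while the rows produced per dictionary entry processed remain favorable.
        static boolean keepSpeculating(long rowsProduced, long dictionaryEntriesProcessed) {
            return rowsProduced >= dictionaryEntriesProcessed;
        }

        public static void main(String[] args) {
            String[] dictionary = {"IN PERSON", "COD", "RETURN", "NONE"};
            int[] indices = {1, 0, 1, 2, 0, 2};
            int[] selected = new int[indices.length];
            int n = filterDictionaryBlock(indices, dictionary, selected);
            System.out.println(n + " of " + indices.length + " rows pass");   // 6 of 6 rows pass
            System.out.println(keepSpeculating(6, 4));                        // true
        }
    }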
Presto also leverages dictionary block structure when building hash tables (e.g., for joins or aggregations). As the indices are processed, the operator records hash table locations for every dictionary entry in an array. If the entry is repeated for a subsequent index, it simply re-uses the location rather than re-computing it. When successive blocks share the same dictionary, the page processor retains the array to further reduce the necessary computation.

Presto also produces intermediate compressed results during execution. The join processor, for example, produces dictionary blocks by recording positions in an array rather than copying the actual data. The operator simply produces a dictionary block where the index list is that array, and the dictionary is a reference to the block in the hash table.
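A minimal sketch of that memoization, assuming a plain hash map stands in for the operator's hash table, is shown below; real hashing, collision handling, and multi-column keys are omitted.

    // Illustrative memoization of hash-table positions per dictionary entry; not Presto's
    // join/aggregation operators, and the "hash table" here is just a HashMap of positions.
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryHashSketch {
        public static void main(String[] args) {
            String[] dictionary = {"IN PERSON", "COD", "RETURN", "NONE"};
            int[] indices = {1, 0, 1, 2, 0, 2};

            Map<String, Integer> hashTable = new HashMap<>();    // value -> assigned position
            int[] positionForEntry = new int[dictionary.length];
            Arrays.fill(positionForEntry, -1);                   // -1 means "not computed yet"

            int[] rowPositions = new int[indices.length];
            for (int row = 0; row < indices.length; row++) {
                int entry = indices[row];
                if (positionForEntry[entry] == -1) {
                    // First time this dictionary entry is seen: do the expensive work once.
                    String value = dictionary[entry];
                    Integer position = hashTable.get(value);
                    if (position == null) {
                        position = hashTable.size();
                        hashTable.put(value, position);
                    }
                    positionForEntry[entry] = position;
                }
                rowPositions[row] = positionForEntry[entry];     // repeated entries reuse it
            }
            System.out.println(Arrays.toString(rowPositions));   // [0, 1, 0, 2, 1, 2]
        }
    }

When the next block reuses the same dictionary, the position array can simply be retained, as described above.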
VI. PERFORMANCE

… memory subset that does not require spilling.

We use Presto version 0.211 with internal variants of the Hive/HDFS and Raptor connectors. Raptor is a shared-nothing storage engine designed for Presto. A common pattern is to perform exploratory analysis over the Hadoop warehouse and then load aggregate results or frequently-accessed data into Raptor for faster analysis and low-latency dashboards.

[Figure: CDF (%) over a time axis ranging from roughly one second to several hours.]

Fig. 8. Cluster avg. CPU utilization and concurrency over a 4-hour period. (Axes: worker avg. CPU utilization in %, number of queries running, and time in minutes after period start.)

[Figure: per-query execution time in seconds (0-700) for queries q09-q82 under Hive/HDFS (no stats), Hive/HDFS (table/column stats), and Raptor, with total execution time reported in minutes.]

VII. ENGINEERING LESSONS

Adaptiveness over configurability: As a complex multi-tenant query engine that executes arbitrary user-defined computation, Presto must be adaptive not only to different query characteristics, but also to combinations of characteristics. For example, until Presto had end-to-end adaptive backpressure (Section IV-E2), large amounts of memory and CPU were utilized by a small number of jobs with slow clients, which adversely affected latency-sensitive jobs that were running concurrently. Without adaptiveness, it would be necessary to narrowly partition workloads and tune configuration for each workload independently. That approach would not scale to the …

… or conflicting properties. This model poses its own set of challenges. However, with a large number of clusters and … operational investigations to the deployment process/tooling.

Vertical integration: Like other engineering teams, we design custom libraries for components where performance and efficiency are important. For example, custom file-format readers allow us to use Presto-native data structures end-to-end and avoid conversion overhead. However, we observed that the ability to easily debug and control library behaviors is equally important when operating a highly multi-threaded system that performs arbitrary computation in a long-lived process.

Consider an example of a recent production issue. Presto uses the Java built-in gzip library. While debugging a sequence of process crashes, we found that interactions between glibc and the gzip library (which invokes native code) caused memory fragmentation. For specific workload combinations, this caused large native memory leaks. To address this, we changed the way we use the library to influence the right cache flushing behavior, but in other cases we have gone as far as writing our own libraries for compression formats.

Custom libraries can also improve developer efficiency – reducing the surface area for bugs by only implementing necessary features, unifying configuration management, and …