Presto: SQL on Everything

Raghav Sethi, Martin Traverso∗ , Dain Sundstrom∗ , David Phillips∗ , Wenlei Xie, Yutian Sun,
Nezih Yigitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte∗ , Christopher Berner∗
Facebook, Inc.

Abstract—Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto's Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.

Index Terms—SQL, query engine, big data, data warehouse

I. INTRODUCTION

The ability to quickly and easily extract insights from large amounts of data is increasingly important to technology-enabled organizations. As it becomes cheaper to collect and store vast amounts of data, it is important that tools to query this data become faster, easier to use, and more flexible. Using a popular query language like SQL can make data analytics accessible to more people within an organization. However, ease-of-use is compromised when organizations are forced to deploy multiple incompatible SQL-like systems to solve different classes of analytics problems.

Presto is an open-source distributed SQL query engine that has run in production at Facebook since 2013 and is used today by several large companies, including Uber, Netflix, Airbnb, Bloomberg, and LinkedIn. Organizations such as Qubole, Treasure Data, and Starburst Data have commercial offerings based on Presto. The Amazon Athena¹ interactive querying service is built on Presto. With over a hundred contributors on GitHub, Presto has a strong open source community.

Presto is designed to be adaptive, flexible, and extensible. It provides an ANSI SQL interface to query data stored in Hadoop environments, open-source and proprietary RDBMSs, NoSQL systems, and stream processing systems such as Kafka. A 'Generic RPC'² connector makes adding a SQL interface to proprietary systems as easy as implementing a half dozen RPC endpoints. Presto exposes an open HTTP API, ships with JDBC support, and is compatible with several industry-standard business intelligence (BI) and query authoring tools. The built-in Hive connector can natively read from and write to distributed file systems such as HDFS and Amazon S3, and supports several popular open-source file formats including ORC, Parquet, and Avro.

As of late 2018, Presto is responsible for supporting much of the SQL analytic workload at Facebook, including interactive/BI queries and long-running batch extract-transform-load (ETL) jobs. In addition, Presto powers several end-user facing analytics tools, serves high performance dashboards, provides a SQL interface to multiple internal NoSQL systems, and supports Facebook's A/B testing infrastructure. In aggregate, Presto processes hundreds of petabytes of data and quadrillions of rows per day at Facebook.

Presto has several notable characteristics:
• It is an adaptive multi-tenant system capable of concurrently running hundreds of memory, I/O, and CPU-intensive queries, and scaling to thousands of worker nodes while efficiently utilizing cluster resources.
• Its extensible, federated design allows administrators to set up clusters that can process data from many different data sources even within a single query. This reduces the complexity of integrating multiple systems.
• It is flexible, and can be configured to support a vast variety of use cases with very different constraints and performance characteristics.
• It is built for high performance, with several key related features and optimizations, including code-generation. Multiple running queries share a single long-lived Java Virtual Machine (JVM) process on worker nodes, which reduces response time, but requires integrated scheduling, resource management and isolation.

The primary contribution of this paper is to describe the design of the Presto engine, discussing the specific optimizations and trade-offs required to achieve the characteristics we described above. The secondary contributions are performance results for some key design decisions and optimizations, and a description of lessons learned while developing and maintaining Presto.

Presto was originally developed to enable interactive querying over the Facebook data warehouse. It evolved over time to support several different use cases, a few of which we describe in Section II. Rather than studying this evolution, we describe both the engine and use cases as they exist today, and call out main features and functionality as they relate to these use cases. The rest of the paper is structured as follows. In Section III, we provide an architectural overview, and then dive into system design in Section IV. We then describe some important performance optimizations in Section V, present performance results in Section VI, and engineering lessons we learned while developing Presto in Section VII. Finally, we outline key related work in Section VIII, and conclude in Section IX. Presto is under active development, and significant new functionality is added frequently. In this paper, we describe Presto as of version 0.211, released in September 2018.

∗ Author was affiliated with Facebook, Inc. during the contribution period.
¹ https://aws.amazon.com/athena
² Using Thrift, an interface definition language and RPC protocol used for defining and creating services in multiple languages.
II. USE CASES

At Facebook, we operate numerous Presto clusters (with sizes up to ∼1000 nodes) and support several different use cases. In this section we select four diverse use cases with large deployments and describe their requirements.

A. Interactive Analytics

Facebook operates a massive multi-tenant data warehouse as an internal service, where several business functions and organizational units share a smaller set of managed clusters. Data is stored in a distributed filesystem and metadata is stored in a separate service. These systems have APIs similar to that of HDFS and the Hive metastore service, respectively. We refer to this as the 'Facebook data warehouse', and use a variant of the Presto 'Hive' connector to read from and write to it.

Facebook engineers and data scientists routinely examine small amounts of data (∼50GB-3TB compressed), test hypotheses, and build visualizations or dashboards. Users often rely on query authoring tools, BI tools, or Jupyter notebooks. Individual clusters are required to support 50-100 concurrent running queries with diverse query shapes, and return results within seconds or minutes. Users are highly sensitive to end-to-end wall clock time, and may not have a good intuition of query resource requirements. While performing exploratory analysis, users may not require that the entire result set be returned. Queries are often canceled after initial results are returned, or use LIMIT clauses to restrict the amount of result data the system should produce.

B. Batch ETL

The data warehouse we described above is populated with fresh data at regular intervals using ETL queries. Queries are scheduled by a workflow management system that determines dependencies between tasks and schedules them accordingly. Presto supports users migrating from legacy batch processing systems, and ETL queries now make up a large fraction of the Presto workload at Facebook by CPU. These queries are typically written and optimized by data engineers. They tend to be much more resource intensive than queries in the Interactive Analytics use case, and often involve performing CPU-heavy transformations and memory-intensive (multiple TBs of distributed memory) aggregations or joins with other large tables. Query latency is somewhat less important than resource efficiency and overall cluster throughput.

C. A/B Testing

A/B testing is used at Facebook to evaluate the impact of product changes through statistical hypothesis testing. Much of the A/B test infrastructure at Facebook is built on Presto. Users expect test results to be available in hours (rather than days) and that the data be complete and accurate. It is also important for users to be able to perform arbitrary slice and dice on their results at interactive latency (∼5-30s) to gain deeper insights. It is difficult to satisfy this requirement by pre-aggregating data, so results must be computed on the fly. Producing results requires joining multiple large data sets, which include user, device, test, and event attributes. Query shapes are restricted to a small set since queries are programmatically generated.

D. Developer/Advertiser Analytics

Several custom reporting tools for external developers and advertisers are built on Presto. One example deployment of this use case is Facebook Analytics³, which offers advanced analytics tools to developers that build applications which use the Facebook platform. These deployments typically expose a web interface that can generate a restricted set of query shapes. Data volumes are large in aggregate, but queries are highly selective, as users can only access data for their own applications or ads. Most query shapes contain joins, aggregations or window functions. Data ingestion latency is in the order of minutes. There are very strict query latency requirements (∼50ms-5s) as the tooling is meant to be interactive. Clusters must have 99.999% availability and support hundreds of concurrent queries given the volume of users.

³ https://analytics.facebook.com

III. ARCHITECTURE OVERVIEW

A Presto cluster consists of a single coordinator node and one or more worker nodes. The coordinator is responsible for admitting, parsing, planning and optimizing queries as well as query orchestration. Worker nodes are responsible for query processing. Figure 1 shows a simplified view of Presto architecture.

[Fig. 1. Presto Architecture: a coordinator (queue, planner/optimizer, scheduler) uses the Metadata and Data Location APIs to reach external storage systems, and orchestrates workers, which read data through the Data Source API and return results to the client.]

The client sends an HTTP request containing a SQL statement to the coordinator. The coordinator processes the request
by evaluating queue policies, parsing and analyzing the SQL text, and creating and optimizing a distributed execution plan.

The coordinator distributes this plan to workers, starts execution of tasks, and then begins to enumerate splits, which are opaque handles to an addressable chunk of data in an external storage system. Splits are assigned to the tasks responsible for reading this data.

Worker nodes running these tasks process these splits by fetching data from external systems, or process intermediate results produced by other workers. Workers use co-operative multi-tasking to process tasks from many queries concurrently. Execution is pipelined as much as possible, and data flows between tasks as it becomes available. For certain query shapes, Presto is capable of returning results before all the data is processed. Intermediate data and state is stored in-memory whenever possible. When shuffling data between nodes, buffering is tuned for minimal latency.

Presto is designed to be extensible, and provides a versatile plugin interface. Plugins can provide custom data types, functions, access control implementations, event consumers, queuing policies, and configuration properties. More importantly, plugins also provide connectors, which enable Presto to communicate with external data stores through the Connector API, which is composed of four parts: the Metadata API, Data Location API, Data Source API, and Data Sink API. These APIs are designed to allow performant implementations of connectors within the environment of a physically distributed execution engine. Developers have contributed over a dozen connectors to the main Presto repository, and we are aware of several proprietary connectors.
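To make the shape of the Connector API concrete, the following sketch shows a minimal connector surface with the four parts named above. The interfaces and method signatures here are illustrative simplifications written for this description, not the actual Presto SPI:

    // Illustrative sketch only: simplified stand-ins for the four parts
    // of the Connector API described above.
    import java.util.List;

    interface ConnectorMetadata {          // Metadata API: schemas, tables, columns
        List<String> listTables(String schema);
    }

    interface SplitManager {               // Data Location API: where the data lives
        List<Split> getSplits(String table);
    }

    interface Split {                      // opaque handle to a chunk of data
        List<String> preferredHosts();     // locality hints for the scheduler
    }

    interface PageSource {                 // Data Source API: produces columnar Pages
        Page nextPage();                   // null when the split is exhausted
    }

    interface PageSink {                   // Data Sink API: accepts Pages for writes
        void appendPage(Page page);
        void commit();
    }

    final class Page { /* columnar batch of rows; see Section V-C */ }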
IV. SYSTEM DESIGN

In this section we describe some of the key design decisions and features of the Presto engine. We describe the SQL dialect that Presto supports, then follow the query lifecycle all the way from client to distributed execution. We also describe some of the resource management mechanisms that enable multi-tenancy in Presto. Finally, we briefly discuss fault tolerance.
A. SQL Dialect

Presto closely follows the ANSI SQL specification [2]. While the engine does not implement every feature described, implemented features conform to the specification as far as possible. We have made a few carefully chosen extensions to the language to improve usability. For example, it is difficult to operate on complex data types, such as maps and arrays, in ANSI SQL. To simplify operating on these common data types, Presto syntax supports anonymous functions (lambda expressions) and built-in higher-order functions (e.g., transform, filter, reduce).
B. Client Interfaces, Parsing, and Planning

1) Client Interfaces: The Presto coordinator primarily exposes a RESTful HTTP interface to clients, and ships with a first-class command line interface. Presto also ships with a JDBC client, which enables compatibility with a wide variety of BI tools, including Tableau and Microstrategy.
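As an illustration, a tool or script can submit a query through the JDBC client as sketched below, assuming the Presto JDBC driver is on the classpath. The host, catalog, schema, and table names are placeholders, and the query exercises the filter higher-order function from the dialect extensions above:

    // Minimal sketch: submitting a query through the Presto JDBC client.
    // 'example-coordinator', the 'hive' catalog, and the 'exams' table are
    // hypothetical names used only for illustration.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcExample {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:presto://example-coordinator:8080/hive/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", null);
                 Statement stmt = conn.createStatement();
                 // A lambda passed to the built-in higher-order function
                 // 'filter' (one of the SQL dialect extensions above).
                 ResultSet rs = stmt.executeQuery(
                     "SELECT filter(scores, s -> s > 90) FROM exams LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }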
2) Parsing: Presto uses an ANTLR-based parser to convert SQL statements into a syntax tree. The analyzer uses this tree to determine types and coercions, resolve functions and scopes, and extract logical components, such as subqueries, aggregations, and window functions.

3) Logical Planning: The logical planner uses the syntax tree and analysis information to generate an intermediate representation (IR) encoded in the form of a tree of plan nodes. Each node represents a physical or logical operation, and the children of a plan node are its inputs. The planner produces nodes that are purely logical, i.e. they do not contain any information about how the plan should be executed. Consider a simple query:

    SELECT
        orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem
        ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey

The logical plan for this query is outlined in Figure 2.

[Fig. 2. Logical Plan: Aggregate [SUM(tax)] over LeftJoin [ON orderkey], whose inputs are Filter [discount=0] over Scan [lineitem], and Scan [orders].]

C. Query Optimization

The plan optimizer transforms the logical plan into a more physical structure that represents an efficient execution strategy for the query. The process works by evaluating a set of transformation rules greedily until a fixed point is reached. Each rule has a pattern that can match a sub-tree of the query plan and determines whether the transformation should be applied. The result is a logically equivalent sub-plan that replaces the target of the match. Presto contains several rules, including well-known optimizations such as predicate and limit pushdown, column pruning, and decorrelation.

We are in the process of enhancing the optimizer to perform a more comprehensive exploration of the search space using a cost-based evaluation of plans based on the techniques introduced by the Cascades framework [13]. However, Presto already supports two cost-based optimizations that take table and column statistics into account - join strategy selection and join re-ordering. We will discuss only a few features of the optimizer; a detailed treatment is out of the scope of this paper.
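The greedy rule-evaluation loop can be sketched as follows. PlanNode, Rule, and GreedyOptimizer are illustrative stand-ins rather than the optimizer's actual classes, and a real implementation matches rule patterns against every sub-tree, not only the root:

    // Sketch of greedy rule application to a fixed point.
    import java.util.List;
    import java.util.Optional;

    interface PlanNode { List<PlanNode> children(); }

    interface Rule {
        // Returns a logically equivalent replacement if the pattern matches.
        Optional<PlanNode> apply(PlanNode node);
    }

    final class GreedyOptimizer {
        private final List<Rule> rules;

        GreedyOptimizer(List<Rule> rules) { this.rules = rules; }

        PlanNode optimize(PlanNode root) {
            boolean changed = true;
            while (changed) {          // iterate until a fixed point is reached
                changed = false;
                for (Rule rule : rules) {
                    Optional<PlanNode> rewritten = rule.apply(root);
                    if (rewritten.isPresent()) {
                        root = rewritten.get();   // replace the matched sub-plan
                        changed = true;
                    }
                }
            }
            return root;
        }
    }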
1) Data Layouts: The optimizer can take advantage of the physical layout of the data when it is provided by the connector Data Layout API. Connectors report locations and other data properties such as partitioning, sorting, grouping, and indices. Connectors can return multiple layouts for a single table, each with different properties, and the optimizer can select the most efficient layout for the query [15], [19]. This functionality is used by administrators operating clusters for the Developer/Advertiser Analytics use case; it enables them to optimize new query shapes simply by adding physical layouts. We will see some of the ways the engine can take advantage of these properties in the subsequent sections.

2) Predicate Pushdown: The optimizer can work with connectors to decide when pushing range and equality predicates down through the connector improves filtering efficiency.

For example, the Developer/Advertiser Analytics use case leverages a proprietary connector built on top of sharded MySQL. The connector divides data into shards that are stored in individual MySQL instances, and can push range or point predicates all the way down to individual shards, ensuring that only matching data is ever read from MySQL. If multiple layouts are present, the engine selects a layout that is indexed on the predicate columns. Efficient index based filtering is very important for the highly selective filters used in the Developer/Advertiser Analytics tools. For the Interactive Analytics and Batch ETL use cases, Presto leverages the partition pruning and file-format features (Section V-C) in the Hive connector to improve performance in a similar fashion.
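A sketch of the shard-pruning idea follows, under the assumption that the connector keeps a per-shard key range in its metadata; the Shard fields and the pruning logic are invented for illustration:

    // Sketch: keep only shards whose key range can satisfy a pushed-down
    // range predicate (lo <= orderkey <= hi); other shards are never read.
    import java.util.ArrayList;
    import java.util.List;

    final class Shard {
        final String host;
        final long minOrderKey;   // per-shard key range from shard metadata
        final long maxOrderKey;

        Shard(String host, long min, long max) {
            this.host = host;
            this.minOrderKey = min;
            this.maxOrderKey = max;
        }
    }

    final class ShardPruner {
        static List<Shard> prune(List<Shard> shards, long lo, long hi) {
            List<Shard> matching = new ArrayList<>();
            for (Shard shard : shards) {
                boolean overlaps = shard.maxOrderKey >= lo && shard.minOrderKey <= hi;
                if (overlaps) {
                    matching.add(shard);
                }
            }
            return matching;
        }
    }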
3) Inter-node Parallelism: Part of the optimization process involves identifying parts of the plan that can be executed in parallel across workers. These parts are known as 'stages', and every stage is distributed to one or more tasks, each of which execute the same computation on different sets of input data. The engine inserts buffered in-memory data transfers (shuffles) between stages to enable data exchange. Shuffles add latency, use up buffer memory, and have high CPU overhead. Therefore, the optimizer must reason carefully about the total number of shuffles introduced into the plan. Figure 3 shows how a naïve implementation would partition a plan into stages and connect them using shuffles.

[Fig. 3. Distributed plan for Figure 2. The connector has not exposed any data layout properties, and shuffle reduction optimizations have not been applied. Four shuffles are required to execute the query (Stages 0-4, connected by one collecting and three partitioned shuffles).]

Data Layout Properties: The physical data layout can be used by the optimizer to minimize the number of shuffles in the plan. This is very useful in the A/B Testing use case, where almost every query requires a large join to produce experiment details or population information. The engine takes advantage of the fact that both tables participating in the join are partitioned on the same column, and uses a co-located join strategy to eliminate a resource-intensive shuffle.

If connectors expose a data layout in which join columns are marked as indices, the optimizer is able to determine if using an index nested loop join would be an appropriate strategy. This can make it extremely efficient to operate on normalized data stored in a data warehouse by joining against production data stores (key-value or otherwise). This is a commonly used feature in the Interactive Analytics use case.

Node Properties: Like connectors, nodes in the plan tree can express properties of their outputs (i.e. the partitioning, sorting, bucketing, and grouping characteristics of the data) [24]. These nodes can also express required and preferred properties, which are taken into account when introducing shuffles. Redundant shuffles are simply elided, but in other cases the properties of the shuffle can be changed to reduce the number of shuffles required. Presto greedily selects partitioning that will satisfy as many required properties as possible to reduce shuffles. This means that the optimizer may choose to partition on fewer columns, which in some cases can result in greater partition skew. As an example, this optimization applied to the plan in Figure 3 causes it to collapse to a single data processing stage.

4) Intra-node Parallelism: The optimizer uses a similar mechanism to identify sections within plan stages that can benefit from being parallelized across threads on a single node. Parallelizing within a node is much more efficient than inter-node parallelism, since there is little latency overhead, and state (e.g., hash-tables and dictionaries) can be efficiently shared between threads. Adding intra-node parallelism can lead to significant speedups, especially for query shapes where concurrency constrains throughput at downstream stages:
• The Interactive Analytics use case involves running many short one-off queries, and users do not typically spend time trying to optimize these. As a result, partition skew is common, either due to inherent properties of the data, or as a result of common query patterns (e.g., grouping by user country while also filtering to a small set of countries). This typically manifests as a large volume of data being hash-partitioned on to a small number of nodes.
• Batch ETL jobs often transform large data sets with little or no filtering. In these scenarios, the smaller number of nodes involved in the higher levels of the tree may be insufficient to quickly process the volume of data generated by the leaf stage. Task scheduling is discussed in Section IV-D2.
In both of these scenarios, multiple threads per worker performing the computation can alleviate this concurrency bottleneck to some degree. The engine can run a single sequence of operators (or pipeline) in multiple threads. Figure 4 shows how the optimizer is able to parallelize one section of a join.

[Fig. 4. Materialized and optimized plan corresponding to Figure 3, showing tasks, pipelines, and operators. Pipelines 1 and 2 are parallelized across multiple threads to speed up the build side of a hash-join.]
D. Scheduling

The coordinator distributes plan stages to workers in the form of executable tasks, which can be thought of as single processing units. Then, the coordinator links tasks in one stage to tasks in other stages, forming a tree of processors linked to one another by shuffles. Data streams from stage to stage as soon as it is available.

A task may have multiple pipelines within it. A pipeline consists of a chain of operators, each of which performs a single, well-defined computation on the data. For example, a task performing a hash-join must contain at least two pipelines; one to build the hash table (build pipeline), and one to stream data from the probe side and perform the join (probe pipeline). When the optimizer determines that part of a pipeline would benefit from increased local parallelism, it can split up the pipeline and parallelize that part independently. Figure 4 shows how the build pipeline has been split up into two pipelines, one to scan data, and the other to build partitions of the hash table. Pipelines are joined together by a local in-memory shuffle.

To execute a query, the engine makes two sets of scheduling decisions. The first determines the order in which stages are scheduled, and the second determines how many tasks should be scheduled, and which nodes they should be placed on.

1) Stage Scheduling: Presto supports two scheduling policies for stages: all-at-once and phased. All-at-once minimizes wall clock time by scheduling all stages of execution concurrently; data is processed as soon as it is available. This scheduling strategy benefits latency-sensitive use cases such as Interactive Analytics, Developer/Advertiser Analytics, and A/B Testing. Phased execution identifies all the strongly connected components of the directed data flow graph that must be started at the same time to avoid deadlocks, and executes those in topological order. For example, if a hash-join is executed in phased mode, the tasks that stream the left side will not be scheduled until the hash table is built. This greatly improves memory efficiency for the Batch Analytics use case.

When the scheduler determines that a stage should be scheduled according to the policy, it begins to assign tasks for that stage to worker nodes.

2) Task Scheduling: The task scheduler examines the plan tree and classifies stages into leaf and intermediate stages. Leaf stages read data from connectors, while intermediate stages only process intermediate results from other stages.

Leaf Stages: For leaf stages, the task scheduler takes into account the constraints imposed by the network and connectors when assigning tasks to worker nodes. For example, shared-nothing deployments require that workers be co-located with storage nodes. The scheduler uses the Connector Data Layout API to decide task placement under these circumstances. The A/B Testing use case requires predictable high-throughput, low-latency data reads, which are satisfied by the Raptor connector. Raptor is a storage engine optimized for Presto with a shared-nothing architecture that stores ORC files on flash disks and metadata in MySQL.

Profiling shows that a majority of CPU time across our production clusters is spent decompressing, decoding, filtering and applying transformations to data read from connectors. This work is highly parallelizable, and running these stages on as many nodes as possible usually yields the shortest wall time. Therefore, if there are no constraints, and the data can be divided up into enough splits, a leaf stage task is scheduled on every worker node in the cluster. For the Facebook data warehouse deployments that run in shared-storage mode (i.e. all data is remote), every node in a cluster is usually involved in processing the leaf stage. This execution strategy can be network intensive.

The scheduler can also reason about network topology to optimize reads using a plugin-provided hierarchy. Network-constrained deployments at Facebook can use this mechanism to express to the engine a preference for rack-local reads over rack-remote reads.

Intermediate Stages: Tasks for intermediate stages can be placed on any worker node. However, the engine still needs to decide how many tasks should be scheduled for each stage. This decision is based on the connector configuration, the properties of the plan, the required data layout, and other deployment configuration. In some cases, the engine can dynamically change the number of tasks during execution. Section IV-E3 describes one such scenario.
3) Split Scheduling: When a task in a leaf stage begins execution on a worker node, the node makes itself available to receive one or more splits (described in Section III). The information that a split contains varies by connector. When reading from a distributed file system, a split might consist of a file path and offsets to a region of the file. For the Redis key-value store, a split consists of table information, a key and value format, and a list of hosts to query, among other things.

Every task in a leaf stage must be assigned one or more splits to become eligible to run. Tasks in intermediate stages are always eligible to run, and finish only when they are aborted or all their upstream tasks are completed.

Split Assignment: As tasks are set up on worker nodes, the coordinator starts to assign splits to these tasks. Presto asks connectors to enumerate small batches of splits, and assigns them to tasks lazily. This is an important feature of Presto and provides several benefits:
• Decouples query response time from the time it takes the connector to enumerate a large number of splits. For example, it can take minutes for the Hive connector to enumerate partitions and list files in each partition directory.
• Queries that can start producing results without processing all the data (e.g., simply selecting data with a filter) are frequently canceled quickly or complete early when a LIMIT clause is satisfied. In the Interactive Analytics use case, it is common for queries to finish before all the splits have even been enumerated.
• Workers maintain a queue of splits they are assigned to process. The coordinator simply assigns new splits to tasks with the shortest queue, as sketched after this list. Keeping these queues small allows the system to adapt to variance in CPU cost of processing different splits and performance differences among workers.
• Allows queries to execute without having to hold all their metadata in memory. This is important for the Hive connector, where queries may access millions of splits and can easily consume all available coordinator memory.

These features are particularly useful for the Interactive Analytics and Batch ETL use cases, which run on the Facebook Hive-compatible data warehouse. It's worth noting that lazy split enumeration can make it difficult to accurately estimate and report query progress.
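A minimal sketch of the shortest-queue assignment policy follows; TaskHandle and SplitAssigner are simplified stand-ins for the coordinator's actual bookkeeping:

    // Sketch: assign each newly enumerated split to the task with the
    // fewest queued splits, so busy or slow workers receive less work.
    import java.util.ArrayDeque;
    import java.util.List;
    import java.util.Queue;

    final class TaskHandle {
        final Queue<Split> pendingSplits = new ArrayDeque<>();
    }

    final class SplitAssigner {
        // Assumes 'tasks' contains at least one task.
        static void assign(List<Split> batch, List<TaskHandle> tasks) {
            for (Split split : batch) {
                TaskHandle shortest = tasks.get(0);
                for (TaskHandle task : tasks) {
                    if (task.pendingSplits.size() < shortest.pendingSplits.size()) {
                        shortest = task;
                    }
                }
                shortest.pendingSplits.add(split);
            }
        }
    }

    interface Split { }   // opaque handle, as in Section III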
age of data transferred per request to compute a target HTTP
E. Query Execution request concurrency that keeps the input buffers populated
1) Local Data Flow: Once a split is assigned to a thread, it while not exceeding their capacity. This backpressure causes
is executed by the driver loop. The Presto driver loop is more upstream tasks to slow down as their buffers fill up.
complex than the popular Volcano (pull) model of recursive 3) Writes: ETL jobs generally produce data that must be
iterators [1], but provides important functionality. It is much written to other tables. An important driver of write perfor-
more amenable to cooperative multi-tasking, since operators mance in a remote-storage environment is the concurrency
can be quickly brought to a known state before yielding the with which the write is performed (i.e. the aggregate number
thread instead of blocking indefinitely. In addition, the driver of threads writing data through the Connector Data Sink API).
can maximize work performed in every quanta by moving data Consider the example of a Hive connector configured to
between operators that can make progress without additional use Amazon S3 for storage. Every concurrent write to S3
input (e.g., resuming computation of resource-intensive or creates a new file, and hundreds of writes of a small aggregate
explosive transformations). Every iteration of the loop moves amount of data are likely to create small files. Unless these
data between all pairs of operators that can make progress. small units of data can be later coalesced, they are likely to
The unit of data that the driver loop operates on is called create unacceptably high overheads while reading (many slow
a page, which is a columnar encoding of a sequence of metadata operations, and latency-bound read performance).
rows. The Connector Data Source API returns pages when However, using too little concurrency can decrease aggre-
it is passed a split, and operators typically consume input gate write throughput to unacceptable levels. Presto takes
pages, perform computation, and produce output pages. Figure an adaptive approach again, dynamically increasing writer
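The following sketch captures the spirit of this loop: each iteration pushes pages between adjacent operators that can make progress, and the loop yields when the quanta expires or nothing can move. The Operator interface is a simplified stand-in for the engine's actual operator API:

    // Sketch of a cooperative driver loop over a single pipeline.
    import java.util.List;

    interface Operator {
        boolean needsInput();
        void addInput(Page page);
        Page getOutput();          // null if no output is ready
    }

    final class Driver {
        private final List<Operator> operators;   // a pipeline, source to sink

        Driver(List<Operator> operators) { this.operators = operators; }

        // Run until the time quanta elapses or no operator can make progress.
        void process(long quantaNanos) {
            long start = System.nanoTime();
            while (System.nanoTime() - start < quantaNanos) {
                boolean progressed = false;
                for (int i = 0; i < operators.size() - 1; i++) {
                    Operator upstream = operators.get(i);
                    Operator downstream = operators.get(i + 1);
                    if (downstream.needsInput()) {
                        Page page = upstream.getOutput();
                        if (page != null) {
                            downstream.addInput(page);
                            progressed = true;
                        }
                    }
                }
                if (!progressed) {
                    return;   // blocked: yield the thread cooperatively
                }
            }
        }
    }

    final class Page { /* columnar batch of rows */ }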
2) Shuffles: Presto is designed to minimize end-to-end latency while maximizing resource utilization, and our inter-node data flow mechanism reflects this design choice. Presto uses in-memory buffered shuffles over HTTP to exchange intermediate results. Data produced by tasks is stored in buffers for consumption by other workers. Workers request intermediate results from other workers using HTTP long-polling. The server retains data until the client requests the next segment using a token sent in the previous response. This makes the acknowledgement implicit in the transfer protocol. The long-polling mechanism minimizes response time, especially when transferring small amounts of data. This mechanism offers much lower latency than other systems that persist shuffle data to disk [4], [21] and allows Presto to support latency-sensitive use cases such as Developer/Advertiser Analytics.

The engine tunes parallelism to maintain target utilization rates for output and input buffers. Full output buffers cause split execution to stall and use up valuable memory, while underutilized input buffers add unnecessary processing overhead.

The engine continuously monitors the output buffer utilization. When utilization is consistently high, it lowers effective concurrency by reducing the number of splits eligible to be run. This has the effect of increasing fairness in sharing of network resources. It is also an important efficiency optimization when dealing with clients (either end-users or other workers) that are unable to consume data at the rate it is being produced. Without this functionality, slow clients running complex multi-stage queries could hold tens of gigabytes worth of buffer memory for long periods of time. This scenario is common even when a small amount of result data (∼10-50MB) is being downloaded by a BI or query authoring tool over slow connections in the Interactive Analytics use case.

On the receiver side, the engine monitors the moving average of data transferred per request to compute a target HTTP request concurrency that keeps the input buffers populated while not exceeding their capacity. This backpressure causes upstream tasks to slow down as their buffers fill up.
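A sketch of the receiver side of this token-based long-poll protocol follows. The /results/{token} endpoint and the X-Next-Token header are invented for illustration; the real wire format differs:

    // Sketch: fetch the next shuffle segment; sending a later token
    // implicitly acknowledges (and lets the server discard) earlier data.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    final class ExchangeClient {
        private final HttpClient http = HttpClient.newHttpClient();
        private long token;   // position of the next segment to fetch

        byte[] fetchNextSegment(URI producer) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(producer + "/results/" + token))
                    .timeout(Duration.ofSeconds(10))   // long-poll timeout
                    .GET()
                    .build();
            // The server holds the request open until data is available,
            // which keeps response time low for small transfers.
            HttpResponse<byte[]> response =
                    http.send(request, HttpResponse.BodyHandlers.ofByteArray());
            token = Long.parseLong(response.headers()
                    .firstValue("X-Next-Token").orElseThrow());
            return response.body();
        }
    }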
3) Writes: ETL jobs generally produce data that must be written to other tables. An important driver of write performance in a remote-storage environment is the concurrency with which the write is performed (i.e. the aggregate number of threads writing data through the Connector Data Sink API).

Consider the example of a Hive connector configured to use Amazon S3 for storage. Every concurrent write to S3 creates a new file, and hundreds of writes of a small aggregate amount of data are likely to create small files. Unless these small units of data can be later coalesced, they are likely to create unacceptably high overheads while reading (many slow metadata operations, and latency-bound read performance). However, using too little concurrency can decrease aggregate write throughput to unacceptable levels. Presto takes an adaptive approach again, dynamically increasing writer concurrency by adding tasks on more worker nodes when the engine determines that the stage producing data for the write exceeds a buffer utilization threshold (and a configurable per-writer data written threshold). This is an important efficiency optimization for the write-heavy Batch ETL use case.
F. Resource Management

One of the key features that makes Presto a good fit for multi-tenant deployments is that it contains a fully-integrated fine-grained resource management system. A single cluster can execute hundreds of queries concurrently, and maximize the use of CPU, IO, and memory resources.

1) CPU Scheduling: Presto primarily optimizes for overall cluster throughput, i.e. aggregate CPU utilized for processing data. The local (node-level) scheduler additionally optimizes for low turnaround time for computationally inexpensive queries, and the fair sharing of CPU resources amongst queries with similar CPU requirements. A task's resource usage is the aggregate thread CPU time given to each of its splits. To minimize coordination overhead, Presto tracks CPU resource usage at the task level and makes scheduling decisions locally.

Presto schedules many concurrent tasks on every worker node to achieve multi-tenancy, and uses a cooperative multi-tasking model. Any given split is only allowed to run on a thread for a maximum quanta of one second, after which it must relinquish the thread and return to the queue. When output buffers are full (downstream stages cannot consume data fast enough), input buffers are empty (upstream stages cannot produce data fast enough), or the system is out of memory, the local scheduler simply switches to processing another task even before the quanta is complete. This frees up threads for runnable splits, helps Presto maximize CPU usage, and is highly adaptive to different query shapes. All of our use cases benefit from this granular resource efficiency.
When a split relinquishes a thread, the engine needs to decide which task (associated with one or more splits) to run next. Rather than predict the resources required to complete a new query ahead of time, Presto simply uses a task's aggregate CPU time to classify it into the five levels of a multi-level feedback queue [8]. As tasks accumulate more CPU time, they move to higher levels. Each level is assigned a configurable fraction of the available CPU time. In practice, it is challenging to accomplish fair cooperative multi-tasking with arbitrary workloads. The I/O and CPU characteristics for splits vary wildly (sometimes even within the same task), and complex functions (e.g., regular expressions) can consume excessive amounts of thread time relative to other splits. Some connectors do not provide asynchronous APIs, and worker threads can be held for several minutes.
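A sketch of such a multi-level feedback queue follows. The level thresholds and the pick-cheapest-first policy shown are illustrative; as described above, Presto instead assigns each level a configurable fraction of CPU time to avoid starving expensive tasks:

    // Sketch: tasks sink to lower-priority levels as they accumulate CPU time.
    import java.util.ArrayDeque;
    import java.util.Queue;

    final class MultilevelQueue {
        private static final long[] LEVEL_THRESHOLD_SECONDS = {1, 10, 60, 300};

        @SuppressWarnings("unchecked")
        private final Queue<Task>[] levels = new Queue[5];

        MultilevelQueue() {
            for (int i = 0; i < levels.length; i++) {
                levels[i] = new ArrayDeque<>();
            }
        }

        void enqueue(Task task) {
            levels[levelOf(task)].add(task);
        }

        Task poll() {
            for (Queue<Task> level : levels) {   // cheapest tasks first
                Task task = level.poll();
                if (task != null) {
                    return task;
                }
            }
            return null;
        }

        private static int levelOf(Task task) {
            for (int i = 0; i < LEVEL_THRESHOLD_SECONDS.length; i++) {
                if (task.cpuSeconds < LEVEL_THRESHOLD_SECONDS[i]) {
                    return i;
                }
            }
            return LEVEL_THRESHOLD_SECONDS.length;   // most expensive tasks
        }
    }

    final class Task {
        long cpuSeconds;   // aggregate thread CPU time of this task's splits
    }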
The scheduler must be adaptive when dealing with these constraints. The system provides a low-cost yield signal, so that long running computations can be stopped within an operator. If an operator exceeds the quanta, the scheduler 'charges' actual thread time to the task, and temporarily reduces future execution occurrences. This adaptive behavior allows us to handle the diversity of query shapes in the Interactive Analytics and Batch ETL use cases, where Presto gives higher priority to queries with the lowest resource consumption. This choice reflects the understanding that users expect inexpensive queries to complete quickly, and are less concerned about the turnaround time of larger, computationally-expensive jobs. Running more queries concurrently, even at the expense of more context-switching, results in lower aggregate queue time, since shorter queries exit the system quickly.
2) Memory Management: Memory poses one of the main resource management challenges in a multi-tenant system like Presto. In this section we describe the mechanism by which the engine controls memory allocations across the cluster.

Memory Pools: All non-trivial memory allocations in Presto must be classified as user or system memory, and reserve memory in the corresponding memory pool. User memory is memory usage that is possible for users to reason about given only basic knowledge of the system or input data (e.g., the memory usage of an aggregation is proportional to its cardinality). On the other hand, system memory is memory usage that is largely a byproduct of implementation decisions (e.g., shuffle buffers) and may be uncorrelated with query shape and input data volume.

The engine imposes separate restrictions on user and total (user + system) memory; queries that exceed a global limit (aggregated across workers) or per-node limit are killed. When a node runs out of memory, query memory reservations are blocked by halting processing for tasks. The total memory limit is usually set to be much higher than the user limit, and only a few queries exceed the total limit in production.
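A sketch of per-query accounting with separate user and total limits follows. In Presto, a reservation that cannot be satisfied either blocks or kills the query depending on which limit is breached; this simplified version collapses both into a single failure path:

    // Sketch of a per-node, per-query memory pool with two limits.
    final class QueryMemoryPool {
        private final long maxUserBytes;
        private final long maxTotalBytes;
        private long userBytes;
        private long systemBytes;

        QueryMemoryPool(long maxUserBytes, long maxTotalBytes) {
            this.maxUserBytes = maxUserBytes;
            this.maxTotalBytes = maxTotalBytes;
        }

        // User memory is attributable to the query shape (e.g., an
        // aggregation's hash table); callers reserve before allocating.
        synchronized void reserveUser(long bytes) {
            if (userBytes + bytes > maxUserBytes
                    || userBytes + systemBytes + bytes > maxTotalBytes) {
                throw new IllegalStateException("query exceeded memory limit");
            }
            userBytes += bytes;
        }

        // System memory covers implementation byproducts such as shuffle
        // buffers; it counts only against the total limit.
        synchronized void reserveSystem(long bytes) {
            if (userBytes + systemBytes + bytes > maxTotalBytes) {
                throw new IllegalStateException("query exceeded memory limit");
            }
            systemBytes += bytes;
        }

        synchronized void free(long userFreed, long systemFreed) {
            userBytes -= userFreed;
            systemBytes -= systemFreed;
        }
    }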
The per-node and global user memory limits on queries are usually distinct; this enables a maximum level of permissible usage skew. Consider a 500-node cluster with 100GB of query memory available per node and a requirement that individual queries can use up to 5TB globally. In this case, 10 queries can concurrently allocate up to that amount of total memory. However, if we want to allow for a 2:1 skew (i.e. one partition of the query consumes 2x the median memory), the per-node query memory limit would have to be set to 20GB. This means that only 5 queries are guaranteed to be able to run without exhausting the available node memory.

It is important that we be able to run more than 5 queries concurrently on a 500-node Interactive Analytics or Batch ETL cluster. Given that queries in these clusters vary wildly in their memory characteristics (skew, allocation rate, and allocation temporal locality), it is unlikely that all five queries allocate up to their limit on the same worker node at any given point in time. Therefore, it is generally safe to overcommit the memory of the cluster as long as mechanisms exist to keep the cluster healthy when nodes are low on memory. There are two such mechanisms in Presto – spilling, and reserved pools.

Spilling: When a node runs out of memory, the engine invokes the memory revocation procedure on eligible tasks. Revocation is processed by spilling state to disk. Presto supports spilling for hash joins and aggregations. However, we do not configure any of the Facebook deployments to spill. Cluster sizes are typically large enough to support several TBs of distributed memory, users appreciate the predictable latency of fully in-memory execution, and local disks would increase hardware costs (especially in Facebook's shared-storage deployments).

Reserved Pool: If a node runs out of memory and the cluster is not configured to spill, or there is no revocable memory remaining, the reserved memory mechanism is used to unblock the cluster. The query memory pool on every node is further sub-divided into two pools: general and reserved. When the general pool is exhausted on a worker node, the query using the most memory on that worker gets 'promoted' to the reserved pool on all worker nodes. In this state, the memory allocated to that query is counted towards the reserved pool rather than the general pool. To prevent deadlock (where different workers stall different queries) only a single query can enter the reserved pool across the entire cluster. If the general pool on a node is exhausted while the reserved pool is occupied, all memory requests from other tasks on that node are stalled. The query runs in the reserved pool until it completes, at which point the cluster unblocks all outstanding requests for memory. This is somewhat wasteful, as the reserved pool on every node must be sized to fit queries running up against the local memory limits. Clusters can be configured to instead kill the query that unblocks most nodes.
G1 collector, which deals poorly with objects larger than a
G. Fault Tolerance certain size. To limit the number of these objects, Presto avoids
Presto is able to recover from many transient errors using allocating objects or buffers bigger than the ‘humongous’
low-level retries. However, as of late 2018, Presto does not threshold and uses segmented arrays if necessary. Large and
have any meaningful built-in fault tolerance for coordinator highly linked object graphs can also be problematic due to
or worker node crash failures. Coordinator failures cause the maintenance of remembered set structures in G1 [10]. Data
cluster to become unavailable, and a worker node crash failure structures in the critical path of query execution are imple-
causes all queries running on that node to fail. Presto relies mented over flat memory arrays to reduce reference and object
on clients to automatically retry failed queries. counts and make the job of the GC easier. For example, the
In production at Facebook, we use external orchestration HISTOGRAM aggregation stores the bucket keys and counts
mechanisms to run clusters in different availability modes for all groups in a set of flat arrays and hash tables instead of
depending on the use case. The Interactive Analytics and Batch maintaining independent objects for each histogram.
ETL use cases run standby coordinators, while A/B Testing
and Developer/Advertiser Analytics run multiple active clus- B. Code Generation
ters. External monitoring systems identify nodes that cause an One of the main performance features of the engine is code
unusual number of failures and remove them from clusters, generation, which targets JVM bytecode. This takes two forms:
A. Working with the JVM

Presto is implemented in Java and runs on the Hotspot Java Virtual Machine (JVM). Extracting the best possible performance out of the implementation requires playing to the strengths and limitations of the underlying platform. Performance-sensitive code such as data compression or checksum algorithms can benefit from specific optimizations or CPU instructions. While there is no application-level mechanism to control how the JVM Just-In-Time (JIT) compiler generates machine code, it is possible to structure the code so that it can take advantage of optimizations provided by the JIT compiler, such as method inlining, loop unrolling, and intrinsics. We are exploring the use of Graal [22] in scenarios where the JVM is unable to generate optimal machine code, such as 128-bit math operations.
The choice of garbage collection (GC) algorithm can have dramatic effects on application performance and can even influence application implementation choices. Presto uses the G1 collector, which deals poorly with objects larger than a certain size. To limit the number of these objects, Presto avoids allocating objects or buffers bigger than the 'humongous' threshold and uses segmented arrays if necessary. Large and highly linked object graphs can also be problematic due to maintenance of remembered set structures in G1 [10]. Data structures in the critical path of query execution are implemented over flat memory arrays to reduce reference and object counts and make the job of the GC easier. For example, the HISTOGRAM aggregation stores the bucket keys and counts for all groups in a set of flat arrays and hash tables instead of maintaining independent objects for each histogram.
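A sketch of the segmented-array idea: one logical array is split into many small backing arrays, each safely below G1's humongous threshold (half a region). The segment size here is illustrative:

    // Sketch: segmented long array that avoids 'humongous' allocations.
    final class SegmentedLongArray {
        private static final int SEGMENT_LENGTH = 1 << 17;   // 128K longs = 1MB

        private final long[][] segments;

        SegmentedLongArray(long capacity) {
            int count = (int) ((capacity + SEGMENT_LENGTH - 1) / SEGMENT_LENGTH);
            segments = new long[count][];
            for (int i = 0; i < count; i++) {
                // Many small allocations instead of one large, GC-unfriendly one.
                segments[i] = new long[SEGMENT_LENGTH];
            }
        }

        long get(long index) {
            return segments[(int) (index / SEGMENT_LENGTH)][(int) (index % SEGMENT_LENGTH)];
        }

        void set(long index, long value) {
            segments[(int) (index / SEGMENT_LENGTH)][(int) (index % SEGMENT_LENGTH)] = value;
        }
    }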
B. Code Generation

One of the main performance features of the engine is code generation, which targets JVM bytecode. This takes two forms:

1) Expression Evaluation: The performance of a query engine is determined in part by the speed at which it can evaluate complex expressions. Presto contains an expression interpreter that can evaluate arbitrarily complex expressions that we use for tests, but is much too slow for production use evaluating billions of rows. To speed this up, Presto generates bytecode to natively deal with constants, function calls, references to variables, and lazy or short-circuiting operations.

2) Targeting JIT Optimizer Heuristics: Presto generates bytecode for several key operators and operator combinations. The generator takes advantage of the engine's superior knowledge of the semantics of the computation to produce bytecode that is more amenable to JIT optimization than that of a generic processing loop. There are three main behaviors that the generator targets:
• Since the engine switches between different splits from distinct task pipelines every quanta (Section IV-F1), the JIT would fail to optimize a common loop based implementation since the collected profiling information for the tight processing loop would be polluted by other tasks or queries.
• Even within the processing loop for a single task pipeline, the engine is aware of the types involved in each computation and can generate unrolled loops over columns. Eliminating target type variance in the loop body causes the profiler to conclude that call sites are monomorphic, allowing it to inline virtual methods.
• As the bytecode generated for every task is compiled into a separate Java class, each can be profiled independently by the JIT optimizer. In effect, the JIT optimizer further adapts a custom program generated for the query to the data actually processed. This profiling happens independently at each task, which improves performance in environments where each task processes a different partition of the data. Furthermore, the performance profile can change over the lifetime of the task as the data changes (e.g., time-series data or logs), causing the generated code to be updated.

Generated bytecode also benefits from the second order effects of inlining. The JVM is able to broaden the scope of optimizations, auto-vectorize larger parts of the computation, and can take advantage of frequency-based basic block layout to minimize branches. CPU branch prediction also becomes far more effective [7]. Bytecode generation improves the engine's ability to store intermediate results in registers or caches rather than in memory [16].
C. File Format Features

Scan operators invoke the Connector API with leaf split information and receive columnar data in the form of Pages. A page consists of a list of Blocks, each of which is a column with a flat in-memory representation. Using flat memory data structures is important for performance, especially for complex types. Pointer chasing, unboxing, and virtual method calls add significant overhead to tight loops.

Connectors such as Hive and Raptor take advantage of specific file format features where possible [20]. Presto ships with custom readers for file formats that can efficiently skip data sections by using statistics in file headers/footers (e.g., min-max range headers and Bloom filters). The readers can convert certain forms of compressed data directly into blocks, which can be efficiently operated upon by the engine (Section V-E).

Figure 5 shows the layout of a page with compressed encoding schemes for each column. Dictionary-encoded blocks are very effective at compressing low-cardinality sections of data and run-length encoded (RLE) blocks compress repeated data. Several pages may share a dictionary, which greatly improves memory efficiency. A column in an ORC file can use a single dictionary for an entire 'stripe' (up to millions of rows).
[Fig. 5. Different block types within a page: a LongBlock for partkey, an RLEBlock for returnflag (e.g., the value "F" repeated 6 times), and a DictionaryBlock for shipinstruct, whose per-row indices reference a dictionary such as 0: "IN PERSON", 1: "COD", 2: "RETURN", 3: "NONE".]

D. Lazy Data Loading

Presto supports lazy materialization of data. This functionality can leverage the columnar, compressed nature of file formats such as ORC, Parquet, and RCFile. Connectors can generate lazy blocks, which read, decompress, and decode data only when cells are actually accessed. Given that a large fraction of CPU time is spent decompressing and decoding and that it is common for filters to be highly selective, this optimization is highly effective when columns are infrequently accessed. Tests on a sample of production workload from the Batch ETL use case show that lazy loading reduces data fetched by 78%, cells loaded by 22% and total CPU time by 14%.
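A sketch of a lazy block follows, assuming a simple loader callback; Presto's actual Block interface is richer than this:

    // Sketch: decoding happens only on first access, so a highly selective
    // filter on another column may avoid this column's decode cost entirely.
    import java.util.function.Supplier;

    final class LazyBlock {
        private Supplier<long[]> loader;   // reads, decompresses, and decodes
        private long[] values;             // materialized on demand

        LazyBlock(Supplier<long[]> loader) {
            this.loader = loader;
        }

        long get(int position) {
            if (values == null) {
                values = loader.get();     // pay the decode cost only when used
                loader = null;
            }
            return values[position];
        }
    }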
E. Operating on Compressed Data

Presto operates on compressed data (i.e. dictionary and run-length-encoded blocks) sourced from the connector wherever possible. Figure 5 shows how these blocks are structured within a page. When a page processor evaluating a transformation or filter encounters a dictionary block, it processes all of the values in the dictionary (or the single value in a run-length-encoded block). This allows the engine to process the entire dictionary in a fast unconditional loop. In some cases, there are more values present in the dictionary than rows in the block. In this scenario the page processor speculates that the un-referenced values will be used in subsequent blocks. The page processor keeps track of the number of real rows produced and the size of the dictionary, which helps measure the effectiveness of processing the dictionary as compared to processing all the indices. If the number of rows is larger than the size of the dictionary, it is likely more efficient to process the dictionary instead. When the page processor encounters a new dictionary in the sequence of blocks, it uses this heuristic to determine whether to continue speculating.

Presto also leverages dictionary block structure when building hash tables (e.g., joins or aggregations). As the indices are processed, the operator records hash table locations for every dictionary entry in an array. If the entry is repeated for a subsequent index, it simply re-uses the location rather than re-computing it. When successive blocks share the same dictionary, the page processor retains the array to further reduce the necessary computation.

Presto also produces intermediate compressed results during execution. The join processor, for example, produces dictionary or run-length-encoded blocks when it is more efficient to do so. For a hash join, when the probe side of the join looks up keys in the hash table, it records value indices into an array rather than copying the actual data. The operator simply produces a dictionary block where the index list is that array, and the dictionary is a reference to the block in the hash table.
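A sketch of dictionary-aware filtering follows, with an invented example predicate: the predicate runs once per distinct dictionary value instead of once per row, and rows are then selected by index lookup:

    // Sketch: evaluate a filter over a dictionary block.
    final class DictionaryFilter {
        // values: the dictionary; ids: one entry per row referencing values.
        static int[] filterPositions(long[] values, int[] ids) {
            // Evaluate the predicate over the (small) dictionary once,
            // in a tight unconditional loop.
            boolean[] keep = new boolean[values.length];
            for (int i = 0; i < values.length; i++) {
                keep[i] = values[i] > 100;   // example predicate
            }
            // Select rows by looking up the precomputed result per id.
            int[] selected = new int[ids.length];
            int count = 0;
            for (int row = 0; row < ids.length; row++) {
                if (keep[ids[row]]) {
                    selected[count++] = row;
                }
            }
            return java.util.Arrays.copyOf(selected, count);
        }
    }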
VI. P ERFORMANCE q09 q18 q20 q26 q28 q35 q37 q44 q50 q54 q60 q64 q69 q71 q73 q76 q78 q80 q82

Hive/HDFS (no stats) Hive/HDFS (table/column stats) Raptor


In this section, we present performance results that demon-
Fig. 6. Query runtimes for a subset of TPC-DS
strate the impact of some of the main design decisions B. 40Flexibility
described in this paper. 35
Presto’s flexibility is in large part due to its low-latency data

Total Execution Time (min)


30
A. Adaptivity shuffle mechanism in conjunction with a Connector API that
25
Within Facebook, we run several different connectors in pro- supports performant processing of large volumes of data. Fig-
20
duction to allow users to process data stored in various internal ure 7 shows a distribution of query runtimes from production
15

systems. Table 1 outlines the connectors and deployments that deployments


10
of the selected use cases. We include only queries
are used to support the use cases outlined in Section II. that5 are successful and actually read data from storage. The
To demonstrate how Presto adapts to connector character- results
0 demonstrate that Presto can be configured to effectively
1 3 5 7 9 11
istics, we compare runtimes for queries from the TPC-DS serve web use cases withConcurrent
strict Queries
latency requirements (20-
benchmark at scale factor 30TB. Presto is capable of running 100ms) as well as programmatically scheduled ETL jobs that
all TPC-DS queries, but for this experiment we select a low- run for several hours.
10/10/2018 raghavsethi WIP > Main | unidash

100
memory subset that does not require spilling.
We use Presto version 0.211 with internal variants of 75
the Hive/HDFS and Raptor connectors. Raptor is a shared-
CDF (%)

nothing storage engine designed for Presto. It uses MySQL 50

for metadata and stores data on local flash disks in ORC


25
format. Raptor supports complex data organization (sorting,
bucketing, and temporal columns), but for this experiment 0
our data is randomly partitioned. The Hive connector uses an
s

ec

ec

in

in

in

hr
1h

5h
m

6m

se

1m

4m

19
1s

4s
16

64

16

17
25

internal service similar to the Hive Metastore and accesses files


Query Execution Time (seconds) (log scale)
encoded in an ORC-like format on a remote distributed filesys-
Dev/Advertiser Analytics A/B Testing
tem that is functionally similar to HDFS (i.e., a shared-storage Interactive Analytics Batch ETL
architecture). Performance characteristics of these connector Fig. 7. Query runtime distribution for selected use cases
variants are similar to deployments on public cloud providers. C. Resource Management
Every query is run with three settings on a 100-node test cluster: (1) data stored in Raptor with table shards randomly distributed between nodes; (2) data stored in Hive/HDFS with no statistics; and (3) data stored in Hive/HDFS along with table and column statistics. Presto's optimizer can make cost-based decisions about join order and join strategy when these statistics are available. Every node is configured with a 28-core Intel Xeon E5-2680 v4 CPU running at 2.40GHz, 1.6TB of flash storage, and 256GB of DDR4 RAM.

Figure 6 shows that Presto query runtime is greatly impacted by the characteristics of connectors. With no change to the query or cluster configuration, Presto is able to adapt to the connector by taking advantage of its characteristics, including throughput, latency, and the availability of statistics. It also demonstrates how a single Presto cluster can serve both as a traditional enterprise data warehouse (that data must be ingested into) and as a query engine over a Hadoop data warehouse. Data engineers at Facebook frequently use Presto in both of these roles.

[Fig. 6. Query runtimes for a subset of TPC-DS: execution time in seconds per query (q09-q82) for Hive/HDFS (no stats), Hive/HDFS (table/column stats), and Raptor. A companion chart plots total execution time in minutes at 1-11 concurrent queries for the same three settings.]
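To see the effect of setting (3) concretely, here is a minimal sketch using the Presto JDBC driver: SHOW STATS surfaces the row counts and column statistics the cost-based optimizer consumes, and EXPLAIN shows the resulting plan. The host and schema names are hypothetical, and the exact plan output and statistics columns vary across Presto versions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InspectJoinPlan {
    public static void main(String[] args) throws Exception {
        // Hypothetical coordinator and TPC-DS schema names.
        String url = "jdbc:presto://coordinator.example.com:8080/hive/tpcds";
        try (Connection connection = DriverManager.getConnection(url, "analyst", null);
             Statement statement = connection.createStatement()) {
            // Row counts, data sizes, and distinct-value estimates used by the
            // cost-based optimizer.
            print(statement.executeQuery("SHOW STATS FOR store_sales"));
            // With statistics, the plan reflects cost-based join ordering and
            // join distribution (broadcast vs. partitioned); without them,
            // Presto falls back to the syntactic join order.
            print(statement.executeQuery(
                "EXPLAIN SELECT d_year, sum(ss_net_paid) " +
                "FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk " +
                "GROUP BY d_year"));
        }
    }

    private static void print(ResultSet resultSet) throws Exception {
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}
```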
B. Flexibility

Presto's flexibility is in large part due to its low-latency data shuffle mechanism in conjunction with a Connector API that supports performant processing of large volumes of data. Figure 7 shows a distribution of query runtimes from production deployments of the selected use cases. We include only queries that are successful and actually read data from storage. The results demonstrate that Presto can be configured to effectively serve web use cases with strict latency requirements (20-100ms) as well as programmatically scheduled ETL jobs that run for several hours.

[Fig. 7. Query runtime distribution for selected use cases: CDF (%) of query execution time in seconds (log scale, roughly 1s to 5h) for Developer/Advertiser Analytics, A/B Testing, Interactive Analytics, and Batch ETL.]

C. Resource Management

Presto's integrated fine-grained resource management system allows it to quickly move CPU and memory resources between queries to maximize resource efficiency in multi-tenant clusters. Figure 8 shows a four-hour trace of CPU and concurrency metrics from one of our Interactive Analytics clusters. Even as demand drops from a peak of 44 queries to a low of 8 queries, Presto continues to utilize an average of ∼90% CPU across worker nodes. It is also worth noting that the scheduler prioritizes new and inexpensive workloads as they arrive to maintain responsiveness (Section IV-F1). It does this by allocating large fractions of cluster-wide CPU to new queries within milliseconds of them being admitted.
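As a toy illustration of this policy (Presto's production scheduler, described in Section IV-F1, is more sophisticated), the sketch below orders runnable tasks by accumulated CPU time: a newly admitted query has consumed no CPU, so its tasks sort to the front and receive a time slice almost immediately, while long-running queries yield.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

// A toy model, not Presto's actual scheduler: tasks from queries that have
// consumed the least CPU always run first.
final class SchedulableTask {
    final String queryId;
    final AtomicLong cpuNanosConsumed = new AtomicLong();

    SchedulableTask(String queryId) {
        this.queryId = queryId;
    }
}

public class LeastCpuFirstScheduler {
    private final PriorityQueue<SchedulableTask> runnable = new PriorityQueue<>(
            Comparator.comparingLong(t -> t.cpuNanosConsumed.get()));

    public synchronized void enqueue(SchedulableTask task) {
        runnable.add(task);
    }

    // Run the cheapest task for one short time slice, then re-queue it.
    // Re-inserting after each slice re-sorts the task by its new CPU total,
    // which is what keeps the ordering adaptive as costs accumulate.
    public synchronized void runOneSlice(long sliceNanos) {
        SchedulableTask task = runnable.poll();
        if (task == null) {
            return;
        }
        task.cpuNanosConsumed.addAndGet(sliceNanos); // stand-in for real work
        runnable.add(task);
    }
}
```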
TABLE I: PRESTO DEPLOYMENTS TO SUPPORT SELECTED USE CASES

Use Case                       | Query Duration  | Workload Shape                                                  | Cluster Size     | Concurrency     | Connector
-------------------------------|-----------------|-----------------------------------------------------------------|------------------|-----------------|--------------
Developer/Advertiser Analytics | 50 ms - 5 sec   | Joins, aggregations, and window functions                       | 10s of nodes     | 100s of queries | Sharded MySQL
A/B Testing                    | 1 sec - 25 sec  | Transform, filter, and join billions of rows                    | 100s of nodes    | 10s of queries  | Raptor
Interactive Analytics          | 10 sec - 30 min | Exploratory analysis on up to ∼3TB of data                      | 100s of nodes    | 50-100 queries  | Hive/HDFS
Batch ETL                      | 20 min - 5 hr   | Transform, filter, and join or aggregate 1-100+TB of input data | Up to 1000 nodes | 10s of queries  | Hive/HDFS

[Fig. 8. Cluster avg. CPU utilization and concurrency over a 4-hour period: worker average CPU utilization (%) and number of queries running, plotted against time (min after period start).]

VII. ENGINEERING LESSONS

Presto has been developed and operated as a service by a small team at Facebook since 2013. We observed that some engineering philosophies had an outsize impact on Presto's design through feedback loops in a rapidly evolving environment:
Adaptiveness over configurability: As a complex multi-tenant query engine that executes arbitrary user-defined computation, Presto must be adaptive not only to different query characteristics, but also to combinations of characteristics. For example, until Presto had end-to-end adaptive backpressure (Section IV-E2), large amounts of memory and CPU were utilized by a small number of jobs with slow clients, which adversely affected latency-sensitive jobs that were running concurrently. Without adaptiveness, it would be necessary to narrowly partition workloads and tune configuration for each workload independently. That approach would not scale to the wide variety of query shapes that we see in production.
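A minimal sketch of the backpressure idea, reduced to a single bounded buffer between an operator and its client: when a slow client stops draining, the producer blocks instead of accumulating memory. Presto's actual mechanism is asynchronous and propagates the signal end-to-end across stages, so the blocking shown here is a deliberate simplification.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of backpressure: a bounded buffer between a producer (operator)
// and a possibly slow consumer (client). When the consumer stops reading,
// produce() blocks, which stops the producer from burning more CPU and
// memory on behalf of that query.
public class BackpressureBuffer<T> {
    private final BlockingQueue<T> buffer;

    public BackpressureBuffer(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    public void produce(T page) throws InterruptedException {
        buffer.put(page); // blocks when the downstream is slow
    }

    public T consume() throws InterruptedException {
        return buffer.take();
    }
}
```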
Effortless instrumentation: Presto exposes fine-grained performance statistics at the query and node level. We maintain our own libraries for efficient statistics collection which use flat memory for approximate data structures. It is important to encourage observable system design and to allow engineers to instrument and understand the performance of their code. Our libraries make adding statistics as easy as annotating a method. As a consequence, the median Presto worker node exports ∼10,000 real-time performance counters, and we collect and store operator-level statistics (and merge them up to the task and stage level) for every query. Our investment in telemetry tooling allows us to be data-driven when optimizing the system.
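As an illustration of "statistics as easy as annotating a method", the sketch below exports a counter using the @Managed annotation from jmxutils, an open-source JMX-export library used in Presto's stack; the MBean-exporter registration and the flat-memory approximate structures mentioned above are omitted.

```java
import java.util.concurrent.atomic.LongAdder;
import org.weakref.jmx.Managed;

// Sketch of annotation-driven instrumentation: once this object is
// registered with an MBean exporter, the annotated getter becomes a
// real-time JMX counter with no further wiring.
public class ScanStats {
    private final LongAdder rowsScanned = new LongAdder();

    public void recordRows(long rows) {
        rowsScanned.add(rows);
    }

    @Managed // exported as a performance counter once registered
    public long getRowsScanned() {
        return rowsScanned.sum();
    }
}
```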
Static configuration: Operational issues in a complex system like Presto are difficult to root cause and mitigate quickly. Configuration properties can affect system performance in ways that are hard to reason about, and we prioritize being able to understand the state of the cluster over the ability to change configuration quickly. Unlike several other systems at Facebook, Presto uses static rather than dynamic configuration wherever possible. We developed our own configuration library, which is designed to fail 'loudly' by crashing at startup if there are any warnings; this includes unused, duplicated, or conflicting properties. This model poses its own set of challenges. However, with a large number of clusters and configuration sets, it is more efficient to shift complexity from operational investigations to the deployment process and tooling.
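A minimal sketch of this style of configuration, modeled on the Airlift @Config setter binding that underlies Presto's open-source configuration library; the property name is hypothetical, and the fail-at-startup behavior comes from the bootstrap layer, which can reject unknown, unused, or duplicated properties instead of warning and continuing.

```java
import io.airlift.configuration.Config;

// A config class in the Airlift style: each property binds to one annotated
// setter, so the full set of legal properties is statically known and the
// bootstrap can crash loudly on anything unexpected.
public class ShuffleConfig {
    private int maxBufferSizeMb = 32;

    public int getMaxBufferSizeMb() {
        return maxBufferSizeMb;
    }

    @Config("shuffle.max-buffer-size-mb") // hypothetical property name
    public ShuffleConfig setMaxBufferSizeMb(int maxBufferSizeMb) {
        this.maxBufferSizeMb = maxBufferSizeMb;
        return this;
    }
}
```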
Vertical integration: Like other engineering teams, we design custom libraries for components where performance and efficiency are important. For example, custom file-format readers allow us to use Presto-native data structures end-to-end and avoid conversion overhead. However, we observed that the ability to easily debug and control library behaviors is equally important when operating a highly multi-threaded system that performs arbitrary computation in a long-lived process.

Consider an example of a recent production issue. Presto uses the Java built-in gzip library. While debugging a sequence of process crashes, we found that interactions between glibc and the gzip library (which invokes native code) caused memory fragmentation. For specific workload combinations, this caused large native memory leaks. To address this, we changed the way we use the library to influence the right cache-flushing behavior, but in other cases we have gone as far as writing our own libraries for compression formats.

Custom libraries can also improve developer efficiency: they reduce the surface area for bugs by implementing only the necessary features, unify configuration management, and support detailed instrumentation matched to our use case.
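The exact production fix is internal; as a general illustration of taking explicit control of a native-backed JDK library, the sketch below frees zlib's native state deterministically with end() instead of relying on finalization, which keeps native memory usage predictable in a long-lived JVM.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class NativeSafeCompression {
    // Compress a buffer with explicit cleanup of native zlib resources.
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
        finally {
            deflater.end(); // frees zlib's native buffers immediately
        }
    }
}
```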
VIII. RELATED WORK

Systems that run SQL against large data sets have become popular over the past decade. Each of these systems presents a unique set of tradeoffs. A comprehensive examination of the space is outside the scope of this paper. Instead, we focus on some of the more notable work in the area.

Apache Hive [21] was originally developed at Facebook to provide a SQL-like interface over data stored in HDFS, and executes queries by compiling them into MapReduce [9] or Tez [18] jobs. Spark SQL [4] is a more modern system built on the popular Spark engine [23], which addresses many of the limitations of MapReduce. Spark SQL can run large queries over multiple distributed data stores, and can operate on intermediate results in memory. However, these systems do not support end-to-end pipelining, and usually persist data to a filesystem during inter-stage shuffles. Although this improves fault tolerance, the additional latency causes such systems to be a poor fit for interactive or low-latency use cases.

Products like Vertica [15], Teradata, Redshift, and Oracle Exadata can read external data to varying degrees. However, they are built around an internal data store and achieve peak performance when operating on data loaded into the system. Some systems take the hybrid approach of integrating RDBMS-style and MapReduce execution, such as Microsoft SQL Server Polybase [11] (for unstructured data) and Hadapt [5] (for performance). Apache Impala [14] can provide interactive latency, but operates within the Hadoop ecosystem. In contrast, Presto is data source agnostic. Administrators can deploy Presto with a vertically-integrated data store like Raptor, but can also configure Presto to query data from a variety of systems (including relational/NoSQL databases, proprietary internal services, and stream processing systems) with low overhead, even within a single Presto cluster.

Presto builds on a rich history of innovative techniques developed by the systems and database community. It uses techniques similar to those described by Neumann [16] and Diaconu et al. [12] on compiling query plans to significantly speed up query processing. It operates on compressed data where possible, using techniques from Abadi et al. [3], and generates compressed intermediate results. It can select the optimal layout from multiple projections à la Vertica and C-Store [19], and uses strategies similar to Zhou et al. [24] to minimize shuffles by reasoning about plan properties.
IX. CONCLUSION

In this paper, we presented Presto, an open-source MPP SQL query engine developed at Facebook to quickly process large data sets. Presto is designed to be flexible; it can be configured for high-performance SQL processing in a variety of use cases. Its rich plugin interface and Connector API make it extensible, allowing it to integrate with various data sources and be effective in many environments. The engine is also designed to be adaptive; it can take advantage of connector features to speed up execution, and can automatically tune read and write parallelism, network I/O, operator heuristics, and scheduling to the characteristics of the queries running in the system. Presto's architecture enables it to service workloads that require very low latency and also to process expensive, long-running queries efficiently.

Presto allows organizations like Facebook to deploy a single SQL system to deal with multiple common analytic use cases and easily query multiple storage systems while also scaling up to ∼1000 nodes. Its architecture and design have found a niche within the crowded SQL-on-Big-Data space. Adoption at Facebook and in the industry is growing quickly, and our open-source community continues to remain engaged.

ACKNOWLEDGEMENT

We would like to thank Vaughn Washington, Jay Tang, Ravi Murthy, Ahmed El Zein, Greg Leclercq, Varun Gajjala, Ying Su, Andrii Rosa, Rebecca Schlussel, German Gil, Jiexi Lin, Masha Basmanova, Rongrong Zhong, Shixuan Fan, Elon Azoulay, Timothy Meehan, and many others at Facebook for their contributions to this paper and to Presto. We're thankful to David DeWitt, Nathan Bronson, Mayank Pundir, and Pedro Pedreira for their feedback on drafts of this paper. We are very grateful for the contributions and continued support of Piotr Findeisen, Grzegorz Kokosiński, Łukasz Osipiuk, Karol Sobczak, Piotr Nowojski, and the rest of the Presto open-source community.

REFERENCES

[1] G. Graefe. Volcano: An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, 6(1):120–135, 1994.
[2] SQL – Part 1: Framework (SQL/Framework). ISO/IEC 9075-1:2016, International Organization for Standardization, 2016.
[3] D. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
[4] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In SIGMOD, 2015.
[5] K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and E. Paulson. Efficient Processing of Data Warehousing Queries in a Split Execution Environment. In SIGMOD, 2011.
[6] M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. In SIGMOD, 2005.
[7] B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: A SQL Implementation on the MapReduce Framework. PVLDB, 4:1318–1327, 2011.
[8] F. J. Corbató, M. Merwin-Daggett, and R. C. Daley. An Experimental Time-Sharing System. In Proceedings of the Spring Joint Computer Conference, 1962.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[10] D. Detlefs, C. Flood, S. Heller, and T. Printezis. Garbage-First Garbage Collection. In ISMM, 2004.
[11] D. J. DeWitt, A. Halverson, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split Query Processing in Polybase. In SIGMOD, 2013.
[12] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-Optimized OLTP Engine. In SIGMOD, 2013.
[13] G. Graefe. The Cascades Framework for Query Optimization. IEEE Data Engineering Bulletin, 18(3):19–29, 1995.
[14] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR, 2015.
[15] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear. The Vertica Analytic Database: C-Store 7 Years Later. PVLDB, 5(12):1790–1801, 2012.
[16] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9):539–550, 2011.
[17] A. Rasmussen, M. Conley, G. Porter, R. Kapoor, A. Vahdat, et al. Themis: An I/O-Efficient MapReduce. In SoCC, 2012.
[18] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications. In SIGMOD, 2015.
[19] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: A Column-oriented DBMS. In VLDB, 2005.
[20] D. Sundstrom. Even Faster: Data at the Speed of Presto ORC, 2015. https://code.facebook.com/posts/370832626374903/even-faster-data-at-the-speed-of-presto-orc/.
[21] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive: A Petabyte Scale Data Warehouse Using Hadoop. In ICDE, 2010.
[22] T. Würthinger, C. Wimmer, A. Wöß, L. Stadler, G. Duboscq, C. Humer, G. Richards, D. Simon, and M. Wolczko. One VM to Rule Them All. In ACM Onward!, pages 187–204. ACM, 2013.
[23] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012.
[24] J. Zhou, P.-A. Larson, and R. Chaiken. Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer. In ICDE, 2010.
