Communications of the ACM, July 2016
CACM.ACM.ORG
Research Highlights: Formula-Based Debugging; Google's Mesa Data Warehousing System
Contributed Articles
Association for Computing Machinery
SPLASH 2016: Sun 30 October – Fri 4 November 2016, Mövenpick Amsterdam
Co-located: Onward!, SPLASH-I, SPLASH-E, DLS, GPCE, SLE
World-class speakers on current topics in software, systems, and languages research
Publications: Alex Potanin
@splashcon | 2016.splashcon.org | bit.ly/splashcon16
Is Internet software so different from ordinary software? This book practically answers this question through the presentation of a software design method based on the State Chart XML W3C standard along with Java. Web enterprise, Internet-of-Things, and Android applications, in particular, are seamlessly specified and implemented from executable models. Internet software puts forward the idea of event-driven or reactive programming, as pointed out in Bonér et al.'s Reactive Manifesto. It tells us that reactiveness is a must. However, beyond concepts, software engineers require effective means with which to put reactive programming into practice. Reactive Internet Programming outlines and explains such means.

The lack of professional examples in the literature that illustrate how reactive software should be shaped can be quite frustrating. Therefore, this book helps to fill in that gap by providing in-depth professional case studies that contain comprehensive details and meaningful alternatives. Furthermore, these case studies can be downloaded for further investigation.

Internet software requires higher adaptation, at run time in particular. After reading Reactive Internet Programming, you will be ready to enter the forthcoming Internet era.
07/2016 | VOL. 59, NO. 07

News
12 Graph Matching in Theory and Practice
17 Booming Enrollments

Viewpoints
22 Legally Speaking
25 Historical Reflections
Inverse Privacy: Seeking a market-based solution to the problem of a person's unjustified inaccessibility to their private information. By Yuri Gurevich, Efim Hudis, and Jeannette M. Wing

Editor's Letter
Cerf's Up

Last Byte
128 Upstart Puzzles: Chair Games. By Dennis Shasha

Practice | Contributed Articles | Review Articles
44 Should You Upload or
68 Formula-Based Software Debugging
96 The Rise of Social Bots

Research Highlights
106 Technical Perspective: Combining Logic and Probability. By Henry Kautz and Parag Singla
107 Probabilistic Theorem Proving
Communications of the ACM is the leading monthly print and online magazine for the computing and information technology fields.
Communications is recognized as the most trusted and knowledgeable source of industry information for today's computing professional.
Communications brings its readership in-depth coverage of emerging areas of computer science, new trends in information technology,
and practical applications. Industry leaders use Communications as a platform to present and debate various technology implications,
public policies, engineering challenges, and market trends. The prestige and unmatched reputation that Communications of the ACM
enjoys today is built upon a 50-year commitment to high-quality editorial content and a steadfast dedication to advancing the arts,
sciences, and applications of information technology.
ACM, the world's largest educational
and scientific computing society, delivers
resources that advance computing as a
science and profession. ACM provides the
computing field's premier Digital Library
and serves its members and the computing
profession with leading-edge publications,
conferences, and career resources.
Executive Director and CEO
Bobby Schnabel
Deputy Executive Director and COO
Patricia Ryan
Director, Office of Information Systems
Wayne Graves
Director, Office of Financial Services
Darren Ramdin
Director, Office of SIG Services
Donna Cappo
Director, Office of Publications
Bernard Rous
Director, Office of Group Publishing
Scott E. Delman
ACM COUNCIL
President
Alexander L. Wolf
Vice-President
Vicki L. Hanson
Secretary/Treasurer
Erik Altman
Past President
Vinton G. Cerf
Chair, SGB Board
Patrick Madden
Co-Chairs, Publications Board
Jack Davidson and Joseph Konstan
Members-at-Large
Eric Allman; Ricardo Baeza-Yates;
Cherri Pancake; Radia Perlman;
Mary Lou Soffa; Eugene Spafford;
Per Stenström
SGB Council Representatives
Paul Beame; Jenna Neefe Matthews;
Barbara Boucher Owens
STAFF
Moshe Y. Vardi
[email protected]
Executive Editor
Diane Crawford
Managing Editor
Thomas E. Lambert
Senior Editor
Andrew Rosenbloom
Senior Editor/News
Larry Fisher
Web Editor
David Roman
Rights and Permissions
Deborah Cotton
NEWS
Art Director
Andrij Borys
Associate Art Director
Margaret Gray
Assistant Art Director
Mia Angelica Balaquiot
Designer
Iwona Usakiewicz
Production Manager
Lynn D'Addesio
Director of Media Sales
Jennifer Ruzicka
Publications Assistant
Juliet Chance
Columnists
David Anderson; Phillip G. Armour;
Michael Cusumano; Peter J. Denning;
Mark Guzdial; Thomas Haigh;
Leah Hoffmann; Mari Sako;
Pamela Samuelson; Marshall Van Alstyne
CONTACT POINTS
Copyright permission
[email protected]
Calendar items
[email protected]
Change of address
[email protected]
Letters to the Editor
[email protected]
BOARD CHAIRS
Education Board
Mehran Sahami and Jane Chu Prey
Practitioners Board
George Neville-Neil
WEBSITE
http://cacm.acm.org
AUTHOR GUIDELINES
http://cacm.acm.org/
REGIONAL COUNCIL CHAIRS
ACM Europe Council
Dame Professor Wendy Hall
ACM India Council
Srinivas Padmanabhuni
ACM China Council
Jiaguang Sun
EDITORIAL BOARD
Scott E. Delman
[email protected]
Co-Chairs
William Pulleyblank and Marc Snir
Board Members
Mei Kobayashi; Michael Mitzenmacher;
Rajeev Rastogi
VIEWPOINTS
Co-Chairs
Tim Finin; Susanne E. Hambrusch;
John Leslie King
Board Members
William Aspray; Stefan Bechtold;
Michael L. Best; Judith Bishop;
Stuart I. Feldman; Peter Freeman;
Mark Guzdial; Rachelle Hollander;
Richard Ladner; Carl Landwehr;
Carlos Jose Pereira de Lucena;
Beng Chin Ooi; Loren Terveen;
Marshall Van Alstyne; Jeannette Wing
PRACTICE
Co-Chair
Stephen Bourne
Board Members
Eric Allman; Peter Bailis; Terry Coatta;
Stuart Feldman; Benjamin Fried;
Pat Hanrahan; Tom Killalea; Tom Limoncelli;
Kate Matsudaira; Marshall Kirk McKusick;
George Neville-Neil; Theo Schlossnagle;
Jim Waldo
The Practice section of the CACM Editorial Board also serves as the Editorial Board of acmqueue.
CONTRIBUTED ARTICLES
Co-Chairs
Andrew Chien and James Larus
Board Members
William Aiello; Robert Austin; Elisa Bertino;
Gilles Brassard; Kim Bruce; Alan Bundy;
Peter Buneman; Peter Druschel; Carlo Ghezzi;
Carl Gutwin; Yannis Ioannidis;
Gal A. Kaminka; James Larus; Igor Markov;
Gail C. Murphy; Bernhard Nebel;
Lionel M. Ni; Kenton O'Hara; Sriram Rajamani;
Marie-Christine Rousset; Avi Rubin;
Krishan Sabnani; Ron Shamir; Yoav
Shoham; Larry Snyder; Michael Vitale;
Wolfgang Wahlster; Hannes Werthner;
Reinhard Wilhelm
RESEARCH HIGHLIGHTS
Co-Chairs
Azer Bestavros and Gregory Morrisett
Board Members
Martin Abadi; Amr El Abbadi; Sanjeev Arora;
Nina Balcan; Dan Boneh; Andrei Broder;
Doug Burger; Stuart K. Card; Jeff Chase;
Jon Crowcroft; Sandhya Dwarkadas;
Matt Dwyer; Alon Halevy; Norm Jouppi;
Andrew B. Kahng; Sven Koenig; Xavier Leroy;
Steve Marschner; Kobbi Nissim;
Steve Seitz; Guy Steele, Jr.; David Wagner;
Margaret H. Wright; Andreas Zeller
WEB
Chair
James Landay
Board Members
Marti Hearst; Jason I. Hong;
Jeff Johnson; Wendy E. MacKay
Editor's Letter
DOI:10.1145/2945075
Moshe Y. Vardi
Cerf's Up
DOI:10.1145/2949336
Vinton G. Cerf
Rethinking
Computational Thinking
Results of ACM's 2016 General Election
President: Vicki L. Hanson (term: July 1, 2016 – June 30, 2018)
Vice President: Cherri Pancake (term: July 1, 2016 – June 30, 2018)
Secretary/Treasurer: Elizabeth Churchill (term: July 1, 2016 – June 30, 2018)
Members at Large:
Gabriele Anderst-Kotsis (term: July 1, 2016 – June 30, 2020)
Susan Dumais (term: July 1, 2016 – June 30, 2020)
Elizabeth Mynatt (term: July 1, 2016 – June 30, 2020)
Pam Samuelson (term: July 1, 2016 – June 30, 2020)
Eugene H. Spafford (term: July 1, 2016 – June 30, 2020)
change the perception of the American public, but such change is essential and should be embraced as
quickly as possible and on a national
basis. ACM is best positioned and able
to provide the leadership needed to
move this important step forward for
the overall discipline of computer science.
James Geller, Newark, NJ
Computational Biology in the 21st Century
Smart Cities
Debugging Distributed Systems
Skills for Success Across Three IT Career Stages
Adaptive Computation: In Memory of John Holland
Ur/Web: A Simple Model for Programming the Web
Verifying Quantitative Reliability for Programs that Execute on Unreliable Hardware
Plus the latest news about deep reinforcement learning, the value of open source, and apps for social good.
DOI:10.1145/2933410
http://cacm.acm.org/blogs/blog-cacm
Progress in
Computational Thinking,
and Expanding
the HPC Community
Jeannette Wing considers the proliferation of
computational thinking, while Dan Stanzione
hopes to bring more HPC practitioners to SC16.
Jeannette Wing
Computational
Thinking,
10 Years Later
http://bit.ly/1WAXka7
March 23, 2016
Not in my lifetime.
That is what I said when I was asked
whether we would ever see computer
science taught in K-12. It was 2009, and
I was addressing a gathering of attendees to a workshop on computational
thinking (http://bit.ly/1NjmcRJ) convened by the National Academies.
I am happy to say that I was wrong.
It has been 10 years since I published
my three-page Computational Thinking Viewpoint (http://bit.ly/1W73ekv)
in the March 2006 issue of Communications. To celebrate its anniversary, let us
consider how far we've come.
Think back to 2005. Since the dot-com bust, there had been a steep and
steady decline in undergraduate enrollments in computer science, with
guidance for the national curriculum
says, "A high-quality computing education equips pupils to use computational thinking and creativity to understand and change the world."
In addition, the BBC in partnership
with Microsoft and other companies
funded the design and distribution of
the BBC micro:bit (https://www.microbit.co.uk/). One million of these programmable devices were distributed
free earlier this year (March 2016), one
for every 11–12-year-old (Year 7) student
in the U.K., along with their teachers.
Microsoft Research contributed to the
design and testing of the device, and the
MSR Labs Touch Develop team provided a programming language and platform for the BBC micro:bit, as well as
teaching materials.
Second, code.org is a nonprofit organization, started in 2013, dedicated
to the mission of providing access to
computer science education to all. Microsoft, along with hundreds of other
corporate and organizational partners,
helps sponsor the activities of code.org.
Third, internationally there is a
groundswell of interest in teaching
computer science at the K-12 level. I
know of efforts in Australia, Israel, Singapore, and South Korea. China is likely
to make a push soon, too.
Computer Science for All
Most gratifying to me is President
Barack Obama's pledge to provide $4
billion in funding for computer science education in U.S. schools as part
of the Computer Science for All Initiative (http://1.usa.gov/21u4mxK) he
announced on Jan. 30. That initiative
includes $120 million from the National Science Foundation, which will
be used to train as many as 9,000 more
high school teachers to teach computer
science and integrate computational
thinking into their curriculum. This
push for all students to learn computer
science comes partly from market demand for workers skilled in computing
from all sectors, not just information
technology. We see this at Microsoft,
too; our enterprise customers in all sectors are coming to Microsoft because
they need more computing expertise.
Practical challenges and research opportunities remain. The main practical
challenge is that we do not have enough
K-12 teachers trained to teach computer science to K-12 students. I am optimistic that, over time, we will solve this.
There also are interesting research
questions I would encourage computer
scientists to pursue, working with the
cognitive and learning sciences communities. First, what computer science concepts should be taught when, and how?
Consider an analogy to mathematics. We teach numbers to 5-year-olds,
algebra to 12-year-olds, and calculus
to 18-year-olds. We have somehow figured out the progression of concepts to
teach in mathematics, where learning
one new concept builds on understanding the previous concept, and where the
progression reflects the progression of
mathematical sophistication of a child
as he or she matures.
What is that progression in computer science? For example, when is it
best to teach recursion? Children learn
to solve the Towers of Hanoi puzzle (for
small n), and in history class we teach
divide and conquer as a strategy for
winning battles. But is the general concept better taught in high school? We
teach long division to 9-year-olds in 4th
grade, but we never utter the word "algorithm." And yet the way it is taught,
long division is just an algorithm. Is
teaching the general concept of an
algorithm too soon for a 4th grader?
More deeply, are there concepts in
computing that are innate and do not
need to be formally learned?
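For readers who want the recursion example in front of them, here is a minimal Python sketch of the Towers of Hanoi solution mentioned above; it is offered purely as an illustration of the recursive idea, not as a claim about how or when the concept should be taught.

# Illustrative sketch: the classic recursive solution to the Towers of Hanoi.
# Moving n disks reduces to moving n - 1 disks out of the way, moving the
# largest disk, then moving the n - 1 disks back on top: 2**n - 1 moves total.
def hanoi(n, source, target, spare):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target)   # clear the smaller disks
    print(f"move disk {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source)   # stack them onto the target peg

hanoi(3, "A", "C", "B")  # prints the 7 moves for three disks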
Second, we need to understand how
best to use computing technology in the
classroom. Throwing computers in the
classroom is not the most effective way
to teach computer science concepts.
How can we use technology to enhance
the learning and reinforce the understanding of computer science concepts? How can we use technology to
measure progress, learning outcomes,
and retention over time? How can we
use technology to personalize the learning for individual learners, as each of us
learns at a different pace and has different cognitive abilities?
We have made tremendous progress in injecting computational thinking into research and education of
all fields in the last 10 years. We still
have a ways to go, but fortunately,
academia, industry and government
forces are aligned toward realizing the
vision of making computational thinking commonplace.
Dan Stanzione
SC16 Expands Focus
on HPC Provider
Community,
Practitioners
http://bit.ly/1REKjKl
April 6, 2016
If you are in HPC (high-performance computing) or a related field, you know SC16
(http://sc16.supercomputing.org/)
as
the leading international conference for
high-performance computing, networking, storage, and analysis. For 28 years,
SC has served as the conference of record
in the supercomputing community for
presenting the results of groundbreaking
new research, getting the training needed
to advance your career, and discovering
what is new in the marketplace.
SC16 marks the beginning of a multiyear emphasis designed to advance the
state of the practice in the HPC community by providing a track for professionals driving innovation and development
in designing, building, and operating
the world's largest supercomputers,
along with the system and application
software that make them run effectively.
We call this the State of the Practice.
State of the Practice submissions
will add content about innovations
and best practices development from
the HPC service provider community
into all aspects of the SC16 technical
program (http://bit.ly/1T0Z6yx), from
tutorials and workshops to papers and
posters. These submissions will be
peer reviewed, as are all submissions
to SC. However, the evaluation criteria
will acknowledge the fundamental difference between innovation that leads
the state of HPC practice in the field
today, and research results that will reshape the field tomorrow.
If you are part of the SC community
but have not always felt SC was the right
venue to showcase your contributions to
the HPC body of knowledge, we want to
encourage you to submit to the technical program on the reinvigorated State
of the Practice track.
Check the important dates page
(http://bit.ly/1WaLX9j) for upcoming
submission deadlines.
Jeannette M. Wing is corporate vice president at
Microsoft Research. Dan Stanzione is Executive Director
of the Texas Advanced Computing Center at the University
of Texas at Austin and serves as co-chair of the State of
the Practice track at SC16.
2016 ACM 0001-0782/16/07 $15.00
news
Science | DOI:10.1145/2933412
Neil Savage
Graph Matching in
Theory and Practice
A theoretical breakthrough in graph isomorphism excites
complexity experts, but will it lead to any practical improvements?
Back in 1979, two scientists
wrote a seminal textbook
on computational complexity theory, describing how
some problems are hard to
solve. The known algorithms for handling them grow in complexity so fast
that no computer can be guaranteed to
solve even moderately sized problems
in the lifetime of the universe. While
most problems could be deemed either relatively easy or hard for a computer to solve, a few fell into a strange
nether region where they could not
be classified as either. The authors,
Michael Garey and David S. Johnson,
helpfully provided an appendix listing
a dozen problems not known to fit into
one category or the other.
"The very first one that's listed is graph isomorphism," says Lance Fortnow, chair of computer science at the
Georgia Institute of Technology. In the
decades since, most of the problems
on that list were slotted into one of the
two categories, but solving graph isomorphism (in essence, figuring out if two graphs that look different are in fact the same) resisted categorization. Graph isomorphism just didn't fall.
Now László Babai, a professor of
computer science and mathematics
at the University of Chicago, has developed a new algorithm for the problem
that pushes it much closer to, but not all the way into, the "easy" category, a
result complexity experts are hailing
as a major theoretical achievement, although whether his work will have any
practical effect is unclear.
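For readers who want to see the problem itself in code (this is not Babai's algorithm, which is far more involved), the following minimal Python sketch uses the networkx library's built-in isomorphism test on two small graphs that are labeled differently but structurally identical; the graphs are invented for illustration.

# Illustrative only: networkx's practical isomorphism test (VF2-based),
# not the new quasipolynomial-time algorithm discussed in the article.
import networkx as nx

# Two 5-cycles with different vertex labels: they "look different" on paper
# but have exactly the same structure.
g1 = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
g2 = nx.Graph([("a", "c"), ("c", "e"), ("e", "b"), ("b", "d"), ("d", "a")])
print(nx.is_isomorphic(g1, g2))  # True: identical structure under relabeling

# A 5-cycle versus a path on five vertices: not isomorphic.
g3 = nx.path_graph(5)
print(nx.is_isomorphic(g1, g3))  # False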
What Babai's solution does is check subsections of graphs for isomorphism through a variety of relatively simple means.
ACM Member News
INDYK STRIVES TO BOOST EFFICIENCY OF ALGORITHMS
"My research interests are generally in the design and analysis of efficient algorithms," says Piotr Indyk of the Theory of Computation Group of the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory.
"I am interested in developing solutions to fundamental algorithmic problems that impact the practice of computing," adds Indyk. "Some examples include high-dimensional similarity search, which involves finding pairs of high-dimensional vectors that are close to each other; faster Fourier transform algorithms for signals with sparse spectra; and so forth."
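As a rough, hypothetical illustration of the similarity-search problem Indyk describes (and not of his algorithms, which are designed to avoid exactly this kind of exhaustive scan), the following Python sketch does a brute-force search for the closest pair among randomly generated high-dimensional vectors.

# Illustrative sketch: brute-force closest-pair search over 128-dimensional
# vectors. The data is random; real similarity search aims to beat this
# roughly quadratic amount of work.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 128))   # 500 random 128-dimensional vectors

best = (None, None, float("inf"))
for i in range(len(vectors) - 1):
    # Distances from vector i to every later vector, computed in one step.
    dists = np.linalg.norm(vectors[i + 1:] - vectors[i], axis=1)
    j = int(np.argmin(dists))
    if dists[j] < best[2]:
        best = (i, i + 1 + j, float(dists[j]))

print("closest pair of vectors:", best)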
Born in Poland, Indyk
received a Magister degree from
the University of Warsaw in
1995, and a Ph.D. in computer
science from Stanford University
in 2000. That same year, he
joined the faculty of MIT, where
he has been ever since.
"Recently, I managed to make some progress on whether we can show, or at least provide evidence, that some of those problems cannot be solved faster than what is currently known," Indyk said. "I have also been working on proving there are certain natural problems (like regular expression matching, for example) that require quadratic time to solve, assuming some natural conjectures or hypotheses. This has been my main focus over the past year, which is a little unusual for me. This is because I typically work on algorithms and try to make them faster, as opposed to showing they might not be improved."
During an upcoming
sabbatical at Stanford
University, Indyk plans to work
with researchers who develop
tools for proving conditional
hardness of certain problems.
"Stanford is my alma mater, so I am looking forward to it."
John Delaney
there is evidence it is not NP-complete either. The best theoretical algorithm, published by Babai and Eugene Luks in 1983, was sub-exponential: better than any known algorithm for any NP-complete problem, but still far from easy. This new algorithm appears to place graph isomorphism in quasipolynomial time, much closer to the easy side. "This is just such a huge theoretical improvement from what we had before," Fortnow says. "Now it's not likely at all to be NP-complete and it's not that hard at all."
Josh Grochow, an Omidyar Postdoctoral Fellow in theoretical computer science at the Santa Fe Institute in New Mexico, calls Babai's proposed solution a big theoretical jump. "There really are very few problems we know of in this limbo state," he says. "Graph isomorphism is still in the limbo state, but it's a lot clearer where it sits."
In some subset of graphs, the question was already settled. For instance,
molecular graphs have bounded
valence, meaning the physical constraints of three-dimensional space
allow atoms to be connected only to a
limited number of other atoms, says
Jean-Loup Faulon, a synthetic biologist
at the University of Manchester, England, and INRA in France, who uses
graph matching in his work. In the
1980s, he points out, Babai, Luks, and
Christoph Hoffmann showed bounded valence graphs could be solved in polynomial time. "Therefore, the problem of graph isomorphism is already known to be polynomial for chemicals," Faulon says.
While Babai's work has generated much excitement among theoreticians, experts say it is not likely to have much effect in practice; at least, not immediately. Scott Aaronson, a computer scientist at the Massachusetts Institute of Technology who studies computational theory, says "this is obviously the greatest theoretical algorithms result at least since 2002," when Manindra Agrawal, Neeraj Kayal, and Nitin Saxena came up with an algorithm for determining, in polynomial time, whether a number is a prime. On the other hand, Aaronson says, "the immediate practical impact on computing is zero," because we already had graph isomorphism algorithms that were extremely fast in practice for any
Technology | DOI:10.1145/2933416
Marina Krakovsky
Accelerating Search
The latest in machine learning helps high-energy physicists handle
the enormous amounts of data produced by the Large Hadron Collider.
Everything about the Large Hadron Collider (LHC), the particle accelerator most famous for the Nobel Prize-winning discovery of the elusive Higgs boson, is massive, from its
sheer size to the grandeur of its ambition to unlock some of the most fundamental secrets of the universe. At 27 kilometers (17 miles) in circumference,
the accelerator is easily the largest machine in the world. This size enables
the LHC, housed deep beneath the
ground at CERN (the European Organization for Nuclear Research) near Geneva, to accelerate protons to speeds
infinitesimally close to the speed of
light, thus creating proton-on-proton
collisions powerful enough to recreate
miniature Big Bangs.
The data about the output of these
collisions, which is processed and analyzed by a worldwide network of computing centers and thousands of scientists,
is measured in petabytes: for example,
one of the LHC's main pixel detectors, the ultra-durable high-precision cameras that capture information about these collisions, records an astounding 40 million pictures per second, far too much to store in its entirety.
This is the epitome of big data, yet
when we think of big data, and of the
machine-learning algorithms used to
make sense of it, we usually think of applications in text processing and computer vision, and of uses in marketing
by the likes of Google, Facebook, Apple, and Amazon. "The center of mass in applications is elsewhere, outside of the physical and natural sciences," says Isabelle Guyon of the University of Paris-Saclay, who is the university's chaired professor in big data. "So even though physics and chemistry are very important applications, they don't get as much attention from the machine learning community."
[Figure: Workers insert a new CMS Beam Pipe during maintenance on the Large Hadron Collider.]
Guyon, who is also president of ChaLearn.org, a non-profit that organizes machine-learning competitions, has worked to shift data scientists' attention toward the needs of
high-energy physics. The Higgs Boson Machine Learning Challenge that
she helped organize in 2014, which
officially required no knowledge of
particle physics, had participants sift
data from hundreds of thousands of
simulated collisions (a small dataset by LHC standards) to infer which
collisions contained a Higgs boson,
the final subatomic particle from the
Standard Model of particle physics for
which evidence was observed.
The Higgs Boson Machine Learning Challenge, despite a modest top
prize of $7,000, attracted over 1,000
contenders, and in the end physicists
were able to learn a thing or two from
the data scientists, such as the use of
cross-validation to avoid the problem
of overfitting a model to just one or
two datasets, according to Guyon. But
even before this high-profile competition, high-energy physicists working at
LHC had been using machine-learning tools to hasten their research. The
process starts with the trigger system,
which immediately after each collision determines if information about
the event is worth keeping (most is discarded), and goes out to the detector
level, where machine learning helps
reconstruct events. Farther down the
line, machine learning aids in optimal
data placement by predicting which of
the datasets will become hot; replicating these datasets across sites ensures
researchers across the Worldwide LHC
Computing Grid (a global collaboration
of more than 170 computer centers in
42 countries) have continuous access to
even the most popular data. "There are billions of examples [of machine learning] that are in this game; it is everywhere in what we do," Donegà says.
At the heart of the collider's efforts,
of course, is the search for new particles. In data processing terms, that
search is a classification problem, in
which machine-learning techniques
such as neural networks and boosted
decision trees help physicists tease
out scarce and subtle signals suggesting new particles from the vast background, or the multitude of known,
and therefore uninteresting, particles
coming out of the collisions. "That is a difficult classification problem," says
University of California, Irvine computer science professor Pierre Baldi, an
ACM Fellow who has applied machine
learning to problems in physics, biology, and chemistry.
"Because the signal is very faint, you have a very large amount of data, and the Higgs boson [for example] is a very rare event, you're really looking for needles in a haystack," Baldi explains, using most researchers' go-to metaphor for the search for rare particles. He contrasts this classification problem with the much more prosaic task of having a computer distinguish male faces from female faces in a pile of images; that is obviously a classification problem, too, and classifying images by gender can, in borderline cases, be tricky, but most of the time it's relatively easy.
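The following minimal Python sketch (not code from any LHC experiment; the dataset, features, and parameters are invented for illustration) conveys the flavor of such a classifier: a boosted-decision-tree model separating a rare "signal" class from a dominant "background" on synthetic data, evaluated with cross-validation, the practice Guyon credits the competition's data scientists with reinforcing.

# Illustrative sketch only: boosted decision trees on synthetic
# "signal vs. background" data, scored with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for event data: 20 features, "signal" is only 5% of events.
X, y = make_classification(n_samples=20000, n_features=20, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

# Cross-validation guards against overfitting to a single train/test split;
# ROC AUC is a reasonable score when the positive class is rare.
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))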
These days, the LHC has no shortage of tools to meet the challenge. For
example, one algorithm Donegà and
his colleagues use focuses just on background reduction, a process whose goal
is to squeeze down the background as
much as possible so the signal can
Spiropulu believes machine learning will enable physicists to push the frontier of their field beyond the Standard Model of particle physics.
Education | DOI:10.1145/2933418
Lawrence M. Fisher
Booming Enrollments
The Computing Research Association works to quantify the extent,
and causes, of a jump in undergraduate computer science enrollments.
[Figure: Survey of 123 doctoral departments (about two-thirds public) and 70 non-doctoral departments (about two-thirds private) on enrollment increases in introductory, mid-level, and upper-level courses for majors and non-majors, with responses ranging from "No noticeable increase" to "Having big impact with significant challenges to unit."]
class sizes and, if so, what is the impact on quality of instruction? Are they
increasing faculty sizes? What other
strategies are being used?
The hope, Davidson said, is that
answers to these questions will give
university administrators and computing departments insights into the extent of the boom, and enable them to
develop better strategies for managing
booming enrollments.
The study includes data acquired
from several sources, including sources involved in the annual CRA Taulbee
survey (the principal source of infor-
http://inside.mines.edu/~tcamp/SIGCSE2016_Boom.pdf
Among the preliminary results to
be gleaned from the institutions surveyed (key findings will be presented at
CRA Snowbird 2016, with a final report
planned for the fall):
About two-thirds of 123 doctoral departments and one-third of 70 non-doctoral departments surveyed reported
increasing undergraduate enrollments
were having a big impact on them, resulting in significant challenges.
About 80% of doctoral programs reported significant increases in demand
for introductory courses in a CS or CE
major; less than half of participating
non-doctoral programs reported similarly significant increases.
Increases in undergraduate enrollments were seen as creating problems in at least 40% of the departments surveyed. Most (78%) doctoral
departments had issues with classroom space, followed by the availability of sufficient faculty (69%), sufficient
teaching assistants (61%), and faculty
workloads (61%). In non-doctoral programs, the most frequently reported
concerns were sufficient faculty (44%)
and faculty workload (42%).
In response to those concerns,
more than 80% of doctoral departments increased the size of classes and
the number of sections offered during
the academic year. More than 40% of
non-doctoral departments reported increasing class size, and more than 60%
reported increasing the number of sections offered in a school year.
In terms of staffing, more than
70% of doctoral departments reported increased use of undergraduate
teaching assistants, while more than
60% reported the increased use of adjuncts/visiting faculty, having graduated students teach, or increasing
the teaching faculty. More than 40%
of non-doctoral departments reported increased use of adjuncts/visiting
faculty; another 42% said they would
like to expand their tenure-track faculty, but cannot.
In the context of diversity, no adverse effects on recruitment or retention were reported, but only 35%40%
of responding departments said they
explicitly consider the impact on diversity when choosing actions. Diversity
concerns have not prevented or nullified
The survey responses are giving us a lot of information on how universities are handling the boom and what the biggest concerns are.
[Figure: Survey response options included "I didn't enjoy the professor's teaching style" and "It was too challenging," with results broken out for women.]
Society | DOI:10.1145/2933414
Keith Kirkpatrick
Legal Advice on
the Smartphone
New apps help individuals contest traffic, parking tickets.
"The city has always given individuals the ability to fight their tickets, whether through the mail, in person, or online," says Christian Fama, a co-owner of WinIT and an executive at Empire Commercial Services. "If a ticket isn't dismissible, the judge doesn't dismiss it." However, the app, which allows
one to see what types of defenses are
available, provides users a way to easily
evaluate whether a challenge to a ticket
is likely to be successful without going
through the entire process of contesting a ticket.
If WinIT gets the ticket dismissed,
the user pays the company 50% of the
value of the fine; if the ticket is not dismissed, the user simply needs to pay
the fine in full, and owes WinIT no fee.
WinIT began its testing phase in
March 2015, and became available for
public use three months later. WinIT
currently processes hundreds of tickets
a day, and credits the success of the app
to Empires decades of success with
fighting tickets for commercial clients.
Fama says WinIT's success rate has
exceeded his initial expectations, and
Screenshots from WinIT (left) and Fixed, two apps that help users contest parking tickets.
issues, including business start-up issues, immigration issues, name changes, or other relatively straightforward
topics that generally still require some
legal input or review.
Natacha Gaymer-Jones, the founder
of LegalTap, says the app facilitates a
15-minute consultation with a lawyer
for just $39, with the option to schedule
a more in-depth conversation at a later
date. The application was launched on
iOS and Android in June 2015, and as
of February 2016, has generated more
than 1,500 calls on topics ranging from
parking ticket issues and immigration
issues to business issues and more.
"We wanted to provide people with a way to address quick legal issues, by providing an app that's accessible and also at a price point that was understandable," Gaymer-Jones says, noting the
application also helps lawyers quickly
vet potential clients to see if they are a
good fit, and provides an easy way for
lawyers to make money without giving
away a free hour of legal consultation.
The app features a pre-programmed
algorithm that matches the type of
query with lawyers that have signed up
to be part of the LegalTap network. The
system is designed to match only attorneys qualified in the appropriate area
of law to clients' needs; a question on
felt accessible parking violations were
not adequately being addressed by law
enforcement. However, the apps appeal
extends beyond those who are directly or
indirectly affected by disabilities.
"We have people who don't identify with the disability community; they just see this as an issue," Marsh says. "Typically, they download the app because they are angry when they see a violation, and they see there's no other way to adequately address the issue, since these parking violations are often of low priority to police departments."
Once downloaded, the app asks
users to capture three photos of an
alleged violation (one from the front
capturing the lack of an accessible-parking placard, one from the rear to
capture the license plate, and a third
photo capturing the violation itself).
Marsh says the app makes it easier
to collect verified instances of accessible parking violations, since it captures photos and stamps them with
the phones geolocation data, along
with time and date information.
Parking Mobility currently has
more than 500,000 users worldwide,
broken into two groups. Casual users
can be located anywhere, and their violation reports are automatically collected by Parking Mobility, which then
passes along the data to the relevant
municipalities to highlight the prob-
viewpoints
DOI:10.1145/2935878
Legally Speaking
Apple v. Samsung and
the Upcoming Design
Patent Wars?
Assessing an important recent design
patent infringement court decision.
a product in which the design is embodied, or only those profits that are
attributable to the infringement. The
Supreme Court has decided to address
the second issue, but not the first.
Apple v. Samsung
The most recent of several litigations
between Apple and Samsung involves
three design patents that cover specific
portions of the external configuration
of smartphone designs: a black rectangular round-cornered front face for the
device; a substantially similar rectangular round-cornered front face with a
surrounding rim or bezel; and a colorful grid of 16 icons to be displayed on
a screen.
Apple sued Samsung for infringing
these patents, as well as for infringing
trade dress rights in the external design
of the iPhone. A jury found that Samsung had infringed both Apples design
patents and trade dress rights and ordered Samsung to pay $930 million in
damages for these infringements.
Pamela Samuelson
The Court of Appeals for the Federal
Circuit (CAFC) overturned the trade
dress claim saying that the external
design of the Apple smartphone was
too functional to qualify for trade dress
protection. The rounded corners and
bezel were, it ruled, designed to make
smartphones more pocketable and
to protect against breakage when the
phone was dropped. The icon displays
promote usability by communicating
to the user which functionalities they
can invoke by touching the icons.
The Supreme Court has held that
trade dress, such as product configurations, is too functional to be protectable if it is essential to the use or
purpose of the article or if it affects the
cost or quality of the article, or would
put competitors at a significant non-reputation-related disadvantage.
Under this standard, the Apple trade
dress claims failed.
However, the CAFC affirmed the design patent infringement ruling, rejecting Samsung's argument that the same
features the CAFC thought were too
functional for trade dress protection
made them ineligible for design patenting too. The CAFC disagrees with
the proposition that the functionality
test for design patents is the same as
for trade dress.
Somewhat to Samsung's relief, the
CAFC ordered the damage award to
be cut to $399 million. But even this
amount, Samsung argues, is excessive because it represents all of the
profits that Samsung has made in
selling the phones embodying the
patented designs.
viewpoints
as a whole. Although Samsung's arguments have some merit, the Supreme
Court has decided not to review either
the functionality issue or the proper
test for infringement in the Apple v.
Samsung case.
Total Profits As Windfall
An even more urgent concern driving
Samsung's plea for Supreme Court review arises from the CAFC's approval
of an award of all of its profits from
selling the smartphones that infringed
those three design patents.
The CAFC acknowledged that a total
profits award for infringement of the
design patents in this case was difficult
to justify as a matter of equity. However, it held that the statute required approval of a total profits award.
It relied on this part of the relevant statute, 35 U.S.C. §289: "Whoever during the term of a patent for a design, without license of the owner applies the patented design to any article of manufacture shall be liable to the owner to the extent of his total profit, but not less than $250."
The statute plainly speaks about
total profit as a suitable award for
infringement of a design patent. But
what is the relevant article of manufacture?
The CAFC decided the relevant article of manufacture in Apple v. Samsung was the smartphone itself, not
just the subparts covered by the design patents. After all, no one would
buy only the design-patented screen
with icons or round-shaped rim with
a bezel. People buy a whole smartphone. This explains why the CAFC
thought the smartphone was the article of manufacture whose total profits courts must award when design
patents are infringed.
This interpretation of design patent awards is inconsistent with principles that guide damage awards in
other types of IP cases. Had Samsung
infringed a utility patent, a copyright,
or protectable trade dress, an award
of monetary damages would be based
on the harm that was attributable to
the infringement. A total profits award
for utility patent infringement, for instance, would only be available if there
was evidence the patented feature was
responsible for the market demand for
the product embodying it.
ent too? Would a second award of total profits be fair, or would the first
patentee's windfall have exhausted
the available damages?
More concretely, consider this hypothetical. Apple owns a design patent
on the musical note icon for smartphones. Samsung is not charged with
infringing that patent. But suppose
the only design claim against Samsung
pertained to that patent.
An IP professors' amicus brief in
support of Samsungs petition (written by Stanfords Mark Lemley and
joined by yours truly) pointed out that
it would not be reasonable to award the
same $399 million in total profits for
infringement of that one patent.
Demand for iPhones is driven by
many factors. But the music icon is
a very small part of the value of any
smartphone that might embody it. Proportionality should apply in all awards
for monetary compensation for infringing IP rights.
Several amicus briefs filed in
support of Samsungs petition for
Supreme Court review warned that
if the Court failed to overturn the
total profits award ruling in this
case, this would set off a new round
of patent troll litigations. This would
be harmful to innovation and competition in high-tech industries, especially given the low quality of some
issued design patents.
Conclusion
The Supreme Court has decided to
review the total profits recovery question raised by Samsung. On that point,
Samsung seems likely to prevail. Differentiating between ornamental
and functional elements of designs
for articles of manufacture is a trickier
matter, but surely the test for infringement of a design patent should focus
on that design rather than products
as a whole. Unfortunately, the Court
decided not to review this important
question. The Apple v. Samsung case is
a very important one. Resolved well,
it will mitigate design patent wars in
high-tech industries. Resolved badly,
it will surely spark such wars.
Pamela Samuelson ([email protected]) is the
Richard M. Sherman Distinguished Professor of Law and
Information at the University of California, Berkeley.
Copyright held by author.
DOI:10.1145/2935880
Thomas Haigh
Historical Reflections
How Charles Bachman
Invented the DBMS,
a Foundation of
Our Digital World
His 1963 Integrated Data Store set the template for all subsequent
database management systems.
Fifty-three years ago, a small
team working to automate
the business processes of the
General Electric Company
built the first database management system. The Integrated Data
Store (IDS) was designed by Charles W. Bachman, who won the ACM's 1973
A.M. Turing Award for the accomplishment. Before General Electric, he had
spent 10 years working in engineering,
finance, production, and data processing for the Dow Chemical Company.
He was the first ACM A.M. Turing
Award winner without a Ph.D., the
first with a background in engineering rather than science, and the first
to spend his entire career in industry
rather than academia.
Some stories, such as the work of
Babbage and Lovelace, the creation of
the first electronic computers, and the
emergence of the personal computer
industry have been told to the public
again and again. They appear in popular books, such as Walter Isaacson's
recent The Innovators: How a Group of
Hackers, Geniuses and Geeks Created
the Digital Revolution, and in museum
exhibits on computing and innovation. In contrast, perhaps because database management systems are rarely
experienced directly by the public,
Figure 1. This image, from a 1962 internal General Electric document, conveyed the idea
of random access storage using a set of pigeon holes in which data could be placed.
database history has been largely neglected. For example, the index of Isaacson's book does not include entries for "database" or for any of the four people to have won Turing Awards in
this area: Charles W. Bachman and Edgar F. Codd (1981), James Gray (1988),
or Michael Stonebraker (2014).
That's a shame, because if any technology was essential to the rebuilding of our daily lives around digital
infrastructures, which I assume is
what Isaacson means by the "Digital Revolution," then it was the database
management system. Databases undergird the modern world of online
information systems and corporate
intranet applications. Few skills are
more essential for application developers than a basic familiarity with SQL,
the standard database query language,
and a database course is required for
most computer science and information systems degree programs. Within
ACM, SIGMOD (the Special Interest Group for Management of Data) has
a long and active history fostering database research. Many IT professionals
center their entire careers on database
technology: the census bureau estimates the U.S. alone employed 120,000
database administrators in 2014 and
predicts faster than average growth for
this role.
Bachman's IDS was years ahead of
its time, implementing capabilities
that had until then been talked about
but never accomplished. Detailed functional specifications for the system
were complete by January 1962, and
Bachman was presenting details of the
planned system to his team's in-house
customers by May of that year. It is less
clear from archival materials when the
system first ran, but Bachman tells me
that a prototype installation of IDS was
tested with real data in the summer of
1963, running twice as fast as a custom-built manufacturing control system
performing the same tasks.
The details of IDS, Bachman's life
story, and the context in which it arose
have been explored elsewhere.2,6 In this
column, I focus on two specific questions:
Why do we view IDS as the first database management system, and
What were its similarities and differences versus later systems?
There will always be an element
If any technology was essential to the rebuilding of our daily lives around digital infrastructures, it was the database management system.
magnetic tape much as they had done
to punched cards, except that tape
storage made sorting much harder.
The formats of tape files were usually
fixed by the code of the application
programs working with the data. Every time a field was added or changed
all the programs working with the file
would need to be rewritten. If applications were integrated, for example, by
treating order records from the sales
accounting system as input for the production scheduling application, the
resulting web of dependencies made
it increasingly difficult to make even
minor changes when business needs
shifted.
The other key challenge was making effective use of random access storage in business application programs.
Sequential tape storage was conceptually simple, and the tape drives themselves provided some intelligence to
aid programmers in reading or writing records. Applications were batch-oriented because searching a tape to
find or update a particular record was
too slow to be practical. Instead, master files were periodically updated with
accumulated data or read through to
produce reports. With the arrival, in
the early 1960s, of disk storage a computer could theoretically apply updates one at a time as new data came
in and generate reports as needed
based on current data. Indeed this was
the target application of IBMs RAMAC
computer, the first to be equipped
with a hard disk drive. A programmer
working with a disk-based system
could easily instruct the disk drive to
pull data from any particular platter
or track, but the hard part was figuring
out where on the disk the desired record could be found. The phrase "data base" was associated with random access storage but was not particularly well established, so Bachman's alternative choice of "data store" would not have seemed any more or less familiar
at the time.
Without significant disk file management support from the rudimentary
operating systems of the era only elite
programmers could hope to create an
efficient random access application.
Mainstream application programmers
were beginning to shift from assembly
language to high-level languages such
as COBOL, which included high-level
Figure 2. This drawing, from the 1962 presentation "IDS: The Information Processing Machine We Need," shows the use of chains to connect records. The programmer looped
through GET NEXT commands to navigate between related records until an end-of-set
condition is detected.
Calendar
of Events
July 4–8
MobiHoc '16: The 17th ACM
International Symposium on
Mobile Ad Hoc Networking and
Computing,
Paderborn, Germany,
Sponsored: ACM/SIG,
Contact: Falko Dressler
Email: [email protected]
July 5–8
LICS '16: 31st Annual ACM/
IEEE Symposium on Logic in
Computer Science,
New York, NY,
Contact: Eric Koskinen
Email: [email protected]
July 9–13
ITiCSE '16: Innovation and
Technology in Computer
Science Education
Conference 2016,
Arequipa, Peru,
Sponsored: ACM/SIG,
Contact: Alison Clear
Email: [email protected]
July 10–13
HT '16: 27th ACM Conference
on Hypertext and Social Media,
Halifax, NS, Canada,
Sponsored: ACM/SIG,
Contact: Eelco Herder,
Email: [email protected]
July 11–13
SPAA '16: 28th ACM Symposium
on Parallelism in Algorithms
and Architectures,
Pacific Grove, CA,
Co-Sponsored: ACM/SIG
July 11–13
SCA '16: The ACM SIGGRAPH/
Eurographics Symposium on
Computer Animation,
Zurich, Switzerland,
Sponsored: ACM/SIG,
Contact: Matthias Teschner
Email: teschner@informatik.
uni-freiburg.de
July 13–17
UMAP '16: User Modeling,
Adaptation and Personalization
Conference,
Halifax, NS, Canada,
Co-Sponsored: ACM/SIG,
Contact: Julita Vassileva,
Email: [email protected]
1970s and 1980s, and commercial database management systems based on
this approach were among the most
successful products of the mushrooming packaged software industry.
Bachman spoke memorably in
his 1973 Turing Award lecture of the "Programmer as Navigator," charting a path through the database from
one record to another.3 The network
approach used in IDS required programmers to work with one record
at a time. Performing the same operation on multiple records meant
retrieving a record, processing and if
necessary updating it, and then moving on to the next record of interest
to repeat the process. For some tasks
this made programs longer and more
cumbersome than the equivalent in a
relational system, where a task such as
deleting all records more than a year
old or adding 10% to the sales price of
every item could be performed with a
single command.
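To make the contrast concrete, here is a minimal, hypothetical Python/SQLite sketch (the table, columns, and values are invented, and IDS itself predates SQL and used its own record-at-a-time commands): the set-oriented tasks mentioned above become single declarative statements, while the navigational style visits one record at a time in the spirit of GET NEXT.

# Illustrative sketch: set-at-a-time (relational) versus record-at-a-time
# (navigational) processing. Names and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, sales_price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [("widget", 10.0), ("gadget", 25.0), ("gizmo", 4.0)])

# Relational style: one declarative command changes every record in the set.
conn.execute("UPDATE items SET sales_price = sales_price * 1.10")

# Navigational style, in the spirit of IDS's GET NEXT (not its actual syntax):
# fetch a record, process it, move to the next, until the end-of-set condition.
cursor = conn.execute("SELECT name, sales_price FROM items")
row = cursor.fetchone()
while row is not None:
    print(row)               # here the "processing" is just printing the record
    row = cursor.fetchone()

conn.close()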
IDS and other network systems
encoded what we now think of as the
"joins" between different kinds of
records as part of the database structure rather than specifying them in
each query and rebuilding them when
the query is processed (see Figure 2).
Bachman introduced a data structure diagramming technique, often called the "Bachman diagram," to describe these relationships.c Hardcoding the relationships between record sets made IDS
much less flexible than later relational systems, but also much simpler to implement and more efficient
for routine operations.
IDS was a useful and practical tool
for business use from the mid-1960s,
while relational systems were not commercially significant until the early
1980s. Relational systems did not become feasible until computers were
orders of magnitude more powerful
than they had been in 1963 and some
extremely challenging implementation issues had been overcome by pioneers such as IBMs System R group
and Berkeleys INGRES team. Even
after relational systems were commer-
c C.W. Bachman, "Data Structure Diagrams," Data Base 1, 2 (Summer 1969), 4–10, was very
influential in spreading the idea of data structure diagrams, but internal GE documents
make clear he was using a similar technique
as early as 1962.
DOI:10.1145/2935882
Jacob Metcalf
Computing Ethics
Big Data Analytics
and Revision of
the Common Rule
Reconsidering traditional research ethics
given the emergence of big data analytics.
Big data is
researchers' due diligence in identifying and ameliorating potential physiological, psychological, and informational harms to human subjects. The
Common Rule grew out of a regulatory
process initiated by the 1974 National
Research Act, a response to public
scandals in medical and psychological research, including the Nuremberg Doctors' Trial, the Tuskegee syphilis study, and the Milgram experiment on
obedience to authority figures. The
Act led to a commission on human-subjects research ethics that produced
the Belmont Report (1979). The Belmont authors insisted that certain core
philosophical principles must guide
research involving human subjects:
respect for persons, beneficence, and
justice. The HHS developed the specific regulations in the Common Rule as
an instantiation of those principles.12
Importantly, the Belmont authors understood that not all activities that produce knowledge or intervene in human lives are "research," and not all research about humans is sensitive or personal enough to be about "human subjects." To delimit human-subjects research within biomedicine, the Belmont commission considered "the boundaries between biomedical and behavioral research and the accepted and routine practice of medicine."12 This boundary reflects the ethical difficulties posed by the unique social roles of physician-researchers, who are responsible for both patient health and the societal well-being fostered by research knowledge. This unique role creates ethical dilemmas that are often not reflected in other disciplines. "Research," as defined by the Belmont Report, is "an activity designed to test an hypothesis, permit conclusions to be drawn, and thereby to develop or contribute to generalizable knowledge." "Practice" is "interventions that are designed solely to enhance the well-being of an individual patient or client and that have a reasonable expectation of success."12
Not surprisingly, the first draft of the Common Rule came under attack from social scientists for lumping together all forms of human-subjects research under a single set of regulations that reflect the peculiarities of biomedical research.2 Not all research has the same risks and norms as biomedicine. A single set of rules might snuff out legitimate lines of inquiry, even those dedicated to social justice ends. The HHS responded by creating an "Exempt" category that allowed human-subjects research with minimal risk to receive expedited ethics review. Nevertheless, there has remained a low-simmering conflict between social scientists and IRBs. This sets the stage for debates over regulating research involving big data. For example, in her analysis of the Facebook emotional contagion controversy, Michelle Meyer argues that big data research, especially algorithmic A/B testing without clear temporal boundaries or hypotheses, clouds the distinction between practice and research.8,11 Jacob Metcalf and Kate Crawford agree this mismatch exists, but argue that core norms of human-subjects research regulations can still be applied to big data research.10
Big Data and
the Common Rule Revisions
The Common Rule has typically not been applied to the core disciplines of big data (computing, mathematics, and statistics) because these disciplines are assumed to be conducting research on systems, not people. Yet big data has brought these disciplines into much closer intellectual and economic contact with sensitive human data, opening discussion about how the Common Rule applies. The assumptions behind the NPRM leaving big data science out of its purview are empirically suspect.
Excluded: A New Category
Complaints about inconsistent application of the "exempt" category have prompted HHS to propose a new category of "excluded" research that would automatically receive no ethical review due to inherently low risk to human subjects (___.101(b)(2)). Of particular interest is the exclusion of research involving the collection or study of information that has been or will be acquired solely for non-research activities, or was acquired for research studies other than the proposed research study when the sources are publicly available, or the information is recorded by the investigator in such a manner that human subjects cannot be identified, directly or through identifiers linked to
Conclusion
The NPRM improves the Common Rule's application to big data research, but portions of the NPRM with consequences for big data research rest on dated assumptions. The contentious history of the Common Rule is due in part to its influence on the tone and agenda of research ethics even outside of its formal purview. This rare opportunity for significant revisions should not cement problematic assumptions into the discourse of ethics in big data research.
References
1. boyd, d. and Crawford, K. Critical questions for big data. Information, Communication & Society 15, 5 (2012), 662–679.
2. Committee on Revisions to the Common Rule for the Protection of Human Subjects; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on National Statistics; et al. Proposed Revisions to the Common Rule for the Protection of Human Subjects in the Behavioral and Social Sciences, 2014; http://www.nap.edu/read/18614/chapter/1.
3. Department of Health and Human Services. Code of Federal Regulations Title 45 (Public Welfare), Part 46 (Protection of Human Subjects), 2009; http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.html.
4. Department of Health and Human Services. Notice of Proposed Rule Making: Federal Policy for the Protection of Human Subjects. Federal Register, 2015; http://www.gpo.gov/fdsys/pkg/FR-2015-09-08/pdf/2015-21756.pdf.
5. Hauge, M.V. et al. Tagging Banksy: Using geographic profiling to investigate a modern art mystery. Journal of Spatial Science (2016), 1–6.
6. King, J.L. Humans in computing: Growing responsibilities for researchers. Commun. ACM 58, 3 (Mar. 2015), 31–33.
7. Kitchin, R. Big data, new epistemologies and paradigm shifts. Big Data & Society 1, 1 (2014).
8. Kramer, A., Guillory, J., and Hancock, J. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111, 24 (2014), 8788–8790.
9. Metcalf, J. Letter on Proposed Changes to the Common Rule. Council for Big Data, Ethics, and Society, 2016; http://bdes.datasociety.net/council-output/letter-on-proposed-changes-to-the-common-rule/.
10. Metcalf, J. and Crawford, K. Where are human subjects in big data research? The emerging ethics divide. Big Data & Society 3, 1 (2016), 1–14.
11. Meyer, M.N. Two cheers for corporate experimentation: The a/b illusion and the virtues of data-driven innovation. Colorado Technology Law Journal 13, 273 (2015).
12. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research, 1979; http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.html.
13. Zwitter, A. Big data ethics. Big Data & Society 1, 2 (2014).
Jacob Metcalf ([email protected]) is a Researcher
at the Data & Society Research Institute, and Founding
Partner at the ethics consulting firm Ethical Resolve.
This work is supported in part by National Science
Foundation award #1413864. See J. Metcalf Letter on
Proposed Changes to the Common Rule. Council for Big
Data, Ethics, and Society (2016)9 for the public comment
on revisions to the Common Rule published collectively by
the Council for Big Data, Ethics and Society. This column
represents only the authors opinion.
DOI:10.1145/2838729
Toby Walsh
Viewpoint
Turing's Red Flag
A proposal for a law to prevent artificial intelligence systems from being mistaken for humans.
MOVIES CAN BE a good place to see what the future looks like. According to Robert Wallace, a retired director of the CIA's Office of Technical Service: "When a new James Bond movie was released, we always got calls asking, 'Do you have one of those?' If I answered no, the next question was, 'How long will it take you to make it?' Folks didn't care about the laws of physics or that Q was an actor in a fictional series; his character and inventiveness pushed our imagination."3 As an example, the CIA successfully copied the shoe-mounted spring-loaded and poison-tipped knife in From Russia With Love. It's interesting to speculate on what else Bond movies may have led to being invented.
For this reason, I have been considering what movies predict about the
future of artificial intelligence (AI). One
theme that emerges in several science
fiction movies is that of an AI mistaken
for human. In the classic movie Blade
Runner, Rick Deckard (Harrison Ford)
tracks down and destroys replicants
that have escaped and are visually indistinguishable from humans. Tantalizingly, the film leaves open the
question of whether Rick Deckard is
himself a replicant. More recently,
the movie Ex Machina centers around
a type of Turing Test in which the robot Ava tries to be convincingly human
enough to trick someone into helping
her escape. And in Metropolis, one of
the very first science fiction movies
ever, a robot disguises itself as the
woman Maria and thereby causes the
workers to revolt.
The 19th-century U.K. Locomotive Act, also known as the Red Flag Act, required motorized
vehicles to be preceded by a person waving a red flag to signal the oncoming danger.
it nevertheless has placed the idea of
computers emulating humans firmly
in our consciousness.
As any lover of Shakespeare knows, there are many dangers awaiting us when we try to disguise our identity. What happens if the AI impersonates someone we trust? Perhaps they will be able to trick us into doing their bidding. What if we suppose they have human-level capabilities but they can only act at a sub-human level? Accidents might quickly follow. What happens if we develop a social attachment to the AI? Or worse still, what if we fall in love with them? There is a minefield of problems awaiting us here.
This is not the first time in history that a technology has come along that might disrupt and endanger our lives. Concerned about the impact of motor vehicles on public safety, the U.K. parliament passed the Locomotive Act in 1865. This required a person to walk in front of any motorized vehicle with a red flag to signal the oncoming danger. Of course, public safety wasn't the only motivation for this law, as the railways profited from restricting motor vehicles in this way. Indeed, the law clearly restricted the use of motor vehicles to a greater extent than safety alone required. And this was a bad thing. Nevertheless, the sentiment was a good one: until society had adjusted to the arrival of a new technology, the public had a right to be forewarned of potential dangers.
Interestingly, this red flag law was withdrawn three decades later, in 1896, when the speed limit was raised to 14mph (approximately 23kph). Coincidentally, the first speeding offense, as well as the first British motoring fatality, that of the unlucky pedestrian Bridget Driscoll, also occurred in that same year. And road accidents quickly escalated from then on. By 1926, the first year for which records are available, there were 134,000 cases of serious injury, yet there were only 1,715,421 vehicles on the roads of Great Britain. That is one serious injury each year for every 13 vehicles on the road. And a century later, thousands still die on our roads every year.
Inspired by such historical precedents, I propose that a law be enacted
to prevent AI systems from being mistaken for humans. In recognition of
might make better caregivers and companions for the elderly. However, there are many more reasons we don't want computers to be intentionally or unintentionally fooling us. Hollywood provides lots of examples of the dangers awaiting us here. Such a law would, of course, cause problems in running any sort of Turing Test. However, I expect that the current discussion about replacements for the Turing Test will eventually move from tests for AI based on deception to tests that quantify explicit skills and intelligence. Some related legislation has been put into law for guns. In particular, former California Governor Schwarzenegger signed legislation in September 2004 that prohibits the public display of toy guns in California unless they are clear or painted a bright color to differentiate them from real firearms. The purpose of this law is to prevent police officers from mistaking toy guns for real ones.
The second part of the law states that autonomous systems need to identify themselves at the start of any interaction with another agent. Note that this other agent might even be another AI. This is intentional. If you send your AI bot out to negotiate the purchase of a new car, you want the bot also to know whether it is dealing with a dealer bot or a person. You wouldn't want the dealer bot to be able to pretend to be a human just because it was interacting with your bot. The second part of the law is designed to reduce the chance that autonomous systems are accidentally mistaken for what they are not.
Consider four up-and-coming areas where this law might have bite. First, consider autonomous vehicles. I find it a real oversight that the first piece of legislation that permits autonomous vehicles on roads, the AB 511 act in Nevada, says nothing at all about such vehicles being identified to other road users as autonomous. A Turing Red Flag law, on the other hand, would require an autonomous vehicle to identify itself as autonomously driven, both to human drivers and to other autonomous vehicles. There are many situations where it could be important to know that another road vehicle is being driven autonomously. For example, when a light changes we can suppose that an autonomous vehicle approaching the light will
tended to give the game away. However, now that the technology is shrinking and autonomous cars are starting to look more normal, it will become increasingly difficult to know when a car is being driven autonomously. There is also a reverse argument. Suppose autonomous vehicles are (or become) better drivers than humans. Then, just as today we demand that less experienced and capable learner drivers be identified, we might demand that human drivers be identified apart from autonomous drivers. A Turing Red Flag law will achieve this.
Let's move on to the second area where a Turing Red Flag law might start having impact shortly. Consider virtual assistants like Apple's Siri or Microsoft's Cortana. Here is a recent conversation I had with Siri:
Siri. What can I help you with?
Q. Are you a computer?
Siri. I can neither confirm nor deny my current existential status.
Q. Are you a human?
Siri. I'm an assistant. That's all that matters.
Q. Are you an AI?
Siri. That's a rather personal question.
Based on conversations like these, it would appear that Siri is coming close to violating this proposed Turing Red Flag law. It begins its conversations without identifying itself as a computer, and it answers in a way that, depending on your sense of humor, might deceive. At least, in a few years' time, when the dialogue is likely to be more sophisticated, you can imagine being deceived. Of course, few if any people are currently deceived into believing that Siri is human. It would take only a couple of questions for Siri to reveal that it is not human. Nevertheless, it is a dangerous precedent to have technology like this in everyday use on millions of smartphones pretending, albeit poorly, to be human.
There are also several more trusting groups that could already be deceived. My five-year-old daughter has a doll that uses a Bluetooth connection to Siri to answer general questions. I am not so sure she fully appreciates that it is just a smartphone doing all the clever work here. Another troubling group is patients with Alzheimer's disease and other forms of dementia. Paro is a cuddly robot seal that has been trialed as a therapeutic tool to help such patients. Again, some people find it troubling that a robot seal can be mistaken for the real thing. Imagine, then, how much more troubling society is going to find it when such patients mistake AI systems for humans.
Let's move on to a third example, online poker. This is a multibillion-dollar industry, so it is fair to say that the stakes are high. Most, if not all, online poker sites already ban computer bots from playing. Bots have a number of advantages, certainly over weaker players. They never tire. They can compute odds very accurately. They can track historical play very accurately. Of course, in the current state of the art, they also have disadvantages, such as understanding the psychology of their opponents. Nevertheless, in the interest of fairness, I suspect most human poker players would prefer to know if any of their opponents was not human. A similar argument could be made for other online computer games. You might want to know if you're being killed easily because your opponent is a computer bot with lightning-fast reflexes.
I conclude with a fourth example:
computer-generated text. Associated
Press now generates most of its U.S.
corporate earnings reports using a
computer program developed by Automated Insights.1 A narrow interpretation might rule such computer-generated text outside the scope of a Turing
Red Flag law. Text-generation algorithms are typically not autonomous.
Indeed, they are typically not interactive. However, if we consider a longer
time scale, then such algorithms are
interacting in some way with the real
DOI:10.1145/2838730
Viewpoint
Inverse Privacy
We are interested in scenarios where a person interacts with an institution, for example, a shop, a medical office, or a government agency. We say that an infon x is personal to an individual P if (a) x is related to an interaction between P and an institution and (b) x identifies P. A typical example of such an infon is a receipt for a credit-card purchase by a customer in a shop.
Define the personal infoset of an individual P to be the collection of all infons personal to P. Note that the infoset evolves over time. It acquires new infons. It may also lose some infons. But, because of the tangibility restriction, the infoset is finite at any given moment.
Q: Give me an example of an intangible infon.
A: A fleeting impression that you have of someone who just walked by you.
Q: What about information announced but not recorded at a meeting? One can argue that the collective memory of the participants is a kind of embodiment.
A: Such cases of unrecorded information become less and less common. People write notes, write and send email messages, tweet, use their smartphones to make videos, and so forth. Companies tend to tape their meetings. Numerous sensors, such as cameras and microphones, are commonplace and growing in pervasiveness, even in conference rooms. But yes, there are border cases as far as tangibility is concerned. At this stage of our analysis, we abstract them away.
Q: In the shopping receipt example,
the receipt may also mention the salesclerk that helped the customer.
A: The clerk represents the shop on
the receipt.
Q: But suppose that something
went wrong with that particular purchase, the customer complained that
the salesclerk misled her, and the shop
investigates. In the new context, the
person of interest is the salesclerk. The
same infon turns out to be personal to
more than one individual.
A: This is a good point. The same infon may be personal to more than one
individual but we are interested primarily in contexts where the infon in
question is personal to one individual.
A good solution to the problem should not only provide you accessibility to your inversely private information but should also make the access convenient.
Classification
The personal infoset of an individual P naturally splits into four buckets.
1. The directly private bucket comprises the infons that P has access to but nobody else does.
2. The inversely private bucket comprises the infons that some party has access to but P does not.
3. The partially private bucket comprises the infons that P has access to and a limited number of other parties do as well.
4. The public bucket comprises the infons that are public information.
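As a concrete illustration, the minimal sketch below (not from the authors; the data representation is invented for the example) classifies an infoset into the four buckets by asking who holds each infon.

```python
# Sketch: splitting a personal infoset into the four buckets described above.
# Each infon is a (label, holders, is_public) triple, where holders is the set
# of parties with access and is_public marks publicly available information.
def classify_infoset(infoset, p):
    buckets = {"directly private": [], "inversely private": [],
               "partially private": [], "public": []}
    for infon, holders, is_public in infoset:
        if is_public:
            buckets["public"].append(infon)
        elif holders == {p}:
            buckets["directly private"].append(infon)    # only P has access
        elif p not in holders:
            buckets["inversely private"].append(infon)   # some party has it, P does not
        else:
            buckets["partially private"].append(infon)   # P plus a limited number of others
    return buckets

# Example: a credit-card receipt held by both the customer and the shop lands
# in the partially private bucket; the shop's loyalty profile is inversely private.
example = [("receipt", {"P", "shop"}, False),
           ("loyalty-profile", {"shop"}, False),
           ("diary-entry", {"P"}, False),
           ("property-record", {"P", "registry"}, True)]
print(classify_infoset(example, "P"))
```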
Q: Why do you call the second bucket inversely private?
A: The Merriam-Webster dictionary defines inverse as "opposite in order, nature, or effect." The description of bucket 2 is the opposite of that of bucket 1.
Q: As far as I can see, you discuss
just two dimensions of privacy: whom a
given infon is personal to, and who has
access to the infon. The world is more
complex, and there are other dimensions to privacy. Consider for example
the pictures in the directly private
bucket of my infoset that are personal
to me only. Some of the pictures are
clearly more private than others; there
are degrees of privacy.
A: Indeed, we restrict attention to
the two dimensions. But this restricted
view is informative, and it allows us to
carry on our analysis. Recall that we
concentrate here on the big picture
leaving many finer points for later
analysis.
on Protecting Consumer Privacy in an Era of Rapid Change is more nuanced: "Companies should provide reasonable access to the consumer data they maintain; the extent of access should be proportionate to the sensitivity of the data and the nature of its use."8 To this end, we posit:
The Inverse Privacy Entitlement Principle. As a rule, individuals are entitled to access their personal infons. There may be exceptions, but each such exception needs to be justified, and the burden of justification is on the proponents of the exception.
One obvious exception is related to
national security. The proponents of
that exception, however, would have to
justify it. In particular, they would have
to justify which parts of national security fall under the exception.
The Inverse Privacy Problem
We say that an institution shares back your personal infons if it gives you access to them. This technical term will make the exposition easier. Institutions may be reluctant to share back personal information, and they may have reasonable justifications: the privacy of other people needs to be protected, there are security concerns, and there are competition concerns. But there are numerous safe scenarios where the chances are negligible that sharing back your personal infons would violate the privacy of another person or damage the legitimate interests of the information-holding institution or any other institution.
The inverse privacy problem is the inaccessibility to you of your personal information in such safe scenarios.
Q: Give me examples of safe scenarios.
A: Your favorite supermarket has plentiful data about your shopping there. Do you have that data?
Q: No, I don't.
A: But, in principle, you could have. So how can sharing that data with you hurt anybody? Similarly, many other businesses and government institutions have information about you that you could in principle have but in fact do not. Some institutions share a part of your inversely private information with you, but only a part. For example, Fitbit sends you weekly summaries, but they have much more information about you.
Q: As you mentioned earlier, institutions have not only raw data about
me but also derived information. I can
imagine that some of that derived information may be sensitive.
A: Yes, there may be a part of your
inversely private information that is
too sensitive to be shared with you. Our
position is, however, that the burden
of proof is on the information-holding
institution.
Q: You use judicial terminology. But
who is the judge here?
A: The ultimate judge is society.
Q: Let me raise another point. Enabling me to access my inversely private information makes it easier for intruders to find information about me.
A: This is true. Any technology invented to allow inversely private information to be shared back has to be made secure. Communication channels have to be secure, encryption has to be secure, and so forth. Note, however, that today hackers are in a much better position to find the inversely private information about you than you are. Sharing that information with you should improve the situation.
Going Forward
As we pointed out previously, the inverse privacy problem is not simply
the result of ill will of governments
or businesses. It is primarily a side effect of technological progress. Technology influences the social norms of
privacy.17 In the case of inverse privacy, technology created the problem,
and technology is instrumental in
solving it. Here, we argue that the inverse privacy problem can be solved
and will be solved. By default we restrict attention to the safe scenarios
described previously.
Social norms. Individuals would
greatly benefit from gaining access
to their inversely private infons. They
will have a much fuller picture of their
FTC report. Here are some additional laws and FTC reports of relevance:
A 2000 report of an FTC Advisory Committee on providing online consumers reasonable access to personal information collected from and about them by domestic commercial Web sites, and maintaining adequate security for that information;7
The 2003 Fair and Accurate Credit Transactions Act, providing consumers with annual free credit reports from the three nationwide consumer credit reporting companies;5
California's "Shine the Light" law of 2003, according to which a business cannot disclose your personal information secretly from you to a third party for direct marketing purposes;2 and
A 2014 FTC report that calls for laws making data broker practices more transparent and giving consumers greater control over their personal information.9
Clearly the law favors transparency and facilitates your access to your inversely private infons.
Market forces. The sticky point is whether companies will share back our personal information. This information is extremely valuable to them. It gives them competitive advantages, and so it may seem implausible that companies will share it back. We contend that companies will share back personal information because it will be in their business interests.
Sharing back personal information can be competitively advantageous as well. Other things being equal, wouldn't you prefer to deal with a company that shares your personal infons with you? We think so. Companies will compete on (a) how much personal data, collected and derived, is shared back and (b) how conveniently that data is presented to customers.
The evolution toward sharing back personal information seems slow. This will change. Once some companies start sharing back personal data as part of their routine business, the competitive pressure will quickly force their competitors to join in. The competitors will have little choice.
There is money to be made in solving the inverse privacy problem. As sharing back personal information gains ground, the need will arise to mine large amounts of customers' personal data
DOI:10.1145/2909493
Should You Upload or Ship Big Data to the Cloud?
IT IS ACCEPTED wisdom that when the data you wish
over the Internet is closer than you might think.
To illustrate, let's say you have 1TB of business data to migrate to cloud storage from your self-managed datacenter. You are signed up for a business plan with your ISP that guarantees you an upload speed of 50Mbps and a download speed of 10 times as much. All you need to do is announce a short system-downtime window and begin hauling your data up to the cloud. Right?
Not quite.
For starters, you will need a whopping 47 hours to finish uploading 1TB of data at a speed of 50Mbps, and that's assuming your connection never drops or slows down.
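The exact figure depends on protocol overhead and on whether 1TB means 10^12 or 2^40 bytes; a raw back-of-the-envelope conversion (a sketch, not the author's exact arithmetic) lands in the same neighborhood either way.

```python
# Hours to upload a given number of bytes at a given sustained uplink speed.
def upload_hours(data_bytes, uplink_mbps):
    return data_bytes * 8 / (uplink_mbps * 1e6) / 3600

print(round(upload_hours(10**12, 50), 1))   # ~44.4 hours if 1TB = 10**12 bytes
print(round(upload_hours(2**40, 50), 1))    # ~48.9 hours if 1TB = 2**40 bytes
```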
If you upgrade to a faster upload plan, say 100Mbps, you can fin-
[Figure 1. Simplified view of the data flow when copying data from a server to a storage appliance: data moves from the source disk through the server's disk controller and host controller to the host (over SATA, SAS, PCI Express, Thunderbolt, and the like), then over a connection such as USB, WiFi, Ethernet NIC, Thunderbolt, PCI Express-to-Gigabit Ethernet, PCI Express-to-Fibre Channel, optical fiber, copper cable, or wireless, to the storage appliance's host, host controller, and disk controller; a directly pluggable drive takes the direct path between server and appliance.]
[Figure 2. Theoretical maximum data-transfer speeds of recent controller interface editions: SATA Revision 3, 6Gbps;17 SAS-3, 12Gbps;10 USB 3.1, 10Gbps;20 Thunderbolt 2, 20Gbps.1]
months to upload. Uploading one petabyte at 800Mbps should keep you going for four months.
It's time to consider an alternative.
Ship It!
That alternative is copying the data to a storage appliance and shipping the appliance to the datacenter, where the data is copied into cloud storage. This is the Ship It! strategy. Under what circumstances is this a viable alternative to uploading the data directly into the cloud?
The mathematics of shipping data.
When data is read out from a drive, it
travels from the physical drive hardware (for example, the HDD platter)
to the on-board disk controller (the
electronic circuitry on the drive).
From there the data travels to the host
controller (a.k.a. the host bus adapter, a.k.a. the interface card) and finally to the host system (for example,
the computer with which the drive is
interfaced). When data is written to
the drive, it follows the reverse route.
When data is copied from a server to
a storage appliance (or vice versa), the
data must travel through an additional
physical layer, such as an Ethernet or
USB connection existing between the
server and the storage appliance.
Figure 1 is a simplified view of the
data flow when copying data to a storage appliance. The direction of data
flow shown in the figure is conceptually reversed when the data is copied
out from the storage appliance to the
cloud server.
Note that often the storage appliance may be nothing more than a
single hard drive, in which case the
data flow from the server to this drive
is basically along the dotted line in
the figure.
Given this data flow, a simple way to express the time needed to transfer the data to the cloud using the Ship It! strategy is shown in Equation 1, where:
Vcontent is the volume of data to be transferred in megabytes (MB).
SpeedcopyIn is the sustained rate in MBps (megabytes per second) at which data is copied from the source drives to the storage appliance. This speed is essentially the minimum of three speeds: the speed at which the controller reads data out of the source drive and transfers it to the host computer with which it interfaces; the speed at which the storage appliance's controller receives data from its interfaced host and writes it into the storage appliance; and the speed of data transfer between the two hosts. For example, if the two hosts are connected over a Gigabit Ethernet or a Fibre Channel connection, and the storage appliance is capable of writing data at 600MBps, but the source drive and its controller can emit data at only 20MBps, then the effective copy-in speed can be at most 20MBps.
SpeedcopyOut is similarly the sustained rate in MBps at which data is copied out of the storage appliance and written into cloud storage.
Ttransit is the transit time for the shipment via the courier service from source to destination, in hours.
Toverhead is the overhead time in hours. This can include the time required to buy the storage devices (for example, tapes), set them up for data transfer, pack and create the shipment, and drop it off at the shipper's location. At the receiving end, it includes the time needed to process the shipment received from the shipper, store it temporarily, unpack it, and set it up for data transfer.
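Assembling these terms gives a formula of the following shape (a reconstruction from the definitions above; the published Equation 1 may group the terms differently, with the factor of 3,600 converting MBps into MB per hour):

\[
(\mathrm{Transfer\ Time})_{hours} = \frac{V_{content}}{3600\,Speed_{copyIn}} + \frac{V_{content}}{3600\,Speed_{copyOut}} + T_{transit} + T_{overhead} \tag{1}
\]

A direct transcription in Python, for plugging in your own numbers:

```python
# Sketch of Equation 1 as reconstructed above (data size in MB, copy speeds
# in MBps, transit and overhead in hours).
def ship_it_hours(v_content_mb, speed_copy_in, speed_copy_out, t_transit_h, t_overhead_h):
    copy_in_h = v_content_mb / speed_copy_in / 3600
    copy_out_h = v_content_mb / speed_copy_out / 3600
    return copy_in_h + copy_out_h + t_transit_h + t_overhead_h
```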
The use of sustained data-transfer rates. Storage devices come in a variety of types such as HDD, SSD, and LTO. Each type is available in different configurations, such as a RAID (redundant array of independent disks) of HDDs or SSDs, or an HDD-SSD combination where one or more SSDs are used as a fast read-ahead cache for the HDD array. There are also many different data-transfer interfaces such as SCSI (Small Computer System Interface), SATA (Serial AT Attachment), SAS (Serial Attached SCSI), USB, PCI (Peripheral Component Interconnect) Express, Thunderbolt, and so on. Each of these interfaces supports a different theoretical maximum data-transfer speed.
Figure 2 lists the data-transfer speeds supported by a recent edition of some of these controller interfaces.
The effective copy-in/copy-out speed while copying data to/from a storage appliance depends on a number of factors:
Type of drive. For example, SSDs are usually faster than HDDs, partly because of the absence of any moving parts. Among HDDs, higher-RPM drives can exhibit lower seek times than lower-RPM drives. Similarly,
[Three plots of data-transfer time (hours) against data size (terabytes).]
point drops sharply to 59 terabytes. That means if the content size is 59TB or higher, it will be quicker just to ship the data to the cloud provider than to upload it using a fiber-based ISP running at 800Mbps.
Figure 5 shows the comparative growth in data-transfer time between uploading it at 800Mbps versus copying it at a 320MBps transfer rate and shipping it overnight.
This analysis brings up two key questions:
If you know how much data you wish to upload, what is the minimum sustained upload speed your ISP must provide, below which you would be better off shipping the data via overnight courier?
If your ISP has promised you a certain sustained upload speed, beyond what data size will shipping the data be a quicker way of hauling it up to the cloud than uploading it?
[Two plots against data size (terabytes), with separate curves labeled 160Mbps, 240Mbps, 320Mbps, and 600Mbps.]
Also assume there is 1TB of data to be transferred to the cloud.
The aforementioned substitution reveals that unless the ISP provides a sustained upload speed (Speedupload) of at least 34.45Mbps, the data can be transferred faster using a Ship It! strategy that involves an LTO-6 tape-based data transfer running at 160MBps and a shipping-and-handling overhead of 64 hours.
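Those numbers can be checked against Equation 1 as reconstructed earlier. The quick Python check below assumes 1TB is counted as 2^20 MB and that the 64 hours of overhead already include shipment transit.

```python
# Back-of-the-envelope check of the worked example above (assumptions:
# 1TB = 2**20 MB; the 64-hour overhead includes transit time).
V = 2**20                   # data size in MB
copy_speed = 160.0          # MBps, LTO-6 drive-to-drive copy-in and copy-out
overhead_h = 64.0           # shipping and handling, in hours

ship_h = 2 * V / copy_speed / 3600 + overhead_h     # total Ship It! time, ~67.6 hours
threshold_mbps = 8 * V / 3600 / ship_h              # upload speed that ties with shipping
print(round(ship_h, 1), round(threshold_mbps, 2))   # -> 67.6 34.45
```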
Figure 6 shows the relationship
between the volume of data to be
transferred (in TB) and the minimum
sustained ISP upload speed (in Mbps)
needed to make uploading the data as
fast as shipping it to the datacenter.
For very large data sizes, the threshold
ISP upload speed becomes less sensitive to the data size and more sensitive
to the drive-to-drive copy-in/copy-out
speeds with which it is competing.
Now let's attempt to answer the second question. This time, assume Speedupload (in Mbps) is the maximum sustained upload speed that the ISP can provide. What is the maximum data size beyond which it will be quicker to ship the data to the datacenter? Once again, recall that Equation 1 helps estimate the time required, (Transfer Time)hours, to ship the data to the datacenter for a given data size (Vcontent MB) and given drive-to-drive copy-in/copy-out speeds. If you were instead to upload Vcontent MB at Speedupload Mbps over a network link, you would need 8 × Vcontent/(3,600 × Speedupload) hours. At a certain threshold value of Vcontent, these two transfer times (shipping versus upload) become equal. Equation 1 can be rearranged to express this threshold data size, as illustrated in Equation 2.
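Carrying out that rearrangement on the reconstructed Equation 1 gives a threshold of the following shape (again a reconstruction; the published Equation 2 may be arranged differently):

\[
V_{content} = \frac{T_{transit} + T_{overhead}}{\dfrac{8}{3600\,Speed_{upload}} - \dfrac{1}{3600\,Speed_{copyIn}} - \dfrac{1}{3600\,Speed_{copyOut}}} \tag{2}
\]

The denominator vanishes exactly when 8/Speedupload equals the combined per-MB copy time 1/SpeedcopyIn + 1/SpeedcopyOut, which is presumably what the discussion below denotes Tdatacopy.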
Figure 7 shows the relationship between this threshold data size and the available sustained upload speed from the ISP, for different values of drive-to-drive copy-in/copy-out speeds.
Equation 2 also shows that, for a given value of drive-to-drive copy-in/copy-out speed, the upward trend in Vcontent continues up to a point where Speedupload = 8/Tdatacopy, beyond which Vcontent becomes infinite, meaning it is no longer possible to ship the data more quickly than simply uploading it to the cloud, no matter how gargantuan the data size. In this case, unless you switch to a
Data can be transferred at various levels of granularity, such as logical objects, buckets, byte blobs, files, or simply a byte stream.
provide one of the lowest cost-to-storage ratios, compared with other options such as HDD or SSD storage. It's easy to see, however, that the total cost of tape cartridges becomes prohibitive for storing terabyte-and-beyond content sizes. One option is to store data in a compressed form. LTO-6, for example, can store up to 6.25TB per tape18 in compressed format, thereby leading to fewer tape cartridges. Compressing the data at the source and uncompressing it at the destination, however, further reduces the effective copy-in/copy-out speed of LTO tapes, or for that matter of any other storage medium. As explained earlier, a low copy-in/copy-out speed can make shipping the data less attractive than uploading it over a fiber-based ISP link.
But what if the cloud-storage provider loaned the storage appliance to you? This way, the provider can potentially afford to use higher-end options such as high-end SSDs or a combination HDD-SSD array in the storage appliance, which would otherwise be prohibitively expensive to purchase just for the purpose of transferring data. In fact, that is exactly the approach Amazon appears to have taken with its Amazon Web Services (AWS) Snowball.3 Amazon claims that up to 50TB of data can be copied from your data source into the Snowball storage appliance in less than one day. This performance characteristic translates into a sustained data-transfer rate of at least 600MBps. This kind of data-transfer rate is possible only with very high-end SSD/HDD drive arrays with read-ahead caches operating over a fast interface such as SATA Revision 3, SAS-3, or PCI Express, and a Gigabit Ethernet link out of the storage appliance.
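As a rough sanity check on that claim (a sketch; it assumes 1TB = 10^6 MB and a full 24-hour copy window):

```python
# 50TB copied in under a day implies a sustained rate of at least ~580MBps,
# consistent with the roughly 600MBps figure quoted above.
print(round(50 * 10**6 / (24 * 3600)))   # -> 579 (MBps)
```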
In fact, the performance characteristics of AWS Snowball closely resemble those of a high-performance NAS (network-attached storage) device, complete with a CPU, on-board RAM, built-in data encryption services, a Gigabit Ethernet network interface, and a built-in control program, not to mention a ruggedized, tamper-proof construction. The utility of services such as Snowball comes from the cloud provider making a very high-performance (and expensive) NAS-like device available to users to rent to copy files into and out of the provider's cloud. Other major cloud providers such as Google and Microsoft are not far behind in offering such capabilities. Microsoft requires you to ship SATA II/III internal HDDs for importing or exporting data into/from the Azure cloud and provides the software needed to prepare the drives for import or export.16 Google, on the other hand, appears to have outsourced the data-copy service to a third-party provider.8
One final point on cost: unless your data is in a self-managed datacenter, the source cloud provider will usually charge you for data egress,4,5,12,15 whether you do a disk-based copy-out of the data or a cloud-to-cloud data transfer. These charges are usually levied on a per-GB, per-TB, or per-request basis. There is usually no data-ingress charge levied by the destination cloud provider.
Conclusion
If you wish to move big data from one location to another over the Internet, there are a few options available, namely: uploading it directly using an ISP-provided network connection; copying it into a storage appliance and shipping the appliance to the new storage provider; and, finally, cloud-to-cloud data transfer.
Which technique you choose depends on a number of factors: the size of the data to be transferred, the sustained Internet connection speed between the source and destination servers, the sustained drive-to-drive copy-in/copy-out speeds supported by the storage appliance and the source and destination drives, the monetary cost of data transfer, and, to a smaller extent, the shipment cost and transit time. Some of these factors result in the emergence of threshold upload speeds and threshold data sizes that fundamentally influence which strategy you would choose. Drive-to-drive copy-in/copy-out times have enormous influence on whether it is attractive to copy and ship data, as opposed to uploading it over the Internet, especially when competing with an optical fiber-based Internet link.
Related articles
on queue.acm.org
How Will Astronomy Archives Survive the
Data Tsunami?
G. Bruce Berriman and Steven L. Groom
http://queue.acm.org/detail.cfm?id=2047483
Condos and Clouds
Pat Helland
http://queue.acm.org/detail.cfm?id=2398392
Why Cloud Computing Will Never Be Free
Dave Durkee
http://queue.acm.org/detail.cfm?id=1772130
References
1. Apple. Thunderbolt, 2015; http://www.apple.com/
thunderbolt/.
2. Amazon Web Services. Global infrastructure,
2015; https://aws.amazon.com/about-aws/globalinfrastructure/.
3. Amazon. AWS Import/Export Snowball, 2015;
https://aws.amazon.com/importexport/.
4. Amazon. Amazon S3 pricing. https://aws.amazon.
com/s3/pricing/.
5. Google. Google cloud storage pricing; https://cloud.
google.com/storage/pricing#network-pricing.
6. Google. Cloud storage transfer service, 2015; https://
cloud.google.com/storage/transfer/.
7. Google. Google fiber expansion plans; https://fiber.
google.com/newcities/.
8. Google. Offline media import/export, 2015; https://
cloud.google.com/storage/docs/offline-mediaimport-export.
9. Herskowitz, N. Microsoft named a leader in Gartners
public cloud storage services for second consecutive
year, 2015; https://azure.microsoft.com/en-us/blog/
microsoft-named-a-leader-in-gartners-public-cloudstorage-services-for-second-consecutive-year/.
10. SCSI Trade Association. Serial Attached SCSI
Technology Roadmap, (Oct. 14, 2015); http://www.
scsita.org/library/2015/10/serial-attached-scsitechnology-roadmap.html
11. IEEE. 802.3: Ethernet standards; http://standards.
ieee.org/about/get/802/802.3.html.
12. Microsoft. Microsoft Azure data transfers pricing
details; https://azure.microsoft.com/en-us/pricing/
details/data-transfers/.
13. Ookla. Americas fastest ISPs and mobile networks,
2015; http://www.speedtest.net/awards/us/kansascity-mo.
14. PCI-SIG. Press release: PCI Express 4.0 evolution
to 16GT/s, twice the throughput of PCI Express 3.0
technology; http://kavi.pcisig.com/news_room/Press_
Releases/November_29_2011_Press_Release_/.
15. Rackspace. Rackspace public cloud pay-as-you-go
pricing, 2015; http://www.rackspace.com/cloud/
public-pricing.
16. Shahan, R. Microsoft Corp. Use the Microsoft Azure
import/export service to transfer data to blob storage,
2015; https://azure.microsoft.com/en-in/documentation/
articles/storage-import-export-service/.
17. The Serial ATA International Organization. SATA
naming guidelines, 2015; https://www.sata-io.org/
sata-naming-guidelines.
18. Ultrium LTO. LTO-6 capacity data sheet; http://www.
lto.org/wp-content/uploads/2014/06/ValueProp_
Capacity.pdf.
19. Ultrium LTO. LTO-6 performance data sheet;
http://www.lto.org/wp-content/uploads/2014/06/
ValueProp_Performance.pdf.
20. USB Implementers Forum. SuperSpeed USB (USB
3.0) performance to double with new capabilities,
2013; http://www.businesswire.com/news/
home/20130106005027/en/SuperSpeed-USB-USB3.0-Performance-Double-Capabilities.
Sachin Date (https://in.linkedin.com/in/sachindate)
looks after the Microsoft and cloud applications portfolio
of e-Emphasys Technologies. Previously, he worked as
a practice head for mobile technologies, an enterprise
software architect, and a researcher in autonomous
software agents.
Copyright held by author.
Publication rights licensed to ACM $15.00.
DOI:10.1145/2909468
The Small Batches Principle
Q: WHAT DO DevOps people mean when they talk about small batches?
A: To answer that, let's take a look at a chapter from the upcoming book The Practice of System and Network Administration, due out later this year.
One of the themes the book explores is the small batches principle: it is better to do work in small batches than big leaps. Small batches permit us to deliver results faster, with higher quality and less stress.
I begin with an example that has nothing to do with system administration in order to demonstrate the general idea. Then I focus on three IT-specific examples to show how the method applies and the benefits that follow.
If something happens infrequently, there's always an excuse to put off automating it. Also, there would be fewer changes to the infrastructure between releases. If an infrastructure change did break the release automation, it would be easier to fix the problem.
The change did not happen overnight. First the developers changed
their methodology from mega releases
with many new features, to small iterations, each with a few specific new
features. This was a big change, and
selling the idea to the team and management was a long battle.
Meanwhile, the operations team
automated the testing and deployment processes. The automation
could take the latest code, test it, and
deploy it into the beta-test area in less
than an hour. The push to production
was still manual, but by reusing code
for the beta rollouts it became increasingly less manual over time.
The result was that the beta area was updated multiple times a day. Since it was automated, there was little reason not to. This made the process continuous, instead of periodic. Each code change triggered the full testing suite, and problems were found in minutes rather than in months.
Pushes to the production area happened monthly because they required
coordination among engineering, marketing, sales, customer support, and other groups. That said, all of these teams loved the transition from an unreliable, mostly-every-six-months schedule to a reliable monthly schedule. Soon these teams started initiatives to attempt weekly releases, with hopes of moving to daily releases. In the new small-batch world the following benefits were observed:
Features arrived faster. While in
the past a new feature took up to six
months to reach production, now it
could go from idea to production in
days.
Hell month was eliminated. After
hundreds of trouble-free pushes to
beta, pushing to production was easy.
The operations team could focus
on higher-priority projects. The team
was no longer directly involved in software releases other than fixing the automation, which was rare. This freed up
the team for more important projects.
There were fewer impediments to
fixing bugs. The first step in fixing a
bug is to identify which code change
was responsible. Big-batch releases
had hundreds or thousands of changes
to sort through to identify the guilty
party. With small batches, it was usually quite obvious where to find the bug.
Bugs were fixed in less time. Fixing a bug in code that was written six
months ago is much more difficult
than fixing a bug in code while it is
still fresh in your mind. Small batches
meant bugs were reported soon after
the code was written, which meant developers could fix them more expertly
in a shorter amount of time.
Developers experienced instant
gratification. Waiting six months to
see the results of your efforts is demoralizing. Seeing your code help people
shortly after it was written is addictive.
Most importantly, the operations
team could finally take long vacations,
the kind that require advance planning
and scheduling, thus giving them a way
to reset and live healthier lives.
While these technical benefits are
worthwhile, the business benefits are
even more exciting:
Their ability to compete improved. Confidence in the ability to
add features and fix bugs led to the
company becoming more aggressive
about new features and fine-tuning
issues were fixed. Code was changed,
better pretests were developed, and
drills gave each member of the SRE
(site reliability engineering) team a
chance to learn the process. Eventually
the overall process was simplified and
easier to automate. The benefits Stack
Overflow observed included:
Fewer surprises. The more frequent the drills, the smoother the process became.
Reduced risk. The procedure was
more reliable because there were fewer
hidden bugs waiting to bite.
Higher confidence. The company
had more confidence in the process,
which meant the team could now focus
on more important issues.
Bugs fixed faster. The smaller accumulation of infrastructure and code
changes meant each drill tested fewer
changes. Bugs were easier to identify
and faster to fix.
Bugs fixed during business hours.
Instead of having to find workarounds
or implement fixes at odd hours when
engineers were sleepy, they were
worked on during the day when engineers were there to discuss and implement higher-quality fixes.
Better cross training. Practice
makes perfect. Operations team members all had a turn at doing the process
in an environment where they had help
readily available. No person was a single point of failure.
Improved
process documentation and automation. Documentation
improved while the drill was running.
Automation was easier to write because the repetition helped the team
see what could be automated or what
pieces were most worth automating.
New opportunities revealed. The
drills were a big source of inspiration
for big-picture projects that would radically improve operations.
Happier developers. There was less
chance of being woken up at 4 a.m.
Happier operations team. The fear
of failovers was reduced, leading to
less stress. More people trained in the
failover procedure meant less stress
on the people who had previously been
single points of failure.
Better morale. Employees could schedule long vacations again.
The concept of activating the failover procedure on a system that was working perfectly may seem odd, but it is better to discover bugs and other problems in a controlled situation than during an emergency.
however, it would be easy to throw
away the mistakes. This would enable
the team to pivot, meaning they could
change direction based on recent results. It is better to pivot early in the development process than to realize well
into it that you have built something
nobody likes.
Google calls this "launch early and often." Launch as early as possible, even if that means leaving out most of the features and launching to only a few select users. What you learn from the early launches informs the decisions later on and produces a better service in the end.
Launching early and often also gives you the opportunity to build operational infrastructure early. Some companies build a service for a year and then launch it, informing the operations team only a week prior. IT then has little time to develop operational practices such as backups, on-call playbooks, and so on. Therefore, those things are done badly. With the launch-early-and-often strategy, you gain operational experience early and you have enough time to do it right.
This is also known as the MVP strategy. As defined by Eric Ries in 2009, "The minimum viable product is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort" (Minimum Viable Product: A Guide; http://www.startuplessonslearned.com/2009/08/minimum-viable-product-guide.html). In other words, rather than focusing on new functionality in each release, focus on testing an assumption in each release.
The team building the monitoring system adopted the launch-early-and-often strategy. They decided that each iteration, or small batch, would be one week long. At the end of the week they would release what was running in their beta environment to their production environment and ask for feedback from stakeholders.
For this to work they had to pick very small chunks of work. Taking a cue from Jason Punyon and Kevin Montrose (Providence: Failure Is Always an Option; http://jasonpunyon.com/blog/2015/02/12/providence-failure-is-always-an-option/), they called this "What can get done by Friday?"-driven development.
More importantly, dashboards could be
configured and customized by individual users. They were self-service.
After much discussion, the team
decided to pivot to the other software
package. In the next iteration, they
set up the new software and created
an equivalent set of configurations.
This went very quickly because a lot
of work from the previous iterations
could be reused: the decisions on
what and how to monitor; the work
completed with the network team;
and so on.
By iteration 6, the entire team was
actively using the new software. Managers were setting up dashboards to display key metrics that were important to
them. People were enthusiastic about
the new system.
Something interesting happened
around this time: a major server
crashed on Saturday morning. The
monitoring system alerted the sysadmin team, who were able to fix the
problem before staff arrived at the office on Monday. In the past there had
been similar outages but repairs had
not begun until the sysadmins arrived
on Monday morning, well after most
employees had arrived. This showed
management, in a very tangible way,
the value of the system.
Iteration 7 had the goal of writing
a proposal to move the monitoring
system to physical machines so that
it would scale better. By this time the
managers who would approve such a
purchase were enthusiastically using
the system; many had become quite
expert at creating custom dashboards.
The case was made to move the system
to physical hardware for better scaling
and performance, and to use a duplicate set of hardware for a hot spare site
in another datacenter.
The plan was approved.
In future iterations the system became more valuable to the organization as the team implemented features such as a more sophisticated
on-call schedule, more monitored services, and so on. The benefits of small
batches observed by the sysadmin
team included:
Testing assumptions early prevents
wasted effort. The ability to fail early
and often means the team can pivot.
Problems can be fixed sooner rather
than later.
Summary
Why are small batches better?
Small batches result in happier customers. Features get delivered with less latency. Bugs are fixed faster.
Small batches reduce risk. By testing assumptions, the prospect of future failure is reduced. More people get experience with procedures, which means our skills improve.
DOI:10.1145/2890780
Statistics for Engineers
MODERN IT SYSTEMS collect an increasing wealth of data from network gear, operating systems, applications, and other components. This data needs to be analyzed to derive vital information about the user experience and business performance. For instance, faults need to be detected, service quality needs to be measured, and resource usage of the next days and months needs to be forecast.
barrier of entry. Even worse, these courses often focus on parametric methods,
such as t-tests, that are inadequate for
this kind of analysis since they rely on
strong assumptions on the distribution
of data (for example, normality) that are
not met by operations data.
This lack of relevance of classical,
parametric statistics can be explained
by history. The origins of statistics
reach back to the 17th century, when
computation was expensive and data
was a sparse resource, leading mathematicians to spend a lot of effort to
avoid calculations.
Today the stage has changed radically and allows different approaches
to statistical problems. Consider this
example from a textbook2 used in a
university statistics class:
A fruit merchant gets a delivery of
10,000 oranges. He wants to know how
many of those are rotten. To find out he
takes a sample of 50 oranges and counts
the number of rotten ones. Which deductions can he make about the total number
of rotten oranges?
The chapter goes on to explain various inference methods. The example
translated to the IT domain could go as:
A DB admin wants to know how many
requests took longer than one second to
complete. He measures the duration of
all requests and counts the number of
those that took longer than one second.
Done.
The abundance of computing resources has completely eliminated the
need for elaborate estimations.
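With the measured durations at hand, the count is a one-line computation. A minimal sketch (durations_ms below is a hypothetical list holding one entry per request, in milliseconds; it is not data from the article):

durations_ms = [350, 1200, 980, 1510, 640]                  # example values only
slow_requests = sum(1 for d in durations_ms if d > 1000)    # requests that took longer than one second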
Therefore, this article takes a different approach to statistics. Instead of
presenting textbook material on inference statistics, we will walk through
four sections with descriptive statistical
methods that are accessible and relevant to the case in point. I will discuss
several visualization methods, gain a
precise understanding of how to summarize data with histograms, visit classical summary statistics, and see how
to replace them with robust, quantile-based alternatives. I have tried to keep
prerequisite mathematical knowledge
to a minimum (for example, by provid-
Rug plots are suitable for all questions where the temporal ordering of
the samples is not relevant, such as
common values or outliers. Problems
occur if there are multiple samples with
the same sample value in the dataset.
Those samples will be indistinguishable in the rug plot. This problem can
be addressed by adding a small random
displacement (jitter) to the samples.
Despite its simple and honest character, the rug plot is not commonly
used. Histograms or line plots are used
instead, even if a rug plot would be
more suitable.
Histograms. The histogram is a popular visualization method for one-dimensional data.
[Figures: the request-rate samples plotted against Time-offset in Minutes and as a Sample Density histogram; a scatter plot of Node-1 Request Rate in rps against Node-2 Request Rate in rps.]
Histograms
Histograms in IT operations have two
different roles: as a visualization method and as an aggregation method.
To gain a complete understanding
of histograms, let's start by building
one for the Web request-rate data discussed previously. The listing in Figure
5 contains a complete implementation, discussed step by step here.
1. The first step in building a histogram is to choose a range of values
that should be covered. To make this
choice you need some prior knowledge about the dataset. Minimum and
maximum values are popular choices
in practice. In this example the value
range is [500, 2200].
2. Next the value range is partitioned into bins. Bins are often of
equal size, but there is no need to follow this convention. The bin partition
is represented here by a sequence of
bin boundaries (line 4).
3. Count how many samples of the
given dataset are contained in each
bin (lines 6-13). A value that lies on the
boundary between two bins will be assigned to the higher bin.
4. Finally, produce a bar chart, where
each bar is based on one bin, and the
bar height is equal to the sample count
divided by the bin width (lines 14-16).
The division by bin width is an important normalization, since otherwise
the bar area is not proportional to the
sample count. Figure 5 shows the resulting histogram.
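The listing itself is not reproduced here, but a minimal sketch of the four steps just described could look as follows (the function and variable names are illustrative and are not those of the article's Figure 5):

def histogram_densities(samples, bin_boundaries):
    # Step 3: count how many samples fall into each bin;
    # a value lying on a boundary is assigned to the higher bin.
    counts = [0] * (len(bin_boundaries) - 1)
    for x in samples:
        for i in range(len(counts)):
            lo, hi = bin_boundaries[i], bin_boundaries[i + 1]
            if lo <= x < hi or (x == hi and i == len(counts) - 1):
                counts[i] += 1
                break
    # Step 4: normalize by bin width so that bar area stays proportional to sample count.
    return [counts[i] / (bin_boundaries[i + 1] - bin_boundaries[i])
            for i in range(len(counts))]

# Steps 1 and 2: value range [500, 2200], partitioned into equal-width bins of 100.
boundaries = list(range(500, 2300, 100))
# densities = histogram_densities(request_rates, boundaries)  # request_rates: the dataset in question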
Different choices in selecting the
range and bin boundaries of a histogram
can affect its appearance considerably.
Figure 6 shows a histogram with 100 bins
for the same data. Note that it closely
resembles a rug plot. On the other extreme, choosing a single bin would result in a histogram with a single bar with
a height equal to the sample density.
sults when applied to operations data,
such as request latencies, that contain
many outliers.
Histogram as aggregation method.
When measuring high-frequency data
such as I/O latencies, which can arrive
at rates of more than 1,000 samples per
second, storing all individual samples
is no longer feasible. If you are willing
to forget about ordering and sacrifice
Figure 6. Histogram plot with value range (500, 2,200) and 100 equally sized bins.
[Figure: request latency in ms over time, 16:00 to 0:00.]
[Figure: latency samples in ms on a 6h view range (all samples are visible) and on a multi-day view, Nov. 20 to Nov. 22.]
[Figure: a request latency dataset in ms with the mean value and max deviation marked.]
μ(x_1, …, x_n) = (1/n) Σ_{i=1}^{n} x_i
or when expressed as Python code:
def mean(X): return sum(X) / len(X)
Figure 12. The cumulative distribution function for a dataset of request rates.
Figure 14. Histogram metric with inverse quantile CDF(3ms) over 1H windows.
The mean absolute deviation is defined as mad(x_1, …, x_n) = (1/n) Σ_{i=1}^{n} |x_i − μ|
and is the most direct mathematical
translation of a typical deviation from
the mean.
The standard deviation is defined as
stddev(x_1, …, x_n) = √( (1/n) Σ_{i=1}^{n} (x_i − μ)² )
While the intuition behind this definition is not obvious, this deviation
measure is very popular for its nice
mathematical properties (such as being derived from a quadratic form). In fact,
all three of these deviation measures
fit into a continuous family of p-deviations,11 which features the standard
deviation in a central position.
Figure 11 shows the mean value and
all three deviation measures for a request latency dataset. You can immediately observe the following inequalities:
mad(x_1, …, x_n) ≤ stddev(x_1, …, x_n) ≤ maxdev(x_1, …, x_n)
This relation can be shown to hold
true in general.
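A small sketch of the three deviation measures, written in the style of the article's Python mean() function (population versions, dividing by n; maxdev is taken here to be the maximum absolute deviation from the mean, an assumption consistent with the inequalities above):

def mean(X):
    return sum(X) / len(X)

def maxdev(X):                       # maximum absolute deviation (assumed definition)
    m = mean(X)
    return max(abs(x - m) for x in X)

def mad(X):                          # mean absolute deviation
    m = mean(X)
    return sum(abs(x - m) for x in X) / len(X)

def stddev(X):
    m = mean(X)
    return (sum((x - m) ** 2 for x in X) / len(X)) ** 0.5

X = [1, 2, 3, 100]                   # a single outlier inflates all three measures
assert mad(X) <= stddev(X) <= maxdev(X)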
The presence of outliers affects all
three deviation measures significantly.
Since operations data frequently contains outliers, the use of these deviation measures should be taken with caution or avoided altogether. More robust
methods (for example, interquartile
rangesa) are based on quartiles, which
we discuss below.
Caution with the standard deviation.
Many of us remember the following
rule of thumb from school:
68% of all samples lie within one
standard deviation of the mean.
95% of all samples lie within two
standard deviations of the mean.
99.7% of all samples lie within
three standard deviations of the mean.
These assertions rely on the crucial
assumption that the data is normally distributed. For operations data this is almost never the case, and the rule fails quite drastically: in the previous example, more than 97% of all samples lie within one
standard deviation of the mean value.
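A quick way to check the rule of thumb on one's own data is a small sketch such as the following (using only the standard library; statistics.pstdev is the population standard deviation used above):

import statistics

def fraction_within(X, k=1):
    # Fraction of samples within k standard deviations of the mean.
    mu, sd = statistics.mean(X), statistics.pstdev(X)
    return sum(1 for x in X if abs(x - mu) <= k * sd) / len(X)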
The following war story can be
found in P.K. Janert's book:10
An SLA (service level agreement) for
a database defined a latency outlier as
a value outside of three standard deviations. The programmer who implemented
the SLA check remembered the above rule
a http://en.wikipedia.org/wiki/Interquartile_range
Despite their
abundance, mean
values lead to a
variety of problems
when measuring
performance of
services.
at least q · n samples are less than or equal to y;
at least (1 − q) · n samples are greater than or equal to y.
Familiar examples are the minimum, which is a 0-quantile; the maximum, which is a 1-quantile; and the
median, which is a 0.5-quantile. Common names for special quantiles include percentiles for k/100-quantiles
and quartiles for k/4-quantiles.
Note that quantiles are not unique.
There are ways of making them
unique, but those involve a choice that
is not obvious. Wikipedia lists nine
different choices that are found in
common software products.13 Therefore, if people talk about the q-quantile or the median, one should always
be careful and question which choice
was made.
As a simple example of how quantiles are non-unique, take a dataset with
two values X = [10,20]. Which values are
medians, 0-quantiles, 0.25-quantiles?
Try to figure it out yourself.
The good news is that q-quantiles always exist and are easy to compute. Indeed, let S be a sorted copy of the dataset X such that the smallest element of X is equal to S[0] and the largest element of X is equal to S[n − 1]. If d = floor(q · (n − 1)), then S[d] will have d + 1 samples S[0], …, S[d], which are less than or equal to S[d], and n − d samples S[d], …, S[n − 1], which are greater than or equal to S[d]. It follows that S[d] = y is a q-quantile. The same argument holds true for d = ceil(q · (n − 1)).
The following listing is a Python implementation of this construction:
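The listing is not reproduced here; a minimal sketch of the construction just described (taking d = floor(q · (n − 1))) is:

import math

def quantile(q, X):
    S = sorted(X)                        # S[0] is the smallest, S[len(X) - 1] the largest element
    d = math.floor(q * (len(X) - 1))
    return S[d]

# quantile(0, X) is the minimum, quantile(1, X) the maximum, quantile(0.5, X) a median.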
contributed articles
DOI:10.1145/ 2856103
Formula-Based Software Debugging
PROGRAMMING, THOUGH A creative activity, poses
strict demands on its human practitioners in terms
of precision, and even talented programmers make
mistakes. The effect of a mistake can manifest in
several ways: as a program crash, data corruption, or
unexpected output. Debugging is the task of locating the
root cause of an error from its observable manifestation.
It is a challenge because the manifestation of an error
might become observable in a program's execution
much later than the point at which the error infected
the program state in the first place. Stories abound of
heroic efforts required to fix problems that cropped up
unexpectedly in software that was previously considered
to be working and dependable.
Given the importance of efficient debugging in overall
software productivity, computer-assisted techniques
controlled preservation is achieved through a static approximation of control and data dependencies. Dynamic analysis1 applies the same idea but to an execution of the program on a specific input; the advantage is that only the control and data dependencies observed in that execution are used in the computation of the slice, leading to a more succinct and precise slice.
The second significant idea is delta
debugging7,24 in which a programmer
tries to isolate the cause of a failure by
systematically exploring deviations
from a non-failure scenario. For example, if a new version of code breaks while
the old version works, one can systematically try to isolate the specific change
in the program that can be held responsible for the failure; the same idea also
applies to program executions. Delta
debugging takes advantage of compute
cycles by systematically exploring a
large number of program variations.
The third idea we highlight is statistical fault isolation,12,15 which looks
at execution profiles of passing and
failing tests. If execution of a statement
is strongly correlated (in a statistical
sense) with only the failing tests, it is
ranked highly in its suspiciousness.
Such ideas shift the burden of lo-
Name                   Symbolic Technique    Information from
BugAssist13            Program Formula       Internal inconsistency
Error Invariants10     Interpolants          Internal inconsistency
Angelic Debugging5                           Passing tests
Darwin22                                     Previous version
plication domains, programmers do
not write down detailed specifications
of the intended behavior of their code.
Protocol specifications, even when
available, are usually at a much higher
level than the level of the implementation described here. In this particular
case, when Test 2 fails, how does the
programmer infer where the program
execution turned incorrect?
The execution trace of Test 2 is as
follows
0  hw_set = 0; hw = NULL; ap = NULL;
1  while (i = getopt(....)) {
2    switch (i) {
3      case 'A':
4        ap = getaftype(optarg);
5        hw_set = 1;
6        break;
11   } // exit switch statement
1  while (i = getopt(....)) {
12 } // exit while loop
13 if (hw_set == 0) { // this condition is false
16 assert(hw != NULL); // assertion fails
Fortunately,
it turns out
some notion of
intended behavior
of the program
can be recovered
through
indirect means.
Using Satisfiability
The technique we discuss first is called
BugAssist13 in which the input and
desired output information from a failing test are encoded as constraints.
These input-output constraints are then
conjoined with the program formula,
which encodes the operational semantics of the program symbolically along
all paths; see Figure 2 for an overview
of conversion from a program to a formula. In the example program of Figure
1, we produce the following formula (φ):
φ =  arg[1] = A ∧ arg[2] = INET ∧ arg[3] = NULL
  ∧ hw_set0 = 0 ∧ hw0 = NULL ∧ ap0 = NULL
  ∧ i1 = arg[1] ∧ i1 ≠ NULL
  ∧ guard3 = (i1 == A)
  ∧ ap4 = arg[2]
  ∧ hw_set5 = 1
  ∧ ap11 = guard3 ? ap4 : ap0
  ∧ hw_set11 = guard3 ? hw_set5 : hw_set0
  ∧ i1 = arg[3] ∧ i1 == NULL
  ∧ guard13 = (hw_set11 == 0)
  ∧ hw14 = DEFAULT
  ∧ hw15 = guard13 ? hw14 : hw0
  ∧ hw15 ≠ NULL ∧ hw15 == DEFAULT
The arg elements refer to the input
values, similar to argv of the C language; here, the inputs are for Test 2.
The last line of clauses represents the
expectation of Test 2. The remainder
of the formula represents the program
logic. (For brevity, we have omitted the
parts of the formula corresponding to
the case 'H', as it does not matter for
this input.) The variable i1 refers to the
second time the loop condition while
is evaluated, at which point the loop
exits. We use = to indicate assignment
and == to indicate equality test in the
program, though for the satisfiability of
the formula both have the same meaning.
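The following paragraphs describe how BugAssist analyzes this formula with a MAX-SAT solver. As a rough, simplified sketch of that style of analysis (assuming the z3-solver Python package and an ad hoc integer encoding of the values; BugAssist itself is a separate tool and works differently in detail), one could write:

from z3 import Ints, Bools, If, Optimize, is_false, sat

hw_set0, hw_set5, hw_set11, hw14, hw15, hw0 = Ints('hw_set0 hw_set5 hw_set11 hw14 hw15 hw0')
guard3, guard13 = Bools('guard3 guard13')
NULL, DEFAULT = 0, 1          # hypothetical value encoding, for this sketch only

opt = Optimize()
# Hard constraints: the failing test's input (option 'A', so guard3 holds) and its expectation.
opt.add(guard3, hw0 == NULL, hw15 != NULL, hw15 == DEFAULT)
# Soft constraints: one clause per statement of the trace.
clauses = [
    ('line 0: hw_set = 0',      hw_set0 == 0),
    ('line 5: hw_set = 1',      hw_set5 == 1),
    ('join after the switch',   hw_set11 == If(guard3, hw_set5, hw_set0)),
    ('line 13: guard',          guard13 == (hw_set11 == 0)),
    ('line 14: hw = DEFAULT',   hw14 == DEFAULT),
    ('join after line 13',      hw15 == If(guard13, hw14, hw0)),
]
for _, c in clauses:
    opt.add_soft(c)

assert opt.check() == sat
model = opt.model()
# Any soft clause left unsatisfied is a candidate error location; marking it hard and
# re-solving explores further candidates, mirroring the BugAssist loop described below.
suspects = [name for name, c in clauses if is_false(model.evaluate(c, model_completion=True))]
print(suspects)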
The formula φ, though lengthy, has a one-to-one correspondence to the trace
of Test 2 outlined earlier. Since the
test input used corresponds to a failing
test, the formula is unsatisfiable.
The BugAssist tool tries to infer
what went wrong by trying to make a
large part of the formula satisfiable,
accomplishing it through MAX-SAT19
or MAX-SMT solvers. As the name suggests, a MAX-SAT solver returns the
largest possible satisfiable sub-formula of a formula; the sub-formula omits
Figure 2. Conversion of program to formula; the formula encodes, using guards, possible
final valuations of variables.
1  input y;  // initially x = z = 0
2  if (y > 0) {
3    z = y * 2;
4    x = y - 2;
5    x = x - 2; }
6  if (z == x)
7    output("How did I get here");
8  else if (z > x)
9    output("Error");
In it, variables are given a subscript based on the line on which an instance is assigned.
Guard variables denote conditions that regulate values of variables when potentially different
values of a variable reach a branch point. For example, guard2 regulates the value of z6 based
on whether the default initial value or the assignment to z3 reaches it.
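As a sketch of this guard-based encoding (assuming the z3-solver Python package; the article does not prescribe any particular solver), the program of Figure 2 can be written down directly and queried, for example, for an input that reaches the Error output on line 9:

from z3 import Int, If, And, Not, Solver, sat

y1 = Int('y1')                    # symbolic input y (assigned on line 1); x and z start at 0
x0, z0 = 0, 0
guard2 = y1 > 0                   # branch condition on line 2
z_3 = If(guard2, y1 * 2, z0)      # z after the join point: assignment on line 3 or the initial value
x_4 = If(guard2, y1 - 2, x0)      # x after line 4
x_5 = If(guard2, x_4 - 2, x0)     # x after line 5
guard6 = z_3 == x_5               # condition on line 6
guard8 = z_3 > x_5                # condition on line 8

s = Solver()
s.add(And(Not(guard6), guard8))   # can execution reach output("Error") on line 9?
if s.check() == sat:
    print(s.model())              # e.g., y1 = 1 triggers the Error output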
[Figure 3: an interpolant Ip is computed at a position p of the error trace, which ends in the expected output.]
straint corresponds to a good fix. Realizing this potential regression, the programmer using BugAssist would mark
guard13 = (hw_set11 == 0) as a hard constraint. Marking it as a hard constraint
indicates to the solver it should explore
other lines in the program as possible
causes of error.
The MAX-SAT solver will then highlight another (set of) constraints, say,
hw_set5 = 1, meaning just this constraint can be omitted (or changed) to
make the overall formula satisfiable.
The reader can verify this forces guard13
to be true. This corresponds to identifying line 5 in Figure 1 as the place to be
fixed. Here is the correct fix
logic formulae X and Y, an interpolant
is a formula I satisfying
X => I => Y
The formula I is expressed through
the common vocabulary of X and Y.
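As a small illustration (not from the running example): for X ≡ (x = 1) ∧ (y = x + 1) and Y ≡ (y > 0), the implication X => Y holds, and I ≡ (y = 2) is an interpolant, since X => I, I => Y, and I mentions only the shared variable y.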
Figure 3 outlines the use of interpolants for analyzing error traces. The error trace formula is a logical conjunction of the input valuation, the effect of each program statement on the trace, and the expectation about the output. Given any position p in the trace, if we denote the formula from locations prior to p as X, and the formula from locations in the trace after p as Y, clearly X ∧ Y is false. Thus ¬(X ∧ Y) holds, meaning ¬X ∨ ¬Y holds, meaning X => ¬Y holds. The interpolant Ip at position p in the error trace will thus satisfy X => Ip => ¬Y. Such an interpolant can be computed at every position p in the error trace for understanding the reason behind the observable error.
Let us now understand the role of
interpolants in software debugging.
We first work out our simple example,
then conceptualize use of the logical
formula captured by interpolant in
explaining software errors. In our running example, the interpolants at the
different positions are listed in Table
2, including interpolant formula after
each statement. Note there are many
choices of interpolants at any position in the trace, and we have shown
the weakest interpolant in this simple
example. The trace here again corresponds to the failing execution of Test
2 on the program in Figure 1, and we
used the same statements earlier in
this article on BugAssist.
What role do interpolants play in explaining an error trace? To answer, we
examine the sequence of interpolants
computed for the error trace in our
simple example program, looking at
the second column in Table 2 and considering only non-repeated formulae:
arg[1] = A
arg[1] = A ∧ hw0 = NULL
i1 = A ∧ hw0 = NULL
guard3 = true ∧ hw0 = NULL
guard3 = true ∧ hw0 = NULL ∧ hw_set5 = 1
hw0 = NULL ∧ hw_set11 = 1
hw0 = NULL ∧ guard13 = false
hw15 = NULL
[Table 2. The statements of the failing trace and the interpolant after each statement.]
tions for which changing the expression
would make previously passing tests
fail. It does so without knowing any proposed candidate fix, again because the
landscape of syntactic fixes is so vast;
rather, it works on just the basis of a
candidate fix location.
Consider again the failing execution of Test 2 on the program in Figure 1. We illustrate how the technique
works, focusing first on statement 5.
The technique conceptually replaces
the right-hand-side expression by a
hole denoted by !!, meaning an as-yet-unknown expression
The scalability of
symbolic analysis-based debugging
methods crucially
depends on the
scalability of SMT
constraint solving.
5: hw_set = !!
13: if (!!) {
following fix at line 5 works for tests 1-3
5: hw_set = 0
The astute reader will notice this particular fix is not an ideal fix. Given
another test
Test 4: arp H ETHER A INET
Expected output: ETHER
Actual output: DEFAULT
[Figure: the symbolic execution tree for the program in Figure 2; each node records the symbolic values of x, y, and z together with the accumulated path condition, and branches are labeled e2T, e2F, e2T,6T, e2T,6F, and e2F,6T.]
At line 6, e2T forks into e2T,6T and e2T,6F. The branch e2T,6T adds the constraint z == x to the path condition y > 0 (that is, it requires y * 2 = y − 4), which is unsatisfiable, so the branch is discarded.
Symbolic execution tree construction is similar to the program formula construction in Figure 2. For this reason, it is also called static symbolic execution. The difference is that in the program formula, paths are merged at control-flow join points, whereas in the symbolic execution tree there is no merging.
[Figure: the single given path through the program in Figure 2, line by line (lines 1-6, 8, 9), with the corresponding logical formula for each line.]
Note this form of symbolic execution is similar to the one in Figure 4; the difference is one path
is already given here, so there is no execution tree to be explored. This is sometimes also called
dynamic symbolic execution.
rent program P is taken as the specification against which the behavior of
the failing test t is compared for difference in control flow.
Such methods are based on semantic analysis, rather than a completely syntactic analysis of differences across program versions (such
as running a diff between program
versions). Being based on semantic
analysis these debugging methods
can analyze two versions with substantially different implementations
and locate causes of error.
To illustrate this point, consider the fixed Address Resolution Protocol (ARP) implementation (Figure 1 with line 5 deleted) we discussed earlier as the reference version. This program will pass the test Test 2. Now assume a buggy program implementation with a substantially different programming style but with the intention to accomplish the same ARP functionality (see Figure 6). The test Test 2 fails in this implementation
Test 2:
arp A INET (failing test).
Expected output: DEFAULT
Observed output: INET
In the reference implementation, the path condition of Test 2 is

f ≡ arg[1] = A

The path condition in the buggy implementation is as follows (since i is set to arg[1] and ap is set to arg[2] (via optarg)):

f′ ≡ arg[1] = A ∧ (arg[2] ≠ NULL ∨ arg[1] = H)

The negation of f′ is the following disjunction:

¬f′ ≡ arg[1] ≠ A ∨ ¬(arg[2] ≠ NULL ∨ arg[1] = H)

f ∧ ¬f′ thus has two possibilities to consider, one for each disjunct in ¬f′:

1. arg[1] = A ∧ arg[1] ≠ A
2. arg[1] = A ∧ ¬(arg[2] ≠ NULL ∨ arg[1] = H)
Figure 6. Assume line 5 in Figure 1 was removed yielding the correct program for the code
fragment; here we show a different implementation of the same code.
sion debugging methods of Banerjee
et al.3 and Qi et al.22 localized the error to within 10 lines of code in fewer
than 10 minutes.
Among the techniques covered here, the regression-debugging methods3,22 have shown the greatest scalability, with the other techniques being employed on small-to-moderate-scale programs. Moreover, the scalability of symbolic analysis-based debugging methods crucially
depends on the scalability of SMT
constraint solving.8 Compared to statistical fault-localization techniques,
which are easily implemented, symbolic execution-based debugging
methods still involve more implementation effort, as well as greater
execution time overheads. While we
see much promise due to the growth
in SMT solver technology, as partly
evidenced by the scalability of the
regression-debugging
methods,
more research is needed in symbolic
analysis and SMT constraint solving
to enhance the scalability and applicability of these methods.
Note that for all of the presented debugging methods, user studies with professional programmers are needed to measure the programmer productivity gain that might be realized through
these methods. Parnin and Orso21
highlighted the importance of user
studies in evaluating debugging
methods. The need for user studies
may be even more acute for methods
like BugAssist that provide an iterative exploration of the possible error
causes, instead of providing a final
set of diagnostic information capturing the lines likely to be the causes of
errors. Finally, for the interpolant-based debugging method, the issue
of choosing suitable interpolants
needs further study, a topic being investigated today (such as by Albarghouthi and McMillan2).
Other directions. Related to our
topic of using symbolic execution for
software debugging, we wish to point out that symbolic execution can also be useful
for bug reproduction, as shown by
Jin and Orso.11 The bug-reproduction
problem is different from both test
generation and test explanation. Here,
some hints may be reported about the
failing execution by on-the-field users
in the form of a crash report, and these
contributed articles
DOI:10.1145/ 2854146
Why Google Stores Billions of Lines of Code in a Single Repository
both the interactive use case, or human users, and automated use cases.
Larger dips in both graphs occur during holidays affecting a significant
number of employees (such as Christmas Day and New Year's Day, American Thanksgiving Day, and American
Independence Day).
In October 2012, Google's central repository added support for Windows and Mac users (until then it was Linux-only), and the existing Windows and Mac repository was merged with the main repository. Google's tooling for
repository merges attributes all historical changes being merged to their original authors, hence the corresponding
bump in the graph in Figure 2. The effect of this merge is also apparent in
Figure 1.
The commits-per-week graph shows
the commit rate was dominated by
human users until 2012, at which
point Google switched to a custom-
Figure 1. Millions of changes committed to Google's central repository over time.
[Figure axes and repository statistics: Human commits vs. Total commits per week, Jan. 2010 to Jan. 2015; depth of history, 35 million commits; size of content, 86TB.]
the Google repository was the main
motivation for developing Piper.
Since Google's source code is one of the company's most important assets, security features are a key consideration in Piper's design. Piper supports file-level access control lists. Most of the repository is visible to all Piper users;d however, important configuration files or files including business-critical algorithms can be more tightly
controlled. In addition, read and write
access to files in Piper is logged. If sensitive data is accidentally committed
to Piper, the file in question can be
purged. The read logs allow administrators to determine if anyone accessed the problematic file before it
was removed.
In the Piper workflow (see Figure 4),
developers create a local copy of files in
the repository before changing them.
These files are stored in a workspace
owned by the developer. A Piper workspace is comparable to a working copy
in Apache Subversion, a local clone
in Git, or a client in Perforce. Updates
from the Piper repository can be pulled
into a workspace and merged with ongoing work, as desired (see Figure 5).
A snapshot of the workspace can be
shared with other developers for review. Files in a workspace are committed to the central repository only after
going through the Google code-review
process, as described later.
Most developers access Piper
through a system called Clients in
the Cloud, or CitC, which consists of
a cloud-based storage backend and a
Linux-only FUSE13 file system. Developers see their workspaces as directories in the file system, including their
changes overlaid on top of the full
Piper repository. CitC supports code
browsing and normal Unix tools with
no need to clone or sync state locally.
Developers can browse and edit files
anywhere across the Piper repository, and only modified files are stored
in their workspace. This structure
means CitC workspaces typically consume only a small amount of storage
(an average workspace has fewer than
10 files) while presenting a seamless
view of the entire Piper codebase to
the developer.
All writes to files are stored as snapshots in CitC, making it possible to recover previous stages of work as needed. Snapshots may be explicitly named,
restored, or tagged for review.
CitC workspaces are available on
any machine that can connect to the
cloud-based storage system, making
it easy to switch machines and pick
up work without interruption. It also
makes it possible for developers to
view each others work in CitC workspaces. Storing all in-progress work in
the cloud is an important element of
the Google workflow process. Working state is thus available to other
tools, including the cloud-based build
system, the automated test infrastructure, and the code browsing, editing,
and review tools.
[Figure 4. The Piper workflow: sync user workspace to repo, write code, code review, commit.]
Figure 5. Piper team logo; "Piper" is "Piper expanded recursively"; design source: Kirrily Anderson.
a comment). Then, without leaving
the code browser, they can send their
changes out to the appropriate reviewers with auto-commit enabled.
Piper can also be used without CitC.
Developers can instead store Piper
workspaces on their local machines.
Piper also has limited interoperability
with Git. Over 80% of Piper users today
use CitC, with adoption continuing to
grow due to the many benefits provided by CitC.
Piper and CitC make working productively with a single, monolithic
source repository possible at the scale
of the Google codebase. The design
and architecture of these systems were
both heavily influenced by the trunk-based development paradigm employed at Google, as described here.
Trunk-based development. Google
practices trunk-based development on
top of the Piper source repository. The
vast majority of Piper users work at the
head, or most recent, version of a
single copy of the code called trunk
or mainline. Changes are made to
the repository in a single, serial ordering. The combination of trunk-based
development with a central repository
defines the monolithic codebase model. Immediately after any commit, the
new code is visible to, and usable by,
all other developers. The fact that Piper
users work on a single consistent view
of the Google codebase is key for providing the advantages described later
in this article.
Trunk-based development is beneficial in part because it avoids the painful merges that often occur when it is
time to reconcile long-lived branches.
Development on branches is unusual
and not well supported at Google,
though branches are typically used
for releases. Release branches are cut
from a specific revision of the repository. Bug fixes and enhancements that
must be added to a release are typically
developed on mainline, then cherry-picked into the release branch (see
Figure 6). Due to the need to maintain
stability and limit churn on the release
branch, a release is typically a snapshot of head, with an optional small
number of cherry-picks pulled in from
head as needed. Use of long-lived
branches with parallel development
on the branch and mainline is exceedingly rare.
before being committed to the repository. Most developers can view and
propose changes to files anywhere
across the entire codebase, with the
exception of a small set of highly confidential code that is more carefully
controlled. The risk associated with
developers changing code they are
not deeply familiar with is mitigated
through the code-review process and
the concept of code ownership. The
Google codebase is laid out in a tree
structure. Each and every directory
has a set of owners who control whether a change to files in their directory
will be accepted. Owners are typically
the developers who work on the projects in the directories in question. A
change often receives a detailed code
review from one developer, evaluating
the quality of the change, and a commit approval from an owner, evaluating
the appropriateness of the change to
their area of the codebase.
Code reviewers comment on aspects of code quality, including design, functionality, complexity, testing,
naming, comment quality, and code
style, as documented by the various
language-specific Google style guides.e
Google has written a code-review tool
called Critique that allows the reviewer
to view the evolution of the code and
comment on any line of the change.
It encourages further revisions and a
conversation leading to a final "Looks Good To Me" from the reviewer, indicating the review is complete.
Google's static analysis system (Tricorder10) and presubmit infrastructure
also provide data on code quality, test
coverage, and test results automatically in the Google code-review tool. These
computationally intensive checks are
triggered periodically, as well as when
a code change is sent for review. Tricorder also provides suggested fixes
with one-click code editing for many
errors. These systems provide important data to increase the effectiveness
of code reviews and keep the Google
codebase healthy.
A team of Google developers will
occasionally undertake a set of wide-reaching code-cleanup changes to further maintain the health of the codebase. The developers who perform
these changes commonly separate
[Figure 6. Release branching: changes are developed on trunk/mainline and cherry-picked onto a release branch.]
e https://github.com/google/styleguide
Analysis
This section outlines and expands
upon both the advantages of a monolithic codebase and the costs related to
maintaining such a model at scale.
Advantages. Supporting the ultra-large scale of Google's codebase while
maintaining good performance for
tens of thousands of users is a challenge, but Google has embraced the
monolithic model due to its compelling advantages.
Most important, it supports:
Unified versioning, one source of
truth;
Extensive code sharing and reuse;
Simplified dependency management;
Atomic changes;
Large-scale refactoring;
Collaboration across teams;
Flexible team boundaries and code
ownership; and
Code visibility and clear tree
structure providing implicit team
namespacing.
A single repository provides unified
versioning and a single source of truth.
There is no confusion about which repository hosts the authoritative version
of a file. If one team wants to depend
on another team's code, it can depend
on it directly. The Google codebase includes a wealth of useful libraries, and
the monolithic repository leads to extensive code sharing and reuse.
The Google build system5 makes it
easy to include code across directories, simplifying dependency management. Changes to the dependencies
of a project trigger a rebuild of the
dependent code. Since all code is ver-
ing it to C++11 or rolling out performance optimizations9) are often managed centrally by dedicated codebase
maintainers. Such efforts can touch
half a million variable declarations or
function-call sites spread across hundreds of thousands of files of source
code. Because all projects are centrally stored, teams of specialists can do
this work for the entire company, rather than require many individuals to
develop their own tools, techniques,
or expertise.
As an example of how these benefits play out, consider Google's Compiler team, which ensures developers
at Google employ the most up-to-date
toolchains and benefit from the latest improvements in generated code
and debuggability. The monolithic
repository provides the team with
full visibility of how various languages are used at Google and allows them
to do codebase-wide cleanups to prevent changes from breaking builds or
creating issues for developers. This
greatly simplifies compiler validation,
thus reducing compiler release cycles
and making it possible for Google to
safely do regular compiler releases
(typically more than 20 per year for the
C++ compilers).
Using the data generated by performance and regression tests run on
nightly builds of the entire Google
codebase, the Compiler team tunes default compiler settings to be optimal.
For example, due to this centralized
effort, Google's Java developers all saw
their garbage collection (GC) CPU consumption decrease by more than 50%
and their GC pause time decrease by
10%40% from 2014 to 2015. In addition, when software errors are discovered, it is often possible for the team
to add new warnings to prevent reoccurrence. In conjunction with this
change, they scan the entire repository to find and fix other instances of
the software issue being addressed,
before turning to new compiler errors. Having the compiler-reject patterns that proved problematic in the
past is a significant boost to Googles
overall code health.
Storing all source code in a common
version-control repository allows codebase maintainers to efficiently analyze and change Google's source code.
Tools like Refaster11 and ClangMR15
An important aspect
of Google culture
that encourages
code quality is
the expectation
that all code is
reviewed before
being committed
to the repository.
The monolithic model makes it
easier to understand the structure of
the codebase, as there is no crossing of
repository boundaries between dependencies. However, as the scale increases, code discovery can become more
difficult, as standard tools like grep
bog down. Developers must be able
to explore the codebase, find relevant
libraries, and see how to use them
and who wrote them. Library authors
often need to see how their APIs are
being used. This requires a significant investment in code search and
browsing tools. However, Google has
found this investment highly rewarding, improving the productivity of all
developers, as described in more detail
by Sadowski et al.9
Access to the whole codebase encourages extensive code sharing and
reuse. Some would argue this model,
which relies on the extreme scalability of the Google build system, makes
it too easy to add dependencies and
reduces the incentive for software developers to produce stable and well-thought-out APIs.
Due to the ease of creating dependencies, it is common for teams to not think
about their dependency graph, making
code cleanup more error-prone. Unnecessary dependencies can increase
project exposure to downstream build
breakages, lead to binary size bloating,
and create additional work in building
and testing. In addition, lost productivity ensues when abandoned projects
that remain in the repository continue
to be updated and maintained.
Several efforts at Google have
sought to rein in unnecessary dependencies. Tooling exists to help identify
and remove unused dependencies, or
dependencies linked into the product binary for historical or accidental
reasons, that are not needed. Tooling
also exists to identify underutilized
dependencies, or dependencies on
large libraries that are mostly unneeded, as candidates for refactoring.7 One
such tool, Clipper, relies on a custom
Java compiler to generate an accurate
cross-reference index. It then uses the
index to construct a reachability graph
and determine what classes are never
used. Clipper is useful in guiding dependency-refactoring efforts by finding
targets that are relatively easy to remove
or break up.
A developer can
make a major
change touching
hundreds or
thousands of
files across the
repository in a
single consistent
operation.
by teams that need to review an ongoing stream of simple refactorings resulting from codebase-wide clean-ups
and centralized modernization efforts.
Alternatives
As the popularity and use of distributed version control systems (DVCSs)
like Git have grown, Google has considered whether to move from Piper
to Git as its primary version-control
system. A team at Google is focused
on supporting Git, which is used by
Google's Android and Chrome teams
outside the main Google repository.
The use of Git is important for these
teams due to external partner and open
source collaborations.
The Git community strongly suggests and prefers that developers have more and smaller repositories. A Git clone operation requires copying all content to one's local machine, a procedure incompatible with a large repository. To move to Git-based source
hosting, it would be necessary to split
Google's repository into thousands of
separate repositories to achieve reasonable performance. Such reorganization
would necessitate cultural and workflow changes for Googles developers.
As a comparison, Google's Git-hosted
Android codebase is divided into more
than 800 separate repositories.
Given the value gained from the existing tools Google has built and the
many advantages of the monolithic
codebase structure, it is clear that moving to more and smaller repositories
would not make sense for Google's
main repository. The alternative of
moving to Git or any other DVCS that
would require repository splitting is
not compelling for Google.
Current investment by the Google
source team focuses primarily on the
ongoing reliability, scalability, and
security of the in-house source systems. The team is also pursuing an
experimental effort with Mercurial,g
an open source DVCS similar to Git.
The goal is to add scalability features to the Mercurial client so it can
efficiently support a codebase the
size of Google's. This would provide Google's developers with the alternative of using popular DVCS-style workflows in conjunction with the central
g http://mercurial.selenic.com/
contributed articles
DOI:10.1145/ 2851485
λ > 4: An Improved Lower Bound on the Growth Constant of Polyominoes
What is λ? The universal constant arises in the study
of three completely unrelated fields: combinatorics,
percolation, and branched polymers. In combinatorics,
analysis of self-avoiding walks (SAWs, or non-self-intersecting lattice paths starting at the origin, counted
by lattice units), simple polygons or self-avoiding
polygons (SAPs, or closed SAWs, counted by either
perimeter or area), and polyominoes (SAPs possibly
with holes, or edge-connected sets of lattice squares,
counted by area) are all related. In statistical physics,
SAWs and SAPs play a significant role in percolation
processes and in the collapse transition that branched
polymers undergo when being heated. A collection
edited by Guttmann15 gives an excellent review of all
these topics and the connections between them. In this article, we describe
our effort to prove that the growth constant (or asymptotic growth rate, also
called the connective constant) of
polyominoes is strictly greater than 4.
To this aim, we exploited to the maximum possible computer resources
Figure 1. The single monomino (A(1) = 1),
the two dominoes (A(2) = 2), the A(3) = 6
triominoes, and the A(4) = 19 tetrominoes
(Tetris pieces).
ing a porous solid, and similar processes, representing space as a lattice with
two distinct types of cells.
In the literature of statistical physics,
fixed polyominoes are usually called
strongly embedded lattice animals,
and in that context, the analogue of
the growth rate of polyominoes is the
growth constant of lattice animals.
The terms high and low temperature
mean high and low density of clusters, respectively, and the term free
energy corresponds to the natural
logarithm of the growth constant. Lattice animals were used for computing
the mean cluster density in percolation
processes, as in Gaunt et al.,12 particularly in processes involving fluid flow in
random media. Sykes and Glen29 were the first to observe that A(n), the total number of connected clusters of size n, grows asymptotically like C·λ^n·n^θ, where λ is Klarner's constant and C, θ are two other fixed values.
Collapse of branched polymers. Another important topic in statistical
physics is the existence of a collapse
transition of branched polymers in
dilute solution at a high temperature.
In physics, a field is an entity each of
whose points has a value that depends
on location and time. Lubensky and
Isaacson22 developed a field theory of
branched polymers in the dilute limit,
using statistics of (bond) lattice animals (important in the theory of percolation) to imply when a solvent is good
or bad for a polymer. Derrida and Herrmann10 investigated two-dimensional
branched polymers by looking at lattice animals on a square lattice and
studying their free energy. Flesia et al.11
made the connection between collapse
processes to percolation theory, relating the growth constant of strongly embedded site animals to the free energy
in the processes. Several models of
branched polymers in dilute solution
were considered by Madras et al.,24
proving bounds on the growth constants for each such model.
Brief History. Determining the exact value of λ (or even setting good bounds on it) is a notably difficult problem in enumerative combinatorics. In 1967, Klarner19 showed that the limit λ = lim_{n→∞} A(n)^{1/n} exists. Since then, λ has been called Klarner's constant. Only in 1999, Madras23 proved the stronger statement that the asymptotic growth
ing of spheres is the densest possible.
This proof, in addition to involving
extensive case enumerations at different levels, is also very complicated in
the interaction between the various
parts. In August 2014, a team headed
by Hales announced the completion
of the Flyspeck project, constructing a
formal proof of Kepler's conjecture.16
Yet another example is the proof by Tucker30 for Lorenz's conjecture (number 14 in Smale's list of challenging problems for the 21st century). The conjecture states that Lorenz's system
of three differential equations, providing a model for atmospheric convection, supports a strange attractor.
Tucker (page 104)30 described the run
of a parallel ODE solver several times
on different computer setups, obtaining similar results.
Certified Computation. This technique is based on the idea it may be
easier to check a given answer for correctness than come up with such an
answer from scratch. The prototype
example is the solution of an equation like 3x³ − 4x² − 5x + 2 = 0. While
it may be a challenge to find the solution x = 2, it is straightforward to substitute the solution into the equation
and check whether it is fulfilled. The
result is trustworthy not because it is
accompanied by a formal proof but
because it is so simple to check, much
simpler than the algorithm (perhaps
Newton iteration in this case) used to
solve the problem in the first place.
In addition to the solution itself, it
may be required that a certifying algorithm provide a certificate in order
to facilitate the checking procedure.25
Developing such certificates and computing them without excessive overhead may require algorithmic innovations (see the first sidebar Certified
Computations).
In our case, the result computed
by our program can be interpreted
as the eigenvalue of a giant matrix
Q, which is not explicitly stored but
implicitly defined through an iteration procedure. The certificate is a
vector v that is a good-enough approximation of the corresponding eigenvector. From this vector, one can
compute certified bounds on the eigenvalue in a rather straightforward
way by comparing the vector v to the
product Qv.
Certified Computations
A certifying algorithm not only produces the result but also justifies its correctness by
supplying a certificate that makes it easy to check the result. In contrast with formally
verified computation, correctness is established for a particular instance of the
problem and a concrete result. Here, we illustrate this concept with a few examples; see
the survey by McConnell et al.25 for a thorough treatment.
The greatest common divisor. The greatest common divisor of two numbers can be
found through the ancient Euclidean algorithm. For example, the greatest common
divisor of 880215 and 244035 is 15. Checking that 15 is indeed a common divisor is
rather easy, but it is not clear that 15 is the greatest. Luckily, the extended Euclidean algorithm provides a certificate: two integers p = −7571 and q = 27308, such that
880215p + 244035q = 15. This proves any common divisor of 880215 and 244035 must
divide 15. No number greater than 15 can thus divide 880215 and 244035.
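As a small sketch (not part of the sidebar), the multipliers can be produced by the extended Euclidean algorithm, and checking the resulting certificate is a single multiply-and-compare:

def extended_gcd(a, b):
    # Returns (g, p, q) with g = gcd(a, b) = a*p + b*q.
    if b == 0:
        return a, 1, 0
    g, p, q = extended_gcd(b, a % b)
    return g, q, p - (a // b) * q

g, p, q = extended_gcd(880215, 244035)
assert g == 15 and 880215 * p + 244035 * q == 15   # the certificate check

The pair produced this way need not equal the one quoted above; any pair satisfying the identity serves as a certificate.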
Systems of linear equations and inequalities. Consider the three equations 4x − 3y + z = 2, 3x − y + z = 3, and x − 7y − z = 4 in three unknowns x, y, z. It is straightforward to verify any proposed solution; however, the equations have no solution. Multiplying them by −4, 5, 1, respectively, and adding them up leads to the contradiction 0 = 11.
The three multipliers thus provide an easy-to-check certificate for the answer. Such
multipliers can always be found for an unsolvable linear system and can be computed
as a by-product of the usual solution algorithms. A well-known extension of this
example is linear programming, the optimization of a linear objective function subject
to linear equations and inequalities, where an optimality certificate is provided by the
dual solution.
Testing a graph for 3-connectedness. Certifying that a graph is not 3-connected is
straightforward. The certificate consists of two vertices whose removal disconnects the
graph into several pieces. It has been known since 1973 that a graph can be tested for
3-connectedness in linear time,17 but all algorithms for this task are complicated. While
providing a certificate in the negative case is easy, defining an easy-to-check certificate
in the positive case and finding such a certificate in linear time has required graph-theoretic and algorithmic innovations.28 This example illustrates that certifiability
is not primarily an issue of running time. All algorithms, for constructing, as well
as for checking, the certificate, run in linear time, just like classical non-certifying
algorithms. The crucial distinction is that the short certificate-checking program is by
far simpler than the construction algorithm.
Comparison with the class NP. There is some analogy between a certifying
algorithm and the complexity class NP. For membership in NP, it is necessary only
to have a certificate by which correctness of a solution can be checked in polynomial
time. It does not matter how difficult it is to come up with the solution and find the
certificate. In contrast with a certifying algorithm, the criterion for the checker is not
simplicity but more clear-cutrunning time. Another difference is that only positive
answers to the problem need to be certified.
[Figure: cells on a twisted cylinder of width W = 5.]
These two approaches complement each other; a simpler checking procedure is more amenable to a
formal proof and verification procedure. However, we did not go to such
lengths; a formal verification of the
program would have made sense only
in the context of a full verification that
includes also the human-readable
tion of a Motzkin path and the sidebar
for the relation between states and
Motzkin paths. Such a path connects
the integer grid points (0,0) and (n,0)
with n steps, consisting only of steps
taken from {(1,1), (1,0), (1,−1)} and not going under the x axis. Asymptotically, M_n ~ 3^n n^{−3/2}, and M_W thus increases roughly by a factor of 3 when W is
incremented by 1.
The number of polyominoes with
n cells that have state s as the boundary equals the number of paths the
automaton can take from the starting
state to s, involving n transitions in
which a cell is added to the polyomino. We compute these numbers in a
dynamic-programming recursion.
Figure 5. The dependence between the different groups of ynew and yold.
Method
In 2004, a sequential program that
computes λW for any perimeter was developed by Ares Ribó as part of her Ph.D. thesis under the supervision of author Günter Rote. The program
first computes the endpoints of the
outgoing edges from all states of the
automaton and saves them in two
long arrays succ0 and succ1 that correspond to adding an empty or an occupied cell. Both arrays are of length
M := MW+1. Two successive iteration
vectors, which contain the number of
polyominoes corresponding to each
boundary, are stored as two arrays yold
and ynew of numbers, also of length M.
The four arrays are indexed from 0 to
M − 1. After initializing yold := (1,0,0,…),
each iteration computes the new version of y through a very simple loop:
yold := (1,0,0,…);
repeat till convergence:
    ynew[0] := 0;
    for s := 1,…,M − 1:
(∗)     ynew[s] := ynew[succ0[s]] + yold[succ1[s]];
    yold := ynew;
The pointer succ0[s] may be null, in
which case the corresponding zero entry (ynew[0]) is used.
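The same loop in Python, as a compact sketch (not the authors' program; succ0 and succ1 are assumed to be lists of indices with None for null pointers), together with the eigenvalue bounds given in relation (1) below:

def iterate(y_old, succ0, succ1):
    M = len(y_old)
    y_new = [0] * M                       # y_new[0] stays 0
    for s in range(1, M):
        a = y_new[succ0[s]] if succ0[s] is not None else 0
        b = y_old[succ1[s]]
        y_new[s] = a + b                  # the iteration (*)
    return y_new

def eigenvalue_bounds(y_old, y_new):
    # Ratios are taken only where y_old[s] > 0; lambda_W lies between the two extremes.
    ratios = [y_new[s] / y_old[s] for s in range(len(y_old)) if y_old[s] > 0]
    return min(ratios), max(ratios)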
As explained earlier, each index s
represents a state. The states are encoded by Motzkin paths, and these
paths can be mapped to numbers s between 0 and M − 1 in a bijective manner. See the online appendix for more. In the iteration (∗), the vector ynew
depends on itself, but this does not
contributed articles
cause any problem because succ0[s],
if it is non-null, is always less than s.
There are thus no circular references,
and each entry is set before it is used.
In fact, the states can be partitioned into groups G1, G2, …, GW; the group Gi contains the states corresponding to boundaries in which i is the smallest index of an occupied cell, or the boundaries that start with i − 1 empty cells. The dependence between the entries of the groups is shown schematically in Figure 5; succ0[s] of an entry s ∈ Gi (for 1 ≤ i ≤ W − 1), if it is non-null, belongs to Gi+1.
At the end, ynew is moved to yold to
start the new iteration. In the iteration (∗), the new vector ynew is a linear function of the old vector yold and, as already indicated, can be written as a linear transformation yold := Q·yold. The nonnegative integer matrix Q is implicitly given through the iteration (∗). We are interested in the growth rate of the sequence of vectors yold, which is determined by the dominant eigenvalue λW of Q. It is not difficult to show5 that after every iteration, λW is contained in the interval

min_s (ynew[s] / yold[s])  ≤  λW  ≤  max_s (ynew[s] / yold[s]).   (1)
Representing Boundaries
as Motzkin Paths
The figure here illustrates the representation of polyomino boundaries as Motzkin
paths. The bottom part of the figure shows a partially constructed polyomino on
a twisted cylinder of width 16. The dashed line indicates two adjacent cells that
are connected around the cylinder, where such a connection is not immediately
apparent. The boundary cells (top row) are shown in darker gray. The light-gray
cells away from the boundary need not be recorded individually; what matters is the
connectivity among the boundary cells they provide. This connectivity is indicated
in a symbolic code -AAA-B-CC-AA--AA. Boundary cells in the same component are
represented by the same letter, and the character '-' denotes an empty cell. However,
we did not use this code in our program. Instead, we represented a boundary as a
Motzkin path, as shown in the top part of the figure, because this representation
allows for a convenient bijection to successive integers and thus for a compact storage
of the boundary in a vector. Intuitively, the Motzkin path follows the movements of
a stack when reading the code from left to right. Whenever a new component starts
(such as component A in position 2 or component B in position 6), the path moves up.
Whenever a component is temporarily interrupted (such as component A in position
5), the path also moves up. The path moves down when an interrupted component
is resumed (such as component A in positions 11 and 15) or when a component is
completed (positions 7, 10, and 17). The crucial property is that components cannot
cross; that is, a pattern like ..A..B..A..B.. cannot occur. As a consequence of
these rules, the occupied cells correspond to odd levels in the path, and the free cells
correspond to even levels. The correspondence between boundaries and Motzkin
paths was pointed out by Stefan Felsner of the Technische Universität Berlin (private
communication).
Polyomino boundaries and Motzkin paths.
Sequential Runs
In 2004, we obtained good approximations of yold up to W = 22. The program required quite a bit of main
memory (RAM) by the standards of
the time. The computation of λ22
(see Figure 6), we estimated that only
when we reach W = 27 would we break
the mythical barrier of 4.0. However,
as mentioned earlier, the storage
requirement is proportional to M, and M increases roughly by a factor of 3 when W is incremented by 1.
With this exponential growth of both
memory consumption and running
time, the goal of breaking the barrier
was then out of reach.
Computing λ27
Environment. In the spring of 2013,
we were granted access to the Hewlett
Packard ProLiant DL980 G7 server of
the Hasso Plattner Institute Future
SOC Lab in Potsdam, Germany (see
Figure 7). This server consists of eight
Intel Xeon X7560 nodes (Intel64 architecture), each with eight physical
2.26GHz processors (16 virtual cores),
for a total of 64 processors (128 virtual cores).
Figure 7. Front view of the supercomputer we used, a box of approximately 45 × 35 × 70 cm.
Validity and Certification
Our proof depends heavily on computer calculations, raising two issues
about its validity:
Calculations. Reproducing elaborate calculations on a large computer
is difficult; particularly when a complicated parallel computer program is involved, everybody should be skeptical
about the reliability of the results; and
Computations. We performed the computations with 32-bit floating-point numbers.
We address these issues in turn.
What our program was trying to
compute is an eigenvalue of a matrix. The amount and length of the
computations are irrelevant to the
fact that eventually we have stored
on disk a witness array of floating-point numbers (the proof), approximately 450GB in size, which is a good
approximation of the eigenvector
corresponding to λ27. This array provides rigorous bounds on the true eigenvalue λ27, because the relation (1)
holds for any vector yold and its successor vector ynew. To check the proof and
evaluate the bounds (1), a program
has to read only the approximate eigenvector yold and carry out one iteration (*). This approach of providing
simple certificates for the result of
complicated computations is the philosophy of certifying algorithms.25
To ensure the correctness of our
checking program, we relied on traditional methods (such as testing and
code inspection). Some parts of the
program (such as reading the data
from the files) are irrelevant for the
correctness of the result. The main
body of the program consists of a
few simple loops (such as the iteration (*) and the evaluation of (1)). The
only technically challenging part of
the algorithm is the successor computation. For this task, we had two
programs at our disposal that were
written independently by two people
who used different state representations, lived in different countries, and did their work several years
apart. We ran two different checking
programs based on these procedures,
giving us additional confidence. We
also tested explicitly that the two successor programs yielded the same results. Both checking programs ran in
a purely sequential manner, and the
review articles
Today's social bots are sophisticated and
sometimes menacing. Indeed, their presence
can endanger online ecosystems as well
as our society.
BY EMILIO FERRARA, ONUR VAROL, CLAYTON DAVIS,
FILIPPO MENCZER, AND ALESSANDRO FLAMMINI
The Rise of
Social Bots
Bots (short for software robots) have been around
since the early days of computers. One compelling
example of bots is chatbots, algorithms designed to
hold a conversation with a human, as envisioned by
Alan Turing in the 1950s.33 The dream of designing a
computer algorithm that passes the Turing test has
driven artificial intelligence research for decades,
as witnessed by initiatives like the Loebner Prize,
awarding progress in natural language processing.a
Many things have changed since the early days of
AI, when bots like Joseph Weizenbaum's ELIZA,39
mimicking a Rogerian psychotherapist, were
developed as demonstrations or for delight.
Today, social media ecosystems populated by
hundreds of millions of individuals present real
incentives, including economic and political ones,
a www.loebner.net/Prizef/loebner-prize.html
may result in several levels of damage
to society. For example, bots may artificially inflate support for a political
candidate;28 such activity could endanger democracy by influencing the
outcome of elections. In fact, this kind
of abuse has already been observed:
during the 2010 U.S. midterm elections, social bots were employed to
support some candidates and smear
their opponents, injecting thousands
of tweets pointing to websites with
fake news.28 A similar case was report-
This network visualization illustrates how bots are used to affect, and possibly manipulate, the online debate about vaccination policy.
It is the retweet network for the #SB277 hashtag, about a recent California law on vaccination requirements and exemptions. Nodes
represent Twitter users, and links show how information spreads among users. The node size represents influence (times a user is
retweeted), the color represents bot scores: red nodes are highly likely to be bot accounts, blue nodes are highly likely to be humans.
its accuracy, is highly popular and
endorsed by many, exerting an influence against which we haven't yet developed antibodies. Our vulnerability
makes it possible for a bot to acquire
significant influence, even unintentionally.2 Sophisticated bots can generate personas that appear as credible
followers, and thus are more difficult
for both people and filtering algorithms to detect. They make for valuable entities on the fake follower market, and allegations of acquisition of
fake followers have touched several
prominent political figures in the U.S.
and worldwide.
Journalists, analysts, and researchers increasingly report more
examples of the potential dangers
brought by social bots. These include
the unwarranted consequences that
the widespread diffusion of bots
may have on the stability of markets.
There have been claims that Twitter signals can be leveraged to predict the stock market,5 and there is
an increasing amount of evidence
showing that market operators pay
attention and react promptly to information from social media. On April
23, 2013, for example, the Syrian
Electronic Army hacked the Twitter
account of the Associated Press and
posted a false rumor about a terror
attack on the White House in which
President Obama was allegedly injured. This provoked an immediate
crash in the stock market. On May 6,
2010 a flash crash occurred in the U.S.
stock market, when the Dow Jones
plunged over 1,000 points (about 9%)
within minutes, the biggest one-day point decline in history. After a five-month-long investigation, the role of
high-frequency trading bots became
obvious, but it remains unclear
whether these bots had access to information from the social Web.22
The combination of social bots
with an increasing reliance on automatic trading systems that, at least
partially, exploit information from social media, is rife with risks. Bots can
amplify the visibility of misleading
information, while automatic trading
systems lack fact-checking capabilities. A recent orchestrated bot campaign successfully created the appearance of a sustained discussion about a
tech company called Cynk. Automatic
appropriate content is identified,
the bots can automatically produce
responses through natural language
algorithms, possibly including references to media or links pointing to
external resources. Other bots aim
at tampering with the identities of
legitimate people: some are identity
thieves, adopting slight variants of
real usernames, and stealing personal information such as pictures and
links. Even more advanced mechanisms can be employed; some social
bots are able to clone the behavior
of legitimate users, by interacting
with their friends and posting topically coherent content with similar
temporal patterns.
A Taxonomy of Social Bot
Detection Systems
For all the reasons outlined here, the
computing community is engaging
in the design of advanced methods
to automatically detect social bots,
or to discriminate between humans
and bots. The strategies currently employed by social media services appear inadequate to counter this phenomenon, and the academic community's efforts in this direction have only just started.
Here, we propose a simple taxonomy that divides the approaches proposed in literature into three classes:
bot detection systems based on social
network information; systems based
on crowdsourcing and leveraging
human intelligence; and, machinelearning methods based on the identification of highly revealing features
that discriminate between bots and
humans. A hard categorization of a detection strategy into one of these three classes is sometimes difficult, since some strategies exhibit mixed elements; we therefore also present methods that combine ideas from all three approaches.
solely from network structure information. This led Alvisi et al.3 to recommend a portfolio of complementary
detection techniques, and the manual
identification of legitimate social network users to aid in the training of supervised learning algorithms.
Crowdsourcing
Social Bot Detection
Wang et al.38 have explored the possibility of human detection, suggesting the crowdsourcing of social bot
detection to legions of workers. As a
proof-of-concept, they created an Online Social Turing Test platform. The
authors assumed that bot detection is
a simple task for humans, whose ability to evaluate conversational nuances
like sarcasm or persuasive language,
or to observe emerging patterns and
anomalies, is yet unparalleled by machines. Using data from Facebook
and Renren (a popular Chinese online
social network), the authors tested
the efficacy of humans, both expert
annotators and workers hired online,
at detecting social bot accounts simply from the information on their profiles. The authors observed that the detection rate for hired workers drops off
over time, although it remains good
enough to be used in a majority voting
protocol: the same profile is shown to
multiple workers and the opinion of
the majority determines the final verdict. This strategy exhibits a near-zero
false positive rate, a very desirable feature for a service provider.
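As a toy illustration of this voting protocol (not the platform built by Wang et al.), the sketch below aggregates the verdicts that several workers give for one profile; the tie-breaking rule in favor of "human" is an arbitrary assumption.

from collections import Counter

def majority_verdict(votes):
    # votes: booleans from individual workers, True meaning "bot";
    # ties default to "human" to keep false positives low
    counts = Counter(votes)
    return counts[True] > counts[False]

# Example: five workers inspect the same profile.
print("bot" if majority_verdict([True, True, False, True, False]) else "human")  # bot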
Three drawbacks undermine the
feasibility of this approach: first, although the authors make a general
claim that crowdsourcing the detection of social bots might work if implemented from the early stages, this
solution might not be cost effective
for a platform with a large pre-existing
user base, like Facebook and Twitter.
Second, to guarantee that a minimal
number of human annotators can be
employed to minimize costs, expert
workers are still needed to accurately
detect fake accounts, as the average
worker does not perform well individually. As a result, to reliably build
a ground-truth of annotated bots,
large social network companies like
Facebook and Twitter are forced to
hire teams of expert analysts;30 however, such a choice might not be suit-
Class | Description
Network | Network features capture various dimensions of information diffusion patterns. Statistical features can be extracted from networks based on retweets, mentions, and hashtag co-occurrence. Examples include degree distribution, clustering coefficient, and centrality measures.29
User |
Friends |
Timing | Timing features capture temporal patterns of content generation (tweets) and consumption (retweets); examples include the signal similarity to a Poisson process,18 or the average time between two consecutive posts.
Content | Content features are based on linguistic cues computed through natural language processing, especially part-of-speech tagging; examples include the frequency of verbs, nouns, and adverbs in tweets.
Sentiment |
Figure 1. Common features used for social bot detection. (a) The network of hashtags co-occurring in the tweets of a given user. (b) Various
sentiment signals including emoticon, happiness and arousal-dominance-valence scores. (c) The volume of content produced and consumed (tweeting and retweeting) over time.
Figure 2. User behaviors that best discriminate social bots from humans.
Social bots retweet more than humans and have longer user names, while they produce fewer tweets,
replies and mentions, and they are retweeted less than humans. Bot accounts also tend to be more recent.
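As an illustration of the feature-based (machine learning) class of detectors, the sketch below trains an off-the-shelf classifier on the behavioral features listed in Figure 2. The feature names, the toy data, and the use of scikit-learn's random forest are assumptions made for this example; they are not the pipeline of BotOrNot or of any other system discussed here.

from sklearn.ensemble import RandomForestClassifier

FEATURES = ["n_retweets", "account_age_days", "n_tweets", "n_replies",
            "n_mentions", "n_times_retweeted", "username_length"]

def to_vector(account):
    # turn an account's metadata dictionary into a feature vector
    return [account[f] for f in FEATURES]

# Hypothetical labeled training data: 1 = bot, 0 = human.
train_accounts = [
    {"n_retweets": 900, "account_age_days": 30, "n_tweets": 50, "n_replies": 2,
     "n_mentions": 5, "n_times_retweeted": 1, "username_length": 15},
    {"n_retweets": 40, "account_age_days": 2000, "n_tweets": 800, "n_replies": 300,
     "n_mentions": 250, "n_times_retweeted": 90, "username_length": 8},
]
labels = [1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit([to_vector(a) for a in train_accounts], labels)

new_account = {"n_retweets": 700, "account_age_days": 10, "n_tweets": 20,
               "n_replies": 1, "n_mentions": 3, "n_times_retweeted": 0,
               "username_length": 14}
print(clf.predict_proba([to_vector(new_account)])[0][1])   # bot-likelihood score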
whereas Sybil accounts spend their
time harvesting profiles and befriending other accounts. Intuitively, social
bot activities tend to be simpler in
terms of variety of behavior exhibited.
By also identifying highly predictive
features such as invitation frequency,
outgoing requests accepted, and network clustering coefficient, Renren
is able to classify accounts into two
categories: bot-like and human-like
prototypical profiles.42 Sybil accounts
on Renren tend to collude and work
together to spread similar content:
this additional signal, encoded as
content and temporal similarity, is
used to detect colluding accounts. In
some ways, the Renren approach37,42
combines the best of network- and
behavior-based conceptualizations
of Sybil detection. By achieving good
results even utilizing only the last 100
click events for each user, the Renren
system obviates the need to store
and analyze the entire click history
for every user. Once the parameters
are tweaked against ground truth,
the algorithm can be seeded with a
fixed number of known legitimate accounts and then used for mostly unsupervised classification. The "Sybil until proven otherwise" approach (the opposite of the innocent-by-association strategy) baked into this
framework does lend itself to detecting previously unknown methods of
attack: the authors recount the case
of spambots embedding text in images to evade detection by content analysis and URL blacklists. Other systems implementing mixed methods,
like CopyCatch4 and SynchroTrap,10
also score comparatively low false
positive rates with respect to, for example, network-based methods.
Master of Puppets
If social bots are the puppets, additional efforts will have to be directed
at finding their masters. Governmentsg and other entities with sufficient resourcesh have been alleged
to use social bots to their advantage.
g Russian Twitter political protests "swamped by spam"; www.bbc.com/news/technology-16108876
h Fake Twitter accounts used to promote tar sands pipeline; www.theguardian.com/environment/2011/aug/05/fake-twitter-tar-sands-pipeline
only when the effectiveness of early
detection will sufficiently increase
the cost of deception.
The future of social media ecosystems might already point in the
direction of environments where
machine-machine interaction is the
norm, and humans navigate a world
populated mostly by bots. We believe
there is a need for bots and humans
to be able to recognize each other, to
avoid bizarre, or even dangerous, situations based on false assumptions
of human interlocutors.j
Acknowledgments
The authors are grateful to Qiaozhu
Mei, Zhe Zhao, Mohsen JafariAsbagh,
Prashant Shiralkar, and Aram Galstyan
for helpful discussions.
This work is supported in part by the
Office of Naval Research (grant N15A020-0053), National Science Foundation (grant CCF-1101743), DARPA
(grant W911NF-12-1-0037), and the
James McDonnell Foundation (grant
220020274). The funders had no role
in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
j That Time 2 Bots Were Talking, and Bank
of America Butted In; www.theatlantic.com/
technology/
References
1. Abokhodair, N., Yoo, D. and McDonald, D.W. Dissecting
a social botnet: Growth, content, and influence in
Twitter. In Proceedings of the 18th ACM Conference
on Computer-Supported Cooperative Work and Social
Computing (2015). ACM.
2. Aiello, L.M., Deplano, M., Schifanella, R. and Ruffo,
G. People are strange when you're a stranger:
Impact and influence of bots on social networks. In
Proceedings of the 6th AAAI International Conference
on Weblogs and Social Media (2012). AAAI, 10-17.
3. Alvisi, L., Clement, A., Epasto, A., Lattanzi, S. and
Panconesi, A. SoK: The evolution of Sybil defense via social networks. In Proceedings of the 2013 IEEE Symposium on Security and Privacy. IEEE, 382-396.
4. Beutel, A., Xu, W., Guruswami, V., Palow, C. and Faloutsos, C. CopyCatch: Stopping group attacks
by spotting lockstep behavior in social networks. In
Proceedings of the 22nd International Conference on
World Wide Web (2013), 119-130.
5. Bollen, J., Mao, H. and Zeng, X. Twitter mood predicts
the stock market. J. Computational Science 2, 1
(2011), 1-8.
6. Boshmaf, Y., Muslukhov, I., Beznosov, K. and Ripeanu,
M. Key challenges in defending against malicious
socialbots. In Proceedings of the 5th USENIX
Conference on Large-scale Exploits and Emergent
Threats, Vol. 12 (2012).
7. Boshmaf, Y., Muslukhov, I., Beznosov, K. and Ripeanu,
M. Design and analysis of a social botnet. Computer Networks 57, 2 (2013), 556-578.
8. Briscoe, E.J., Appling, D.S. and Hayes, H. Cues
to deception in social media communications.
In Proceedings of the 47th Hawaii International
Conference on System Sciences (2014). IEEE,
1435-1443.
9. Cao, Q., Sirivianos, M., Yang, X. and Pregueiro, T. Aiding
the detection of fake accounts in large scale social
online services. NSDI (2012), 197-210.
10. Cao, Q., Yang, X., Yu, J. and Palow, C. Uncovering large
groups of active malicious accounts in online social
networks. In Proceedings of the 2014 ACM SIGSAC
Conference on Computer and Communications
Security. ACM, 477-488.
11. Cassa, C.A., Chunara, R., Mandl, K. and Brownstein,
J.S. Twitter as a sentinel in emergency
situations: Lessons from the Boston marathon
explosions. PLoS Currents: Disasters (July
2013); http://dx.doi.org/10.1371/currents.dis.
ad70cd1c8bc585e9470046cde334ee4b
12. Conover, M., Ratkiewicz, J., Francisco, M., Gonçalves,
B., Menczer, F. and Flammini, A. Political polarization
on Twitter. In Proceedings of the 5th International
AAAI Conference on Weblogs and Social Media
(2011), 89-96.
13. Davis, C.A., Varol, O., Ferrara, E., Flammini, A. and
Menczer, F. BotOrNot: A system to evaluate social
bots. In Proceedings of the 25th International World
Wide Web Conference Companion (2016); http://dx.doi.
org/10.1145/2872518.2889302 Forthcoming. Preprint
arXiv:1602.00975.
14. Edwards, C., Edwards, A., Spence, P.R. and Shelton,
A.K. Is that a bot running the social media
feed? Testing the differences in perceptions of
communication quality for a human agent and a bot
agent on Twitter. Computers in Human Behavior 33
(2014), 372-376.
15. Elovici, Y., Fire, M., Herzberg, A. and Shulman, H.
Ethical considerations when employing fake identities
in online social networks for research. Science and
Engineering Ethics (2013), 1-17.
16. Elyashar, A., Fire, M., Kagan, D. and Elovici, Y. Homing
socialbots: Intrusion on a specific organization's
employee using Socialbots. In Proceedings of the
2013 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining.
ACM, 1358-1365.
17. Freitas, C.A. et al. Reverse engineering socialbot
infiltration strategies in Twitter. In Proceedings of
the 2015 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining.
ACM 2015.
18. Ghosh, R., Surachawala, T. and Lerman, K. Entropy-based classification of retweeting activity on Twitter.
In Proceedings of the KDD Workshop on Social
Network Analysis (2011).
19. Golder, S.A. and Macy, M.W. Diurnal and seasonal
mood vary with work, sleep, and daylength
across diverse cultures. Science 333, 6051 (2011),
1878-1881.
20. Gupta, A., Lamba, H. and Kumaraguru, P. $1.00 per
RT #BostonMarathon #PrayForBoston: Analyzing fake
content on Twitter. eCrime Researchers Summit. IEEE
(2013), 1-12.
21. Heymann, P., Koutrika, G. and Garcia-Molina,
H. Fighting spam on social web sites: A survey
of approaches and future challenges. Internet
Computing 11, 6 (2007). IEEE, 36-45.
22. Hwang, T., Pearce, I. and Nanis, M. Socialbots: Voices
from the fronts. ACM Interactions 19, 2 (2012), 38-45.
23. Kramer, A.D. Guillory, J.E. and Hancock, J.T.
Experimental evidence of massive-scale emotional
contagion through social networks. In Proceedings
of the National Academy of Sciences (2014),
201320040.
24. Lee, K., Eoff, B.D., and Caverlee, J. Seven months with
the devils: A long-term study of content polluters on
Twitter. In Proceedings of the 5th International AAAI
Conference on Weblogs and Social Media (2011),
185-192.
25. Messias, J., Schmidt, L., Oliveira, R. and
Benevenuto, F. You followed my bot! Transforming
robots into influential users in Twitter. First
Monday 18, 7 (2013).
26. Metaxas, P.T. and Mustafaraj, E. Social media and the
elections. Science 338, 6106 (2012), 472-473.
27. Paradise, A., Puzis, R. and Shabtai, A. Anti-reconnaissance tools: Detecting targeted socialbots. Internet Computing 18, 5 (2014), 11-19.
28. Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B.,
Flammini, A. and Menczer, F. Detecting and tracking
political abuse in social media. In Proceedings of the
5th International AAAI Conference on Weblogs and
Social Media (2011), 297-304.
29. Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B.,
Patil, S., Flammini, A. and Menczer, F. Truthy: Mapping
the spread of astroturf in microblog streams. In
Proceedings of the 20th International Conference on
the World Wide Web (2011), 249-252.
30. Stein, T., Chen, E. and Mangla, K. Facebook immune
research highlights

Technical Perspective
Combining Logic and Probability
Our solution is obtained by reducing probabilistic theorem proving (PTP) to lifted weighted model counting. We
first do the corresponding reduction for the propositional
case, extending previous work by Sang et al.22 and Chavira
and Darwiche.3 We then lift this approach to the first-order
level, and refine it in several ways. We show that our algorithm
can be exponentially more efficient than first-order variable
elimination, and is never less efficient (up to constants). For
domains where exact inference is not feasible, we propose
a sampling-based approximate version of our algorithm.
Finally, we report experiments in which PTP greatly outperforms first-order variable elimination and belief propagation,
and discuss future research directions.
2. LOGIC AND THEOREM PROVING
We begin with a brief review of propositional logic, first-order
logic and theorem proving.
The simplest formulas in propositional logic are atoms:
individual symbols representing propositions that may be
true or false in a given world. More complex formulas are
recursively built up from atoms and the logical connectives
¬ (negation), ∧ (conjunction), ∨ (disjunction), ⇒ (implication), and ⇔ (equivalence). For example, A ⇒ (B ∧ C) is true iff
A is false or B and C are true. A knowledge base (KB) is a set
of logical formulas. The fundamental problem in logic is
determining entailment, and algorithms that do this are
called theorem provers. A knowledge base K entails a query
formula Q iff Q is true in all worlds in which all formulas in
K are true, a world being an assignment of truth values to
all atoms.
A world is a model of a KB iff the KB is true in it. Theorem
provers typically first convert K and Q to conjunctive normal form (CNF). A CNF formula is a conjunction of clauses,
each of which is a disjunction of literals, each of which is an
atom or its negation. For example, the CNF of A ⇒ (B ∧ C) is (¬A ∨ B) ∧ (¬A ∨ C). A unit clause consists of a single literal.
Entailment can then be computed by adding ¬Q to K and
determining whether the resulting KB KQ is satisfiable, that
is, whether there exists a world where all clauses in KQ are
true. If not, KQ is unsatisfiable, and K entails Q. Algorithm 1
shows this basic theorem proving schema. CNF(K) converts K
to CNF, and SAT(C) returns True if C is satisfiable and False
otherwise.
Algorithm 1 TP(KB K, query Q)
KQ ← K ∪ {¬Q}
return ¬SAT(CNF(KQ))
The earliest theorem prover is the Davis-Putnam algorithm
(henceforth called DP).6 It makes use of the resolution rule:
if a KB contains the clauses A1 ∨ ... ∨ An and B1 ∨ ... ∨ Bm, where the As and Bs represent literals, and some literal Ai is the negation of some literal Bj, then the clause A1 ∨ ... ∨ Ai-1 ∨ Ai+1 ∨ ... ∨ An ∨ B1 ∨ ... ∨ Bj-1 ∨ Bj+1 ∨ ... ∨ Bm can be added
to the KB. For each atom A in the KB, DP resolves every pair
of clauses C1, C2 in the KB such that C1 contains A and C2
contains ¬A, and adds the result to the KB. It then deletes
all clauses that contain a literal of A from the KB. If at some
point the empty clause is derived, the KB is unsatisfiable,
and the query formula (previously negated and added to the
KB) is therefore proven to be entailed by the KB. DP is in fact
just the variable elimination algorithm for the special case
of 0-1 potentials.
Modern propositional theorem provers use the DPLL
algorithm,5 a variant of DP that replaces the elimination
step with a splitting step: instead of eliminating all clauses
containing the chosen atom A, resolve all clauses in the KB
with A, simplify and recurse, and do the same with ¬A. If
both recursions fail, the KB is unsatisfiable. DPLL has linear space complexity, compared to exponential for Davis-Putnam, and is the basis of the algorithms in this paper.
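A compact sketch of Algorithm 1 on top of a DPLL-style splitting test is shown below. The clause encoding (frozensets of signed integers, with a negative sign for negation), the naive atom choice, and the omission of unit propagation are simplifying assumptions made for the illustration.

def dpll(clauses):
    # return True iff the set of clauses is satisfiable (splitting step only)
    if not clauses:
        return True                       # every clause satisfied
    if frozenset() in clauses:
        return False                      # empty clause: unsatisfiable
    atom = abs(next(iter(next(iter(clauses)))))    # pick an atom to split on
    for literal in (atom, -atom):         # try A, then not A
        reduced = set()
        for c in clauses:
            if literal in c:
                continue                  # clause satisfied, drop it
            reduced.add(c - {-literal})   # remove the falsified literal
        if dpll(frozenset(reduced)):
            return True
    return False

def entails(kb_cnf, negated_query_cnf):
    # Algorithm 1: K entails Q iff K together with CNF(not Q) is unsatisfiable
    return not dpll(frozenset(kb_cnf) | frozenset(negated_query_cnf))

# Example: K = {A, A => B} entails B.  Encode A = 1, B = 2; CNF(not B) = {{-2}}.
kb = [frozenset({1}), frozenset({-1, 2})]
print(entails(kb, [frozenset({-2})]))     # True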
First-order logic inherits all the features of propositional logic, and in addition allows atoms to have internal
structure. An atom is now a predicate symbol, representing
a relation in the domain of interest, followed by a paren
thesized list of variables and/or constants, representing
objects. For example, Friends(Anna, x) is an atom. A ground
atom has only constants as arguments. First-order logic
has two additional connectives, ∀ (universal quantification) and ∃ (existential quantification). For example, ∀x Friends(Anna, x) means that Anna is friends with everyone, and ∃x Friends(Anna, x) means that Anna has at
least one friend. In this paper, we assume that domains
are finite (and therefore function-free) and that there is a
one-to-one mapping between constants and objects in the
domain (Herbrand interpretations).
As long as the domain is finite, first-order theorem proving can be carried out by propositionalization: creating atoms
from all possible combinations of predicates and constants,
and applying a propositional theorem prover. However, this
is potentially very inefficient. A more sophisticated alternative
is first-order resolution,20 which proceeds by resolving pairs
of clauses and adding the result to the KB until the empty
clause is derived. Two first-order clauses can be resolved if
they contain complementary literals that unify, that is, there
is a substitution of variables by constants or other variables
that makes the two literals identical up to the negation sign.
Conversion to CNF is carried out as before, with the additional
step of removing all existential quantifiers by a process called
skolemization.
First-order logic allows knowledge to be expressed more
concisely than propositional logic. For example, the rules
of chess can be stated in a few pages in first-order logic,
but require hundreds of thousands of pages in propositional logic.
Probabilistic logical languages extend this power to uncertain domains. The goal of this paper is to similarly extend
the power of first-order theorem proving.
3. PROBLEM DEFINITION
Following Nilsson,17 we define PTP as the problem of determining the probability of an arbitrary query formula Q given
a set of logical formulas Fi and their probabilities P(Fi). For
the problem to be well defined, the probabilities must be
consistent, and Nilsson17 provides a method for verifying consistency. Probabilities estimated by maximum likelihood
from an observed world are guaranteed to be consistent.
In general, a set of formula probabilities does not specify
a complete joint distribution over the atoms appearing in
them, but one can be obtained by making the maximum
entropy assumption: the distribution contains no information beyond that specified by the formula probabilities.17
Finding the maximum entropy distribution given a set of formula probabilities is equivalent to learning a maximumlikelihood log-linear model whose features are the formulas;
many algorithms for this purpose are available (iterative
scaling, gradient descent, etc.).
We call a set of formulas and their probabilities together
with the maximum entropy assumption a probabilistic knowledge base (PKB). Equivalently, a PKB can be directly defined
as a log-linear model with the formulas as features and the
corresponding weights or potential values. Potentials are
the most convenient form, since they allow determinism
(0-1 probabilities) without recourse to infinity. If x is a world
and φi(x) is the potential corresponding to formula Fi, by convention (and without loss of generality) we let φi(x) = 1 if Fi is true, and φi(x) = φi ≥ 0 if the formula is false. Hard formulas have φi = 0 and soft formulas have φi > 0. In order to compactly
subsume standard probabilistic models, we interpret a universally quantified formula as a set of features, one for each
grounding of the formula, as in Markov logic.10 A PKB {(Fi, φi)}
thus represents the joint distribution
P(x) = (1/Z) ∏i φi^ni(x)   (1)
where ni(x) is the number of false groundings of Fi in x, and
Z is a normalization constant (the partition function). We can
now define PTP succinctly as follows:
Probabilistic theorem proving (PTP)
Input: Probabilistic KB K and query formula Q
Output: P(Q|K)
If all formulas are hard, a PKB reduces to a standard logical KB. Determining whether a KB K logically entails a
query Q is equivalent to determining whether P(Q|K) = 1.10
Graphical models can be easily converted into equivalent
PKBs.3 Conditioning on evidence is done by adding the corresponding hard ground atoms to the PKB, and the conditional marginal of an atom is computed by issuing the atom
as the query. Thus PTP has both logical theorem proving and
inference in graphical models as special cases.
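As a concrete reading of the distribution (1), the sketch below computes the unnormalized weight of a world as the product of φi raised to the number of false groundings of Fi, together with a brute-force partition function for tiny domains. The encoding of formulas as Python predicates is an illustrative assumption, not the paper's data structures.

import itertools

def world_weight(world, formulas):
    # formulas: list of (phi, groundings, check); check(world, g) tells whether
    # grounding g of the formula is true in the world. phi == 0 encodes a hard
    # formula, and 0 ** 0 == 1 in Python, so a satisfied hard formula contributes 1.
    weight = 1.0
    for phi, groundings, check in formulas:
        n_false = sum(1 for g in groundings if not check(world, g))
        weight *= phi ** n_false
    return weight

def partition_function(atoms, formulas):
    # brute-force Z: sum of world weights over all truth assignments (tiny domains only)
    Z = 0.0
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        Z += world_weight(world, formulas)
    return Z

# Hypothetical example: Smokes(A) => Cancer(A) with potential 0.5 could be encoded as
# (0.5, [("A",)], lambda w, g: (not w[("Smokes", g[0])]) or w[("Cancer", g[0])]).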
In this paper, we solve PTP by reducing it to lifted weighted
model counting. Model counting is the problem of determining the number of worlds that satisfy a KB. Weighted model
counting can be defined as follows.3 Assign a weight to each
literal, and let the weight of a world be the product of the
weights of the literals that are true in it. Then the weighted model count of the KB is the sum of the weights of the worlds in which all its clauses are true.
[Figure: the inference problems considered in this paper, arranged along the weighted, counting, and lifted dimensions: TP0 = SAT, MC, LMC, MPE = WSAT, PI = WMC, LWSAT, TP1, and PTP = LWMC.]
Algorithm 4 WMC(CNF C, weights W)
// Base case
if all clauses in C are satisfied then
   return ∏A∈A(C) (WA + W¬A)
if C has an empty unsatisfied clause then return 0
// Decomposition step
if C can be partitioned into CNFs C1, ..., Ck sharing no atoms then
   return ∏i=1..k WMC(Ci, W)
// Splitting step
Choose an atom A
return WA · WMC(C|A; W) + W¬A · WMC(C|¬A; W)
Theorem 2. Algorithm WMC(C,W) correctly computes the
weighted model count of CNF C under literal weights W.
Proof sketch. If all clauses in C are satisfied, all assignments to the atoms in C satisfy it, and the corresponding
total weight is ∏A∈A(C) (WA + W¬A). If C has an empty unsatisfied clause, it is unsatisfiable given the truth assignment so
far, and the corresponding weighted count is 0. If two CNFs
share no atoms, the WMC of their conjunction is the product
of the WMCs of the individual CNFs. Splitting on an atom
produces two disjoint sets of worlds, and the total WMC is
therefore the sum of the WMCs of the two sets, weighted by
the corresponding literal's weight.
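The following Python sketch mirrors the base and splitting steps of Algorithm 4; the decomposition step is omitted for brevity (splitting alone already gives the correct count, only more slowly), and the clause representation is an assumption made for the example.

def wmc(clauses, unassigned, w):
    # clauses: list of frozensets of signed integer literals, with True marking a
    # clause that is already satisfied; unassigned: frozenset of atoms not yet set;
    # w: weight of every literal, positive and negative
    active = [c for c in clauses if c is not True]
    if not active:                        # base case: all clauses satisfied
        result = 1.0
        for a in unassigned:              # remaining atoms are unconstrained
            result *= w[a] + w[-a]
        return result
    if any(not c for c in active):        # empty unsatisfied clause
        return 0.0
    a = abs(next(iter(active[0])))        # splitting step: choose an atom
    total = 0.0
    for lit in (a, -a):                   # condition on A and on not A
        conditioned = [True if (c is True or lit in c) else c - {-lit}
                       for c in clauses]
        total += w[lit] * wmc(conditioned, unassigned - {a}, w)
    return total

# Example: plain model counting (all weights 1) of the single clause (A or B).
w = {1: 1.0, -1: 1.0, 2: 1.0, -2: 1.0}
print(wmc([frozenset({1, 2})], frozenset({1, 2}), w))    # 3.0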
5. FIRST-ORDER CASE
We now lift PTP to the first-order level. We consider first the
case of PKBs without existential quantifiers. Algorithms 2
and 3 remain essentially unchanged, except that formulas, literals and CNF conversion are now first-order. In particular, for Theorem 1 to remain true, each new atom Ai in
Algorithm 2 must now consist of a new predicate symbol followed by a parenthesized list of the variables and constants
in the corresponding formula Fi. The proof of the first-order
version of the theorem then follows by propositionalization.
Lifting Algorithm 4 is the focus of the rest of this section.
We begin with some necessary definitions. A substitution
constraint is an expression of the form x = y or x ≠ y, where
x is a variable and y is either a variable or a constant. (Much
richer substitution constraint languages are possible, but
we adopt the simplest one that allows PTP to subsume both
standard function-free theorem proving and first-order variable elimination.) Two literals are unifiable under a set of
substitution constraints S if there exists at least one ground
literal consistent with S that is an instance of both, up to
the negation sign. A (C, S) pair, where C is a first-order CNF
whose variables have been standardized apart and S is a
set of substitution constraints, represents the ground CNF
obtained by replacing each clause in C with the conjunction
of its groundings that are consistent with the constraints
in S. For example, using upper case for constants and lower
case for variables, and assuming that the PKB contains only
two constants A and B, if C = R(A, B) ∧ (R(x, y) ∨ S(y, z)) and S = {x = y, z ≠ A}, (C, S) represents the ground CNF R(A, B) ∧ (R(A, A) ∨ S(A, B)) ∧ (R(B, B) ∨ S(B, B)). Clauses with
equality substitution constraints can be abbreviated in the
obvious way (e.g., T(x, y, z) with x = y and z = C can be abbreviated as T(x,x,C)).
We lift the base case, decomposition step, and splitting
step of Algorithm 4 in turn. The result is shown in Algorithm 5.
In addition to the first-order CNF C and weights on first-order
literals W, LWMC takes as an argument an initially empty set
of substitution constraints S which, similar to logical theorem proving, is extended along each branch of the inference
as the algorithm progresses.
5.1. Lifting the base case
The base case changes only by raising each first-order atom
A's sum of weights to the power nA(S), the number of groundings of
A compatible with the constraints in S. This is necessary and
(3)
Rules for identifying lifted decompositions can be derived
in a straightforward manner from the inversion argument in
de Salvo Braz8 and the power rule in Jha et al.15 An example of
such a rule is given in the definition and proposition below.
Definition 2. A set of variables X = {x1, ... , xm} is called a
decomposer of a CNF C if it satisfies the following three properties: (i) for each clause Cj in C, there is exactly one variable xi in
X such that xi appears in all atoms in Cj; (ii) if xi ∈ X appears
as an argument of predicate R (say at position k in an atom having predicate symbol R), then all variables in all clauses that
Throughout this paper, when we say that two clauses are identical, we mean
that they are identical up to a renaming of constants and variables.
where ti and fi are the number of true and false atoms in the respective branch of the split, and Si is S augmented with the substitution constraints required to form the corresponding conditioned CNF.
Again, we can derive rules for identifying a lifted split
by using the counting arguments in de Salvo Braz8 and the
generalized binomial rule in Jha et al.15 We omit the details
for lack of space. In the worst case, lifted splitting defaults
to splitting on a ground atom. In most inference problems,
the PKB will contain many hard ground unit clauses (the evidence). Splitting on the corresponding ground atoms then
reduces to a single recursive call to LWMC for each atom.
In general, the atom to split on in Algorithm 5 should be
chosen with the goal of yielding lifted decompositions in
the recursive calls (e.g., using lifted versions of the propositional heuristics23).
Notice that the lifting schemes used for decomposition
and splitting in Algorithm 5 by no means exhaust the space of
possible probabilistic lifting rules. For example, Jha et al.15
and Milch et al.16 contain examples of other lifting rules.
Searching for new probabilistic lifted inference rules, and
positive and negative theoretical results about what can be
lifted, looks like a fertile area for future research.
The theorem below follows from Theorem 2 and the arguments above.
Theorem 3. Algorithm LWMC(C, ∅, W) correctly computes the
weighted model count of CNF C under literal weights W.
5.4. Extensions and discussion
Although most probabilistic logical languages do not include
existential quantification, handling it in PTP is desirable for
the sake of logical completeness. This is complicated by the
fact that skolemization is not sound for model counting
(skolemization will not change satisfiability but can change
the model count), and so cannot be applied. The result of
conversion to CNF is now a conjunction of clauses with universally and/or existentially quantified variables (e.g., [∀x∃y (R(x, y) ∨ S(y))] ∧ [∃u∀v∃w T(u, v, w)]). Algorithm 5 now
needs to be able to handle clauses of this form. If no universal quantifier appears nested inside an existential one, this
is straightforward, since in this case an existentially quantified clause is just a compact representation of a longer one.
For example, if the domain is {A, B, C}, the unit clause ∀x∃y R(x, y) represents the clause ∀x (R(x, A) ∨ R(x, B) ∨ R(x, C)).
The decomposition and splitting steps in Algorithm 5 are
both easily extended to handle such clauses without loss
of lifting (and the base case does not change). However, if
universals appear inside existentials, a first-order clause
now corresponds to a disjunction of conjunctions of propositional clauses. For example, if the domain is {A, B}, ∃x∀y (R(x, y) ∨ S(x, y)) represents ((R(A, A) ∨ S(A, A)) ∧ (R(A, B) ∨ S(A, B))) ∨ ((R(B, A) ∨ S(B, A)) ∧ (R(B, B) ∨ S(B, B))). Whether
these cases can be handled without loss of lifting remains
an open question.
Several optimizations of the basic LWMC procedure in
Algorithm 5 can be readily ported from the algorithms PTP
generalizes. These optimizations can tremendously improve
the performance of LWMC.
Unit Propagation. When LWMC splits on atom A, the
clauses in the current CNF are resolved with the unit
clauses A and ¬A. This results in deleting false atoms, which
may produce new unit clauses. The idea in unit propagation is to in turn resolve all clauses in the new CNF with
all the new unit clauses, and continue to do this until no
further unit resolutions are possible. This often produces
a much smaller CNF, and is a key component of DPLL
that can also be used in LWMC. Other techniques used
6. APPROXIMATE INFERENCE
LWMC lends itself readily to Monte Carlo approximation, by
replacing the sum in the splitting step with a random choice of
one of its terms, calling the algorithm many times, and averaging the results. This yields the first lifted sampling algorithm.
We first apply this importance sampling approach21 to
WMC, yielding the MC-WMC algorithm. The two algorithms
differ only in the last line. Let Q(A|C, W) denote the importance
or proposal distribution over A given the current CNF C and
literal weights W. Then we return MC-WMC(C|A; W)/Q(A|C, W) with probability Q(A|C, W), or MC-WMC(C|¬A; W)/(1 - Q(A|C, W)) otherwise. By importance sampling theory21 and by the law of
total expectation, it is easy to show that:
Theorem 6. If Q(A|C, W) satisfies WMC(C|A; W) > 0 ⇒ Q(A|C, W) > 0 for all atoms A and their true and false assignments, then the
expected value of the quantity output by MC-WMC(C, W) equals
WMC(C, W). In other words, MC-WMC(C, W) yields an unbiased
estimate of WMC(C, W).
An estimate of WMC(C, W) is obtained by running MC-WMC(C, W) multiple times and averaging the results. By linearity of expectation, the running average is also unbiased. It
is well known that the accuracy of the estimate is inversely proportional to its variance.21 The variance can be reduced by either
running MC-WMC more times or by choosing Q that is as close
as possible to the posterior distribution P (or both). Thus, for
MC-WMC to be effective in practice, at each point, given the current CNF C, we should select Q(A|C, W) that is as close as possible to the marginal probability distribution of A w.r.t. C and W.
In the presence of hard formulas, MC-WMC suffers from the
rejection problem12: it may return a zero. We can solve this
problem by either backtracking when a sample is rejected or by
generating samples from the backtrack-free distribution.12
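The sketch below turns the WMC sketch above into an MC-WMC estimator by sampling one branch of the split instead of summing over both; the proposal Q is taken to be a fair coin purely for illustration, which satisfies the condition of Theorem 6 whenever all literal weights are positive.

import random

def mc_wmc(clauses, unassigned, w, q=0.5):
    # same clause representation as the wmc sketch; returns one unbiased sample
    active = [c for c in clauses if c is not True]
    if not active:
        result = 1.0
        for a in unassigned:
            result *= w[a] + w[-a]
        return result
    if any(not c for c in active):
        return 0.0                    # a rejected sample when hard formulas are violated
    a = abs(next(iter(active[0])))
    if random.random() < q:           # sample a branch instead of summing over both
        lit, prob = a, q
    else:
        lit, prob = -a, 1.0 - q
    conditioned = [True if (c is True or lit in c) else c - {-lit} for c in clauses]
    return (w[lit] / prob) * mc_wmc(conditioned, unassigned - {a}, w, q)

# Averaging many runs gives an unbiased estimate of the weighted model count.
w = {1: 1.0, -1: 1.0, 2: 1.0, -2: 1.0}
samples = [mc_wmc([frozenset({1, 2})], frozenset({1, 2}), w) for _ in range(10000)]
print(sum(samples) / len(samples))    # close to 3.0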
Next, we present a lifted version of MC-WMC, which is
obtained by replacing the (last line of the) lifted splitting
step in LWMC by the following lifted sampling step:
return
MC-LWMC(C|, Si, W)
with 20% of the atoms set as observed. We see that FOVE is
unable to solve any problems after the number of objects is
increased beyond 100 because it runs out of memory. PTP,
on the other hand, solves all problems in less than 100s.
7.2. Approximate inference
In this subsection, we compare the performance of MC-LWMC, MC-WMC, lifted belief propagation,24 and MC-SAT19
on two domains: entity resolution (Cora) and collective classification. The Cora dataset contains 1295 citations to 132
different research papers. The inference task here is to detect
duplicate citations, authors, titles, and venues. The collective
classification dataset consists of about 3000 query atoms.
Since computing the exact posterior marginals is infeasible in these domains, we used the following evaluation
method. We partitioned the data into two equal-sized sets:
evidence set and test set. We then computed the probability
of each ground atom in the test set given all atoms in the evidence set using the four inference algorithms. We measure
the error using negative log-likelihood of the data according
to the inference algorithms (the negative log-likelihood is a
sampling approximation of the KL divergence to the data-generating distribution, shifted by its entropy).
Figure 2. (a) Impact of increasing the amount of evidence on the
time complexity of FOVE and PTP in the link prediction domain.
The number of objects in the domain is 100. (b) Impact of increasing
the number of objects on the time complexity of FOVE and PTP in the
link prediction domain, with 20% of the atoms set as evidence.
8. CONCLUSION
Probabilistic theorem proving (PTP) combines theorem
proving and probabilistic inference. This paper proposed
an algorithm for PTP based on reducing it to lifted weighted
model counting, and showed both theoretically and empirically that it has significant advantages compared to previous
lifted probabilistic inference algorithms. An implementation of PTP is available in the Alchemy 2.0 system (available
at https://code.google.com/p/alchemy-2/).
Directions for future research include: extension of PTP to
infinite, non-Herbrand first-order logic; new lifted inference
rules; theoretical analysis of liftability; porting to PTP more
speedup techniques from logical and probabilistic inference;
lifted splitting heuristics; better handling of existentials;
variational PTP algorithms; better importance distributions;
approximate lifting; answering multiple queries simultaneously; applications; etc.
[Figure plots: runtimes of PTP and FOVE (time in seconds), and comparison of MC-LWMC, MC-WMC, MC-SAT, and Lifted-BP over time (minutes).]
Acknowledgments
This research was funded by the ARO MURI grant
W911NF-08-1-0242, AFRL contracts FA8750-09-C-0181 and
FA8750-14-C-0021, DARPA contracts FA8750-05-2-0283,
FA8750-14-C-0005, FA8750-07-D-0185, HR0011-06-C-0025,
HR0011-07-C-0060, and NBCH-D030010, NSF grants IIS-0534881 and IIS-0803481, and ONR grant N00014-08-1-0670.
The views and conclusions contained in this document are
those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or
implied, of ARO, DARPA, NSF, ONR, or the U.S. Government.
References
1. Bacchus, F. Representing and Reasoning
with Probabilistic Knowledge. MIT
Press, Cambridge, MA, 1990.
2. Bayardo, R.J., Jr., Pehoushek, J.D.
Counting models using connected
components. In Proceedings of the
Seventeenth National Conference on
Artificial Intelligence (2000), 157162.
3. Chavira, M., Darwiche, A. On
probabilistic inference by weighted
model counting. Artif. Intell. 172, 6-7 (2008), 772-799.
4. Darwiche, A. Recursive conditioning.
Artif. Intell. 126, 1-2 (February 2001), 5-41.
5. Davis, M., Logemann, G., Loveland, D. A
machine program for theorem proving.
Commun. ACM 5 (1962), 394-397.
6. Davis, M., Putnam, H. A computing
procedure for quantification theory.
J. Assoc. Comput. Mach. 7, 3 (1960),
201-215.
Technical Perspective
Mesa Takes Data Warehousing to New Heights
By Sam Madden
By Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai, Shuo Wu, Sandeep Dhoot, Abhilash Rajesh
Kumar, Ankur Agiwal, Sanjay Bhansali, Mingsheng Hong, Jamie Cameron, Masood Siddiqi, David Jones, Jeff Shute, Andrey
Gubarev, Shivakumar Venkataraman, and Divyakant Agrawal
Abstract
Mesa is a highly scalable analytic data warehousing system
that stores critical measurement data related to Google's
Internet advertising business. Mesa is designed to satisfy a
complex and challenging set of user and systems requirements, including near real-time data ingestion and retrieval,
as well as high availability, reliability, fault tolerance, and
scalability for large data and query volumes. Specifically,
Mesa handles petabytes of data, processes millions of row
updates per second, and serves billions of queries that fetch
trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable
query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports
the performance and scale that it achieves.
1. INTRODUCTION
Google runs an extensive advertising platform across multiple channels that serves billions of advertisements (or ads)
every day to users all over the globe. Detailed information
associated with each served ad, such as the targeting criteria, number of impressions and clicks, etc., is recorded
and processed in real time. This data is used extensively at
Google for different use cases, including reporting, internal auditing, analysis, billing, and forecasting. Advertisers
gain fine-grained insights into their advertising campaign
performance by interacting with a sophisticated front-end
service that issues online and on-demand queries to the
underlying data store. Google's internal ad serving platforms use this data in real time, determining budgeting
and ad performance to enhance ad serving relevancy. As
the Google ad platform continues to expand and as internal
and external customers request greater visibility into their
advertising campaigns, the demand for more detailed and
fine-grained information leads to tremendous growth in
the data size. The scale and business critical nature of this
data result in unique technical and operational challenges
for processing, storing, and querying. The requirements for
such a data store are:
Atomic Updates. A single user action may lead to multiple
updates at the relational data level, affecting thousands of
consistent views, defined over a set of metrics (e.g., clicks
and cost) across a set of dimensions (e.g., advertiser and
country). It must not be possible to query the system in a
state where only some of the updates have been applied.
Consistency and Correctness. For business and legal reasons, this system must return consistent and correct data.
We require strong consistency and repeatable query results
even if a query involves multiple datacenters.
Availability. The system must not have any single point of
failure. There can be no downtime in the event of planned or
unplanned maintenance or failures, including outages that
affect an entire datacenter or a geographical region.
Near Real-Time Update Throughput. The system must support continuous updates, both new rows and incremental
updates to existing rows, with the update volume on the
order of millions of rows updated per second. These updates
should be available for querying consistently across different views and datacenters within minutes.
Query Performance. The system must support latency-sensitive
users serving live customer reports with very low latency
requirements and batch extraction users requiring very
high throughput. Overall, the system must support point
queries with 99th percentile latency in the hundreds of milliseconds and overall query throughput of trillions of rows
fetched per day.
Scalability. The system must be able to scale with the growth
in data size and query volume. For example, it must support
trillions of rows and petabytes of data. The update and query
performance must hold even as these parameters grow
significantly.
Online Data and Metadata Transformation. In order to support new feature launches or change the granularity of existing data, clients often require transformations of the data
schema or modifications to existing data values. These
changes must not interfere with the normal query and
update operations.
Mesa is Googles solution to these technical and operational challenges for business critical data. Mesa is a distributed, replicated, and highly available data processing,
storage, and query system for structured data. Mesa ingests
data generated by upstream services, aggregates and persists the data internally, and serves the data via user queries.
Even though this paper mostly discusses Mesa in the context
The original version of this paper, entitled "Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing," was published in the Proceedings of the VLDB Endowment 7, 12 (Aug. 2014), 1259-1270.
of ads metrics, Mesa is a generic data warehousing solution
that satisfies all of the above requirements.
Mesa leverages common Google infrastructure and services, such as Colossus (Google's distributed file system),7
BigTable,3 and MapReduce.6 To achieve storage scalability
and availability, data is horizontally partitioned and replicated. Updates may be applied at the granularity of a single
table or across many tables. To achieve consistent and repeatable queries during updates, the underlying data is multiversioned. To achieve update scalability, data updates are
batched, assigned a new version number, and periodically
(e.g., every few minutes) incorporated into Mesa. To achieve
update consistency across multiple data centers, Mesa uses a
distributed synchronization protocol based on Paxos.11
In contrast, commercial DBMS vendors4, 14, 17 often
address the scalability challenge through specialized hardware and sophisticated parallelization techniques. Internet
services companies1, 12, 16 address this challenge using a combination of new technologies: key-value stores,3, 8, 13 columnar storage, and the MapReduce programming paradigm.
However, many of these systems are designed to support
bulk load interfaces to import data and can require hours to
run. From that perspective, Mesa is very similar to an OLAP
system. Mesa's update cycle is minutes, and it continuously
processes hundreds of millions of rows. Mesa uses multiversioning to support transactional updates and queries
across tables. A system that is close to Mesa in terms of supporting both dynamic updates and real-time querying of
transactional data is Vertica.10 However, to the best of our
knowledge, none of these commercial products or production systems has been designed to manage replicated data
across multiple datacenters. Also, none of Google's other in-house data solutions2, 3, 5, 15 support the data size and update
volume required to serve as a data warehousing platform
supporting Google's advertising business.
Mesa achieves the required update scale by processing
updates in batches. Mesa is, therefore, unique in that application data is redundantly (and independently) processed at
all datacenters, while the metadata is maintained using synchronous replication. This approach minimizes the synchronization overhead across multiple datacenters in addition to
providing additional robustness in face of data corruption.
2. MESA STORAGE SUBSYSTEM
Data in Mesa is continuously generated and is one of the
largest and most valuable data sets at Google. Analysis queries on this data can range from simple queries such as,
"How many ad clicks were there for a particular advertiser on a specific day?" to a more involved query scenario such as, "How many ad clicks were there for a particular advertiser matching the keyword 'decaf' during the first week of October between 8:00 am and 11:00 am that were displayed on google.com for users in a specific geographic location using a mobile device?"
Data in Mesa is inherently multi-dimensional, capturing
all the microscopic facts about the overall performance of
Google's advertising platform in terms of different dimensions. These facts typically consist of two types of attributes:
dimensional attributes (which we call keys) and measure
(a) Mesa table A
Date | PublisherId | Country | Clicks | Cost
2013/12/31 | 100 | US | 10 | 32
2014/01/01 | 100 | US | 205 | 103
2014/01/01 | 200 | UK | 100 | 50

(b) Mesa table B
Date | AdvertiserId | Country | Clicks | Cost
2013/12/31 | 1 | US | 10 | 32
2014/01/01 | 1 | US | 5 | 3
2014/01/01 | 2 | UK | 100 | 50
2014/01/01 | 2 | US | 200 | 100

(c) Mesa table C
AdvertiserId | Country | Clicks | Cost
1 | US | 15 | 35
2 | UK | 100 | 50
2 | US | 200 | 100
Date
2013/12/31
2014/01/01
2014/01/01
PublisherId
100
100
200
Country
US
US
UK
Clicks
+10
+150
+40
Cost
+32
+80
+20
Clicks
+10
+40
+150
Cost
+32
+20
+80
Clicks
+55
+60
Cost
+23
+30
Clicks
+5
+60
+50
Cost
+3
+30
+20
AdvertiserId
1
2
2
Country
US
UK
US
PublisherId
100
200
Country
US
UK
AdvertiserId
1
2
2
Country
US
UK
US
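As a quick arithmetic check of the figures above, the short Python sketch below (plain illustrative code, not Mesa's) applies the two updates for table B from Figure 2 to an initially empty table and then rolls the result up by (AdvertiserId, Country); it reproduces tables B and C of Figure 1.

from collections import defaultdict

def add2(a, b):
    # Aggregate (Clicks, Cost) value pairs element-wise.
    return (a[0] + b[0], a[1] + b[1])

# Updates to table B from Figure 2; key = (Date, AdvertiserId, Country).
update_v0 = {('2013/12/31', 1, 'US'): (10, 32),
             ('2014/01/01', 2, 'UK'): (40, 20),
             ('2014/01/01', 2, 'US'): (150, 80)}
update_v1 = {('2014/01/01', 1, 'US'): (5, 3),
             ('2014/01/01', 2, 'UK'): (60, 30),
             ('2014/01/01', 2, 'US'): (50, 20)}

table_b = defaultdict(lambda: (0, 0))
for update in (update_v0, update_v1):
    for key, value in update.items():
        table_b[key] = add2(table_b[key], value)

# Roll table B up by (AdvertiserId, Country) to obtain table C.
table_c = defaultdict(lambda: (0, 0))
for (date, advertiser, country), value in table_b.items():
    table_c[(advertiser, country)] = add2(table_c[(advertiser, country)], value)

print(dict(table_b))  # matches Mesa table B in Figure 1
print(dict(table_c))  # {(1, 'US'): (15, 35), (2, 'UK'): (100, 50), (2, 'US'): (200, 100)}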
The delta version [V1, V2] for a singleton corresponding to
an update with version number n is denoted by setting
V1 = V2 = n.
A delta [V1, V2] and another delta [V2 + 1, V3] can be
aggregated to produce the delta [V1, V3], simply by merging row keys and aggregating values accordingly. (As discussed in Section 2.4, the rows in a delta are sorted by key,
and, therefore, two deltas can be merged in linear time.)
The correctness of this computation follows from associativity of the aggregation function F. Notably, correctness does not depend on commutativity of F, as whenever
Mesa aggregates two values for a given key, the delta versions are always of the form [V1, V2] and [V2 + 1, V3], and
the aggregation is performed in the increasing order of
versions.
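The merge described here is an ordinary sorted-list merge. The following illustrative Python sketch (assuming each delta is a list of (key, value) rows sorted by key, and F is the associative aggregation function) combines a delta [V1, V2] with the adjacent delta [V2 + 1, V3] in linear time.

def merge_deltas(delta_lo, delta_hi, F):
    """Merge key-sorted deltas [V1, V2] and [V2 + 1, V3] into [V1, V3]."""
    merged = []
    i = j = 0
    while i < len(delta_lo) and j < len(delta_hi):
        klo, vlo = delta_lo[i]
        khi, vhi = delta_hi[j]
        if klo < khi:
            merged.append((klo, vlo))
            i += 1
        elif khi < klo:
            merged.append((khi, vhi))
            j += 1
        else:
            # Same key: apply F with the older value first, since F need not
            # be commutative; versions are always aggregated in increasing order.
            merged.append((klo, F(vlo, vhi)))
            i += 1
            j += 1
    merged.extend(delta_lo[i:])
    merged.extend(delta_hi[j:])
    return merged

Each row of either input is examined once, so the merge runs in time linear in the total number of rows.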
Mesa allows users to query at a particular version for only a limited time period (e.g., 24 hours). This implies that versions that are older than this time period can be aggregated into a base delta (or, more simply, a base) with version [0, B] for some base version B ≥ 0, and after that any other deltas [V1, V2] with 0 ≤ V1 ≤ V2 ≤ B can be deleted. This process is called base compaction, and
Mesa performs it concurrently and asynchronously with
respect to other operations (e.g., incorporating updates
and answering queries).
Note that for compaction purposes, the time associated with an update version is the time that version was
generated, which is independent of any time series information that may be present in the data. For example,
for the Mesa tables in Figure 1, the data associated with
2014/01/01 is never removed. However, Mesa may reject
a query to the particular depicted version after some
time. The date in the data is just another attribute and is
opaque to Mesa.
With base compaction, to answer a query for version
number n, we could aggregate the base delta [0, B] with
all singleton deltas [B + 1, B + 1], [B + 2, B + 2], ... , [n, n],
and then return the requested rows. Even though we run
base compaction frequently (e.g., every day), the number
of singletons can still easily approach hundreds (or even a
thousand), especially for update intensive tables. In order to
support more efficient query processing, Mesa maintains a
set of cumulative deltas D of the form [U, V] with B < U < V
through a process called cumulative compaction. These deltas
can be used to find a spanning set of deltas {[0, B], [B + 1, V1],
[V1 + 1, V2], ... , [Vk + 1, n]} for a version n that requires significantly less aggregation than simply using the singletons.
Of course, there is a storage and processing cost associated
with the cumulative deltas, but that cost is amortized over all
operations (particularly queries) that are able to use those
deltas instead of singletons.
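One simple way to realize a spanning set, sketched below in illustrative Python (not Mesa's implementation), is to repeatedly pick, among the available deltas that start where the previous one ended, the one that extends furthest without passing the requested version, falling back to singletons when no cumulative delta applies.

def spanning_set(deltas, n):
    """Pick deltas [0, B], [B + 1, V1], ..., [Vk + 1, n] covering versions 0..n.

    deltas: available (start, end) version ranges, including the base,
    cumulative deltas, and singleton deltas [v, v]."""
    chosen = []
    next_start = 0
    while next_start <= n:
        candidates = [(s, e) for (s, e) in deltas if s == next_start and e <= n]
        if not candidates:
            raise ValueError("no delta starts at version %d" % next_start)
        best = max(candidates, key=lambda r: r[1])  # furthest-reaching delta
        chosen.append(best)
        next_start = best[1] + 1
    return chosen

With the deltas of Figure 3 and n = 91, for example, this returns [(0, 60), (61, 90), (91, 91)], so a query aggregates three deltas instead of the base plus 31 singletons.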
The delta compaction policy determines the set of deltas
maintained by Mesa at any point in time. Its primary purpose
is to balance the processing that must be done for a query,
the latency with which an update can be incorporated into a
Mesa delta, and the processing and storage costs associated
with generating and maintaining deltas. More specifically,
the delta policy determines: (i) what deltas (excluding the
singleton) must be generated prior to allowing an update
Figure 3. A two-level delta compaction policy: a base delta [0, 60] that is updated every day, cumulative deltas [61, 70], [61, 80], and [61, 90] that are updated every 10 versions, and singleton deltas [61, 61], [62, 62], ..., [91, 91], [92, 92] that are updated in near real-time.

Figure 4. Mesa's controller/worker framework: controllers receive updates and coordinate update workers, compaction workers, schema change workers, checksum workers, and a garbage collector.
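The two-level policy pictured in Figure 3 can be described compactly as a rule for which deltas exist at any moment. The sketch below (illustrative Python; the periods are parameters of the policy, and the values 60 and 10 merely mirror the figure) generates that delta set for a given latest version.

def two_level_deltas(latest_version, base_period=60, cumulative_period=10):
    """Delta set maintained by a two-level compaction policy.

    - A base [0, B] covers everything up to the most recent base compaction.
    - Cumulative deltas aggregate every `cumulative_period` versions above B.
    - Every version after B is also kept as a singleton."""
    B = (latest_version // base_period) * base_period
    deltas = [(0, B)]
    v = B + cumulative_period
    while v <= latest_version:
        deltas.append((B + 1, v))
        v += cumulative_period
    deltas.extend((s, s) for s in range(B + 1, latest_version + 1))
    return deltas

For latest_version = 92 this yields exactly the deltas of Figure 3, and a spanning-set computation like the one sketched earlier answers a query at version 92 from just [(0, 60), (61, 90), (91, 91), (92, 92)].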
Mesa is resilient to controller failures, since a new controller
can reconstruct the state prior to the failure from the metadata in the BigTable.
Query subsystem. Mesa's query subsystem consists of query servers, illustrated in Figure 5. These servers receive user queries, look up table metadata, determine the set of files storing the required data, perform on-the-fly aggregation of this data, and convert the data from the Mesa internal format to the client protocol format before sending the data back to the client. Mesa's query servers provide a limited query engine with basic support for server-side conditional filtering and group-by aggregation. Higher-level database engines such as MySQL and F1 use these
primitives to provide richer SQL functionality such as join
queries.
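The server-side primitives mentioned above amount to a predicate filter followed by a grouped aggregation. A minimal sketch in Python (illustrative only; Mesa's engine is written against its own storage format) is:

def scan(rows, predicate, group_by, aggregate):
    """Filter rows, then aggregate their values by the requested key columns.

    rows:      iterable of (key_dict, value_tuple) pairs.
    predicate: function over key_dict, used for conditional filtering.
    group_by:  names of the key columns to keep.
    aggregate: associative function combining two value tuples."""
    groups = {}
    for key, value in rows:
        if not predicate(key):
            continue
        group = tuple(key[col] for col in group_by)
        groups[group] = aggregate(groups[group], value) if group in groups else value
    return groups

A request such as "clicks and cost per advertiser in the US" over rows shaped like table B becomes scan(rows, lambda k: k['Country'] == 'US', ('AdvertiserId',), aggregate) with an element-wise sum as the aggregate; richer operations such as joins are left to engines like MySQL and F1, as described above.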
Mesa clients have vastly different requirements and
performance characteristics. In some use cases, Mesa
receives queries directly from interactive reporting frontends, which have very strict low-latency requirements.
These queries are usually small but must be fulfilled
almost immediately. Mesa also receives queries from large
extraction-type workloads, such as offline daily reports,
that send millions of requests and fetch billions of rows
per day. These queries require high throughput and are
typically not latency sensitive (a few seconds/minutes of
latency is acceptable). Mesa ensures that these latency
and throughput requirements are met by requiring workloads to be labeled appropriately and then using those
labels in isolation and prioritization mechanisms in the
query servers.
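The article does not spell out the isolation and prioritization mechanisms, but the effect of workload labels can be pictured with a small sketch: queries carrying different labels go to different queues, and the dispatch loop never lets batch extraction work starve interactive work (the labels and the strict-priority rule here are illustrative assumptions).

import queue

pools = {'interactive': queue.Queue(), 'batch': queue.Queue()}

def submit(query, label):
    # Clients label their workload; the label decides which pool the query joins.
    pools[label].put(query)

def next_query():
    # Strict priority: drain interactive work before touching batch work.
    for label in ('interactive', 'batch'):
        try:
            return pools[label].get_nowait()
        except queue.Empty:
            continue
    return None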
The query servers for a single Mesa instance are organized into multiple sets, each of which is collectively
capable of serving all tables known to the controller. By
using multiple sets of query servers, it is easier to perform query server updates (e.g., binary releases) without unduly impacting clients, who can automatically
fail over to another set in the same (or even a different)
Mesa instance. Within a set, each query server is in principle capable of handling a query for any table. However,
for performance reasons, Mesa prefers to direct queries
over similar data (e.g., all queries over the same table)
to a subset of the query servers. This technique allows
Mesa to provide strong latency guarantees by allowing
for effective query server in-memory pre-fetching and
caching of data stored in Colossus, while also allowing for excellent overall throughput by balancing load across the query servers.
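One common way to obtain this kind of affinity, shown below as a hedged sketch (the article does not say how Mesa actually assigns tables to servers), is rendezvous hashing: hash the table name against each server in the set and send queries for that table to the highest-scoring few, which keeps their caches warm for that table while remaining stable when servers come and go.

import hashlib

def servers_for_table(table, servers, subset_size=3):
    """Deterministically pick a small subset of query servers for a table."""
    def score(server):
        digest = hashlib.sha256((table + '/' + server).encode()).hexdigest()
        return int(digest, 16)
    # Highest-random-weight (rendezvous) hashing.
    return sorted(servers, key=score, reverse=True)[:subset_size]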
Figure 5. Mesa's query processing framework: within a Mesa instance, query servers serve Mesa clients by reading data on Colossus and table metadata from the metadata BigTable.

Figure 6. Update processing in a multi-datacenter deployment, showing a committer, updates on Colossus, a versions database, and, within each datacenter (Datacenter 1, Datacenter 2), a controller with its metadata and update workers operating on Mesa data stored on Colossus.
general as possible with minimal assumptions about current and future applications.
Geo-Replication. Although we support geo-replication in
Mesa for high data and system availability, we have also
seen added benefit in terms of our day-to-day operations.
In Mesa's predecessor system, when there was a planned maintenance outage of a datacenter, we had to perform a laborious operations drill to migrate a 24×7 operational system to another datacenter. Today, such planned outages, which are fairly routine, have minimal impact on Mesa.
Data Corruption and Component Failures. Data corruption and
component failures are a major concern for systems at the
scale of Mesa. Data corruptions can arise for a variety of
reasons and it is extremely important to have the necessary
tools in place to prevent and detect them. Similarly, a faulty
component such as a floating-point unit on one machine
can be extremely hard to diagnose. Due to the dynamic
nature of the allocation of cloud machines to Mesa, it is
highly uncertain whether such a machine is consistently
active. Furthermore, even if the machine with the faulty unit
is actively allocated to Mesa, its usage may cause only intermittent issues. Overcoming such operational challenges
remains an open problem, but we discuss some techniques used by Mesa in Ref. 9.
Testing and Incremental Deployment. Mesa is a large, complex,
critical, and continuously evolving system. Simultaneously
maintaining new feature velocity and the health of the
production system is a crucial challenge. Fortunately, we
have found that by combining some standard engineering
practices with Mesa's overall fault-tolerant architecture
and resilience to data corruptions, we can consistently
deliver major improvements to Mesa with minimal risk.
Some of the techniques we use are: unit testing, private
developer Mesa instances that can run with a small fraction of production data, and a shared testing environment that runs with a large fraction of production data
from upstream systems. We are careful to incrementally
deploy new features across Mesa instances. For example,
when deploying a high-risk feature, we might deploy it to
one instance at a time. Since Mesa has measures to detect
data inconsistencies across multiple datacenters (along
with thorough monitoring and alerting on all components), we find that we can detect and debug problems
quickly.
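As an illustration of the kind of cross-datacenter consistency check alluded to above (the internals of Mesa's checksum workers are not described here, so this is only a sketch under assumed interfaces), each datacenter can compute an order-independent digest of a table at a given version, and the digests can then be compared.

import hashlib

def table_digest(rows):
    """Order-independent digest of a table's rows at one version.

    XOR-ing per-row hashes makes the result independent of scan order, so
    replicas can compute it with whatever access path is cheapest."""
    digest = 0
    for key, value in rows:
        row_hash = hashlib.sha256(repr((key, value)).encode()).hexdigest()
        digest ^= int(row_hash, 16)
    return digest

def replicas_consistent(rows_by_datacenter):
    """True if every datacenter's copy of the table hashes to the same digest."""
    digests = {dc: table_digest(rows) for dc, rows in rows_by_datacenter.items()}
    return len(set(digests.values())) == 1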
5. MESA PRODUCTION METRICS
In this section, we report update and query processing
performance metrics for Mesa's production deployment.
We show the metrics over a 7-day period to demonstrate
both their variability and stability. We also show system
growth metrics over a multi-year period to illustrate how
the system scales to support increasing data sizes with
linearly increasing resource requirements, while ensuring
the required query performance. Overall, Mesa is highly
decentralized and replicated over multiple datacenters,
using hundreds to thousands of machines at each datacenter for both update and query processing. Although we
do not report the proprietary details of our deployment,
the architectural details that we do provide are comprehensive and convey the highly distributed, large-scale
nature of the system.
5.1. Update processing
Figure 7 illustrates Mesa update performance for one
data source over a 7-day period. Mesa supports hundreds
of concurrent update data sources. For this particular
data source, on average, Mesa reads 30–60 megabytes of compressed data per second, updating 3–6 million distinct rows and adding about 300 thousand new rows. The data source generates updates in batches about every 5 min, with median and 95th percentile Mesa commit times of 54 and 211 seconds, respectively. Mesa maintains this update
latency, avoiding update backlog by dynamically scaling
resources.
5.2. Query processing
Figure 8 illustrates Mesa's query performance over a 7-day period for tables from the same data source as above. Mesa executed more than 500 million queries per day for those tables, returning 1.7–3.2 trillion rows. The nature of these production queries varies greatly, from simple point lookups to large range scans. We report their average and 99th percentile latencies, which show that Mesa answers most queries within tens to hundreds of milliseconds.
Figure 7. Update performance for a single data source over a 7-day period (percentage of update batches versus time, plus per-day statistics for days 1–7).

Figure 8. Query performance over a 7-day period (per-day queries in millions, rows returned in trillions, average latency in ms, and 99th percentile latency in ms).

(The remaining charts plot queries per second against the number of query servers, and relative growth in percent and latency in ms by month.)
strong consistency and transactional correctness guarantees. It achieves these properties using a batch-oriented
interface, guaranteeing atomicity of updates by introducing transient versioning of data that eliminates the
need for lock-based synchronization of query and update
transactions. Mesa is geo-replicated across multiple datacenters for increased fault-tolerance. Finally, within each
datacenter, Mesas controller/worker framework allows it
to distribute work and dynamically scale the required computation over a large number of machines to provide high
scalability.
Acknowledgments
We would like to thank everyone who has served on the Mesa
team, including former team members Karthik Lakshminarayanan, Sanjay Agarwal, Sivasankaran Chandrasekar,
Justin Tolmer, Chip Turner, and Michael Ballbach, for their
substantial contributions to the design and development
of Mesa. We are also grateful to Sridhar Ramaswamy for
providing strategic vision and guidance to the Mesa team.
Finally, we thank the anonymous reviewers, whose feedback
significantly improved the paper.
References
1. Abouzeid, A., Bajda-Pawlikowski, K., et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB 2, 1 (2009), 922–933.
ACM Transactions
on Interactive
Intelligent Systems
ACM Transactions
on Computation
Theory
www.acm.org/pubs
CAREERS
Purdue University
Professor of Practice in Computer Science
The Department of Computer Science at Purdue
University is soliciting applications for Professor
of Practice positions at the Assistant, Associate,
or Full Professor level to begin Fall 2016. These
are newly created positions offering three- to five-year appointments that are renewable based on
satisfactory performance for faculty with primary
responsibilities in teaching and service. Applicants should hold a PhD in computer science or a
related field, or a Master's degree in computer science or a related discipline and commensurate
experience in teaching or industry. Applicants
should be committed to excellence in teaching,
and should have the ability to teach a broad collection of core courses in the undergraduate
curriculum. Applicants will also be expected to
develop and supervise project courses for undergraduates. Review of applications and candidate
interviews will begin on May 5, 2016, and will continue until the positions are filled.
The Department of Computer Science offers
a stimulating and nurturing educational environment with thriving undergraduate and graduate
ADVERTISING
IN CAREER
OPPORTUNITIES
How to Submit a Classified Line Ad:
Send an e-mail to [email protected]. Please include text, and indicate the issue or issues where
the ad will appear, and a contact
name and number.
Estimates: An insertion order will
then be e-mailed back to you. The
ad will be typeset according to
CACM guidelines. NO PROOFS can
be sent. Classified line ads are NOT
commissionable.
Rates: $325.00 for six lines of text,
40 characters per line. $32.50 for
each additional line after the first
six. The MINIMUM is six lines.
Deadlines: 20th of the month/
2 months prior to issue date.
For latest deadline info, please
contact:
[email protected]
Career Opportunities Online:
Classified and recruitment display
ads receive a free duplicate listing
on our website at:
http://jobs.acm.org
Ads are listed for a period
of 30 days.
For More Information Contact:
Associate/Full Professor of
Cyber Security
The cyber security section of the Faculty of
Duration of contract:
Fixed appointment
Salary scale:
€5219 to €7599 per
month gross
Qualifications:
A detailed research plan and demonstrated record/potentials;
Ph.D. (Electrical Engineering, Computer Engineering, Computer Science, or
related field);
A minimum of 4 years of relevant research experience.
Contact
Prof. R. L. Lagendijk
+31 (0)15-2783731
[email protected]
Applications:
Submit (in English, PDF version) a cover letter, a 2-page research plan, a CV
plus copies of 3 most significant publications, and names of three referees
to: [email protected] (until positions are filled). For more information,
please visit Job Opportunities on http://sist.shanghaitech.edu.cn/
last byte
DOI:10.1145/2915924
Dennis Shasha
Upstart Puzzles
Chair Games
A GROUP OF people is sitting around your
dinner table with one empty chair. Each
person has a name that begins with a
different letter: A, B, C ... Because you
love puzzles, you ask them to rearrange
themselves to end up in alphabetical
order in a clockwise fashion, with one
empty chair just to the left of the person
whose name begins with A. Each move
involves moving one person from one chair to
the empty chair k seats away in either
direction. The goal is to minimize the
number of such moves.
Warm-up. Suppose you start with
eight people around the table, with
nine chairs. The last name of each person begins with the letter shown, and
you are allowed to move a person from
one chair to an empty chair three chairs
away in either direction (see Figure 1).
Can you do it in four moves?
Solution to Warm-Up
C moves from 9 to 6
F moves from 3 to 9
C moves from 6 to 3
F moves from 9 to 6
Now here is a more difficult version
of the problem in which the only moves
allowed are four seats away, starting
with the configuration in Figure 2.
Solution
B moves from 6 to 2
F moves from 1 to 6
A moves from 5 to 1
E moves from 9 to 5
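For readers who want to experiment, here is a small Python sketch that checks a proposed sequence of moves: it verifies that each move travels exactly k chairs, in either direction around the circle, into the currently empty chair, and that the final arrangement is alphabetical clockwise with the empty chair immediately preceding A. The starting arrangement used below is one that is consistent with the published warm-up solution; the printed figure is only partially legible here, so treat it as an assumption.

def apply_moves(n, seats, k, moves):
    """Simulate moves on a circular table with chairs 1..n, numbered clockwise.

    seats maps chair -> person; exactly one chair is absent (the empty chair).
    Each move (person, src, dst) must carry the person exactly k chairs, in
    either direction, into the chair that is currently empty."""
    seats = dict(seats)
    empty = next(c for c in range(1, n + 1) if c not in seats)
    for person, src, dst in moves:
        dist = min((src - dst) % n, (dst - src) % n)
        assert seats.get(src) == person and dst == empty and dist == k, \
            'illegal move: %s from %d to %d' % (person, src, dst)
        seats[dst] = seats.pop(src)
        empty = src
    return seats, empty

def sorted_clockwise(n, seats, empty):
    # Reading clockwise from the chair just after the empty one, the names
    # must appear in alphabetical order (so the empty chair precedes A).
    order = [seats[((empty + i - 1) % n) + 1] for i in range(1, n)]
    return order == sorted(order)

# A starting arrangement consistent with the warm-up solution: chair 6 empty,
# everyone alphabetical except that C and F are swapped.
warmup = {1: 'A', 2: 'B', 3: 'F', 4: 'D', 5: 'E', 7: 'G', 8: 'H', 9: 'C'}
moves = [('C', 9, 6), ('F', 3, 9), ('C', 6, 3), ('F', 9, 6)]
final, empty = apply_moves(9, warmup, 3, moves)
print(sorted_clockwise(9, final, empty))  # True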
Now try to solve these four related
upstart puzzles:
Number of people n and move distance k. Given any number of people
n and move distance k, find the minimum number of moves that would create a clockwise sorted order;
Figure 1. The starting arrangement for the warm-up: eight people (lettered A through H) around a table with nine chairs, where each move goes three chairs from the empty chair.
Figure 2. The goal is again to rearrange the people around the table
to be in clockwise alphabetical order, using as few moves as possible, where each move involves moving a person four seats away
(in either direction) from the empty chair.
Number of people n in a certain arrangement, find best move distance. Given any number of people n, find a move
distance k and the minimum number
of moves of length k that creates a
clockwise sorted order;
Number n of people, one empty chair.
Generalize the problem with n people
and one empty chair to allow movements of distances k1, k2, ..., kj for
some j < n/2; and
Several empty chairs. Generalize
the problem further to allow several
empty chairs.