Showing posts with label protocol buffers. Show all posts

RProtoBuf & HistogramTools: Statistical Analysis Tools for Large Data Sets

Thursday, October 10, 2013

At Google, building, managing, and securing some of the world’s largest storage systems requires complex analysis of filesystem metadata. This analysis is an important part of making sure that the information stored within those systems is quickly accessible and always secure. We're always looking for ways to make our data storage systems more efficient, and this often requires understanding the age, size, and access patterns of the stored data, the failure rates of servers and disks, and more. You can imagine how complex this becomes with each new data center added.

Given the number of files and servers that are relevant for this performance analysis, we bin the metadata into a compact histogram form. We use these output histograms for many purposes, such as (i) building Markov models of data availability, (ii) statistical forecasting of resource usage, and (iii) formulating and solving optimization problems to determine optimal allocation of flash devices.
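To make the binning step concrete, here is a minimal Python sketch of reducing raw metadata values to histogram counts. The file sizes, the power-of-ten bin boundaries, and the function name bin_values are all hypothetical and chosen only for illustration; this is not the pipeline described in the post.

```python
from bisect import bisect_right

def bin_values(values, breaks):
    """Count how many values fall in each half-open bin [breaks[i], breaks[i+1])."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        # Values outside the overall range are ignored in this sketch.
        if breaks[0] <= v < breaks[-1]:
            counts[bisect_right(breaks, v) - 1] += 1
    return counts

# Hypothetical file sizes in bytes (illustrative data only).
file_sizes = [120, 900, 4096, 5000, 70000, 1_000_000, 3_500_000]
# Hypothetical power-of-ten bin boundaries.
breaks = [1, 10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]

counts = bin_values(file_sizes, breaks)
print(counts)
```

However many files are scanned, the result is one count per bin, which is what makes the histogram form compact.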

We rely on several open source tools to make our work easier. The most common tool we use for statistical analysis of the performance, availability, and resource needs of our internal systems is the R programming language. We’ve released two package updates that make R particularly suitable for interacting with other distributed systems.
  • RProtoBuf is an R package for Google’s Protocol Buffer library that allows one to define simple data structures with intuitive getter and setter methods. These data structures can be serialized into an extremely compact format for sending to other distributed systems. Recent releases include improved support for 64-bit integers, protocol buffer extensions, and more.
  • HistogramTools is a new R package I have released that uses RProtoBuf to read in a compact protocol buffer representation of binned data and includes a number of helpful functions for manipulating, plotting, and measuring the statistical information loss due to the binning. In addition to protocol buffers, it also supports importing aggregate performance data directly from DTrace output.
Both packages are available on CRAN and include extensive documentation and examples.
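One way to think about the information lost to binning: a histogram only says how many points fall in each bin, so the true cumulative distribution could lie anywhere between a curve that places every point at the left edge of its bin and one that places every point at the right edge. The Python sketch below measures the gap between those two curves. It is an illustration of the idea under that assumption, not HistogramTools' implementation, and the function names and sample histogram are hypothetical.

```python
def max_cdf_gap(counts):
    """Largest vertical gap between the left-edge and right-edge CDFs.

    Within a bin the two curves differ by that bin's share of the data,
    so the worst case is the fullest bin.
    """
    total = sum(counts)
    return max(counts) / total

def cdf_area_gap(counts, breaks):
    """Area between the two CDFs, normalized by the histogram's total width."""
    total = sum(counts)
    span = breaks[-1] - breaks[0]
    area = sum(c * (hi - lo) for c, lo, hi in zip(counts, breaks, breaks[1:]))
    return area / (total * span)

# Hypothetical histogram: three bins with unequal widths.
counts = [2, 5, 3]
breaks = [0, 10, 20, 40]

print(max_cdf_gap(counts))           # share of data in the fullest bin
print(cdf_area_gap(counts, breaks))  # normalized area between the bounds
```

Both measures shrink toward zero as bins get narrower or as the data spreads more evenly across bins, which matches the intuition that finer binning loses less information.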

If you're interested in learning more, we have shared some of our research findings at conferences such as OSDI, USENIX ATC, and JSM.

By Murray Stokely, Storage Analytics Team Lead

Bringing 64-bit data to R

Thursday, November 24, 2011

The R programming language has become one of the standard tools for statistical data analysis and visualization, and is widely used by Google and many others. The language includes extensive support for working with vectors of integers, numerics (doubles), and many other types, but has lacked support for 64-bit integers. Romain Francois has recently uploaded the int64 package to CRAN as well as updated versions of the Rcpp and RProtoBuf packages to make use of this package. Inside Google, this is important when interacting with other engineering systems such as Dremel and Protocol Buffers, where our engineers and quantitative analysts often need to read in 64-bit quantities from a datastore and perform statistical analysis inside R.

Romain has taken the approach of storing int64 vectors as S4 objects with a pair of R’s default 32-bit integers to store the high and low-order bits. Almost all of the standard arithmetic operations built into the R language have been extended to work with this new class. The design is such that the necessary bit arithmetic is done behind the scenes in high-performance C++ code, but the higher-level R functions work transparently. This means, for example, that you can:

• Perform arithmetic operations between 64-bit operands or between int64 objects and integer or numeric types in R.
• Read and write CSV files that include 64-bit values with read.csv and write.csv, specifying int64 in the colClasses argument when reading (with int64 version 1.1).
• Load and save 64-bit types with the built-in serialization methods of R.
• Compute summary statistics of int64 vectors, such as max, min, range, sum, and the other standard R functions in the Summary Group Generic.
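The high/low storage scheme described above can be sketched in a few lines of Python. This is only an illustration of splitting a 64-bit value into two signed 32-bit halves and adding with a carry; the helper names are hypothetical and this is not the int64 package's C++ code.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(u):
    """Reinterpret an unsigned 32-bit pattern as a signed 32-bit integer."""
    return u - (1 << 32) if u & 0x80000000 else u

def split64(x):
    """Split a signed 64-bit integer into (high, low) signed 32-bit halves."""
    u = x & 0xFFFFFFFFFFFFFFFF  # two's-complement bit pattern
    return to_signed32(u >> 32), to_signed32(u & MASK32)

def join64(hi, lo):
    """Rebuild the signed 64-bit integer from its two halves."""
    u = ((hi & MASK32) << 32) | (lo & MASK32)
    return u - (1 << 64) if u & (1 << 63) else u

def add64(a, b):
    """Add two (high, low) pairs, carrying from the low half to the high half.

    Overflow past 64 bits wraps around in this sketch.
    """
    lo_sum = (a[1] & MASK32) + (b[1] & MASK32)
    carry = lo_sum >> 32
    hi_sum = (a[0] & MASK32) + (b[0] & MASK32) + carry
    return to_signed32(hi_sum & MASK32), to_signed32(lo_sum & MASK32)

big = 2**40 + 5
pair = split64(big)
print(pair, join64(*pair))

# Adding 1 to 2**32 - 1 forces a carry out of the low half.
total = add64(split64(2**32 - 1), split64(1))
print(total, join64(*total))
```

Because each half fits in an ordinary 32-bit integer, a vector of such pairs can be stored with existing integer types, while the arithmetic that stitches the halves together stays hidden from the caller.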

For even higher levels of precision, there is also the venerable and powerful GNU Multiple Precision Arithmetic Library and the R GMP package on CRAN, although Romain’s new int64 package is a better fit for the 64-bit case.

We’ve had to work around the lack of 64-bit integers in R for several years at Google. And after several discussions with Romain, we were very happy to be able to fund his development of this package to solve the problem not just for us, but for the broader open-source community as well. Enjoy!

By Murray Stokely, Software Engineer, Infrastructure Quantitative Team

Introducing protobuf-dt: An Eclipse editor for Protocol Buffers

Tuesday, August 2, 2011

Protobuf-dt is a new Eclipse plug-in for editing protocol buffer descriptor files. It provides all the features you would expect from an IDE editor, such as syntax highlighting, an outline view, content assist and hyperlinking.

Protobuf-dt also provides protocol buffer-specific features like automatic generation of numeric tags, Javadoc-like documentation and integration with protoc.

This plug-in has been heavily tested by Google employees in many different projects through seven internal releases. For more information, please visit our project page, join the mailing list and download the code.

To install the plugin, please follow these instructions.

By Alex Ruiz, Google Engineering Tools

Protocol Buffers: Google's Data Interchange Format

Monday, July 7, 2008

At Google, our mission is organizing all of the world's information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises an important question: How do we encode it all?

XML? No, that wouldn't work. As nice as XML is, it isn't going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy.

Do we just write the raw bytes of our in-memory data structures to the wire? No, that's not going to work either. When we roll out a new version of a server, it almost always has to start out talking to older servers. New servers need to be able to read the data produced by old servers, and vice versa, even if individual fields have been added or removed. When data on disk is involved, this is even more important. Also, some of our code is written in Java or Python, so we need a portable solution.

Do we write hand-coded parsing and serialization routines for each data structure? Well, we used to. Needless to say, that didn't last long. When you have tens of thousands of different structures in your code base that need their own serialization formats, you simply cannot write them all by hand.

Instead, we developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple "get" and "set" methods, and once you're ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.
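To give a feel for why the encoding is compact and why old and new servers can still talk to each other, here is a hand-rolled Python sketch of the simplest part of the wire format: integer fields stored as base-128 varints, each prefixed by a key that packs the field number with a wire type. It handles only varint fields, and helper names such as encode_field and decode_known are hypothetical; in practice you would use the classes generated by the protocol buffer compiler rather than code like this.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def encode_field(field_number, value):
    """Encode one varint field: key = (field_number << 3) | wire type 0."""
    return encode_varint(field_number << 3) + encode_varint(value)

def decode_known(buf, known):
    """Decode varint fields, silently skipping field numbers not in `known`."""
    pos, fields = 0, {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        if key & 0x07 != 0:
            raise ValueError("this sketch handles only varint fields")
        value, pos = decode_varint(buf, pos)
        if key >> 3 in known:
            fields[key >> 3] = value
    return fields

# A newer writer sends fields 1 and 2; an older reader only knows field 1.
message = encode_field(1, 150) + encode_field(2, 7)
print(message.hex())
print(decode_known(message, known={1}))
print(decode_known(message, known={1, 2}))
```

Field 1 with the value 150 takes three bytes on the wire, and a reader that has never heard of field 2 can step over it without failing, which is the property that lets fields be added or removed between server versions.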

OK, I know what you're thinking: "Yet another IDL?" Yes, you could call it that. But IDLs in general have earned a reputation for being hopelessly complicated. One of Protocol Buffers' major design goals, on the other hand, is simplicity. By sticking to a simple lists-and-records model that solves the majority of problems and resisting the desire to chase diminishing returns, we believe we have created something that is powerful without being bloated. And, yes, it is very fast – at least an order of magnitude faster than XML.

And now, we're making Protocol Buffers available to the Open Source community. We have seen how effective a solution they can be to certain tasks, and wanted more people to be able to take advantage of and build on this work. Take a look at the documentation, download the code and let us know what you think.