The initial growth of data on the Internet was primarily driven by a greater population gaining access to the Web. This improved access was fuelled by the advent of a newer range of devices like smartphones and tablets. Riding on the first generation of data growth, a second wave of scaled-up data production was unleashed mainly by the social media platforms that drove the growth upwards exponentially. The collaborative nature of these information-sharing platforms contributed to a viral growth of the data shared through them. The third wave of data generation is largely being led by the proliferation of intelligent connected devices, and will lead to a scale of data generation that is unprecedented. In addition, advancements in scientific fields, coupled with the availability of cheaper computing, have led to newer applications in fields such as medical sciences, physics, astronomy and genetics, where large volumes of data are collected and processed to validate hypotheses, and to enable discoveries and inventions.

The huge growth in data acquisition and storage has led us to the next logical phase, which is to process that data to make sense of it. The need to process this vast volume of data has led to the demand for scalable and parallel systems that process the data at speed and scale. Open source technologies are a natural choice for the high performance computing needed for large scale data processing.
This article aims to provide an overview of the frameworks and components available in open source, across different layers of a Big Data processing stack.

Component architecture stack for Big Data processing
As more and more Big Data, characterised by the three Vs - volume, velocity and variety - began to be generated and acquired, different systems started to evolve to tap the vast and diverse potential of this data. Although some of the systems converged in terms of the features they offered, they were all driven by different underlying design philosophies and, therefore, offered different alternatives. However, one of the guiding principles in developing enterprise data strategies is to have a generic data storage layer as a data lake, which allows different computing frameworks to work on the stored data for different processing use cases, and lets the data be shared across frameworks. Figure 1 illustrates a representational architectural stack for Big Data processing.

Figure 1: Component stack for Big Data processing

The stack can also be visualised as a pipeline consisting of multiple stages through which the data is driven, as can be seen in Figure 2. The unstructured and often schemaless raw data that is sourced from multiple sources, such as transactions, weblogs, open social sites, other linked data sources and databases, devices and instruments, could be in varying formats such as textual data, images, video, audio, etc. This data is cleaned and often checked for errors when ingested into the data storage layer. It is then processed periodically as newer datasets flow into the system. The datasets are further used for exploratory analytics to discover unseen intelligence and insights. During the processing and exploratory phases, the processed datasets are visualised using visualisation tools to aid data understanding and to communicate with stakeholders.

Figure 2: Big Data processing pipeline

The data in the storage layer could be reused by different stakeholders within an organisation. Big Data typically arrives without a predefined structure, and most frameworks, as we will see later, have adapted to this aspect of it. In fact, this very feature is instrumental in the success of a framework.
Let us discuss some of the frameworks and libraries across these different layers.

The storage and data layer
Let's start with the storage and data layer, which is the most critical layer and the foundation of a Big Data stack. Big Data is typically characterised by its volume, requiring huge and conceptually unlimited storage capacities. Advances in technology, contributing to cheaper storage and compute resources, have resulted in the emergence of cluster storage and compute platforms. These platforms have removed the storage limitations and virtually enable unlimited amounts of data storage. They are not limited by the traditional paradigms of data modelling and schema design. They are generally schema-free and allow the storage of all forms of data (structured, semi-structured and unstructured). This enables the creation of systems that are more dynamic, and lets analysts explore the data without being limited by preconceived models. In this section, we will look at some of the popular cluster storage frameworks for Big Data.
HDFS (https://hadoop.apache.org/): This is a scalable, fault-tolerant distributed file system in the Hadoop ecosystem. HDFS is scaled by adding commodity servers to the cluster. The largest known cluster size is about 4500 nodes, holding up to 128 petabytes of data. HDFS supports parallel reading and writing of data, and its aggregate bandwidth scales linearly with the number of nodes. There is built-in redundancy, with multiple copies of data stored in the system: files are broken into blocks, distributed across the cluster, and replicated for reliability.
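As a quick illustration of working with HDFS, the sketch below shells out to the standard hdfs dfs commands from Python; the directory, file name and replication factor are made-up examples, and a configured Hadoop client is assumed to be on the PATH.

import subprocess

def hdfs(*args):
    # Thin wrapper around the 'hdfs dfs' command line client (assumed installed).
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create a directory, load a local log file into the cluster,
# and list the directory to confirm the upload.
hdfs("-mkdir", "-p", "/data/raw")
hdfs("-put", "events.log", "/data/raw/")
hdfs("-ls", "/data/raw")

# Replication is configurable per file; here we request three copies
# and wait (-w) until the blocks are fully replicated.
hdfs("-setrep", "-w", "3", "/data/raw/events.log")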
output. As the Map() phase happens across a very large processing, and at the same time integrate batch processing
distributed dataset, spread across a huge cluster of nodes, with real-time stream processing.
it is subsequently run through a Reduce() phase which Apache Storm (https:/storm.apache.org/) is a system for
aggregates the sorted dataset, coming in from multiple map processing continuous streams of data in real-time. It is highly
nodes. This framework, along with the underlying HDFS scalable, fault tolerant, and ensures the notion of guaranteed
system, enables processing of very large datasets running processing so that no events are lost. While Hadoop provides
into Petabytes, spread across thousands of nodes. the framework for batch processing of data, Storm does the
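To make the two phases concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets any executable act as the mapper or the reducer; the file names and the assumption of plain-text line input are illustrative.

#!/usr/bin/env python3
# mapper.py: the Map() phase, emitting one (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

The reducer receives the mapper output sorted by key, so all the counts for a word arrive together and can be summed in a single pass:

#!/usr/bin/env python3
# reducer.py: the Reduce() phase, summing the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this is typically submitted with the hadoop-streaming JAR, passing the two scripts as the -mapper and -reducer options along with the HDFS input and output paths.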
Apache Flink (https://flink.apache.org/) is a data processing system that combines the scalability and power of the Hadoop HDFS layer with the declarative style and optimisation techniques that are the cornerstone of relational database systems. Flink provides a runtime system that is an alternative to the Hadoop MapReduce framework.
Apache Tez (https://tez.apache.org/) is a distributed data processing engine that sits on top of Yarn in the Hadoop 2.0 ecosystem. Tez models data processing workflows as directed acyclic graphs (DAGs). With this distinctive feature, Tez allows developers to intuitively model their complex data processing jobs as a pipeline of tasks, while leveraging the underlying resource management capabilities of the Hadoop 2.0 ecosystem.
Apache Spark (https://spark.apache.org/) is a distributed execution engine for Big Data processing that provides efficient abstractions to process large datasets in memory. While MapReduce on Yarn provides an abstraction for using a cluster's computational resources, it lacks efficiency for iterative algorithms and interactive data mining, which need to reuse data between computations. Spark implements in-memory, fault-tolerant data abstractions in the form of RDDs (Resilient Distributed Datasets), which are parallel data structures stored in memory. RDDs provide fault tolerance by tracking the transformations used to build them (their lineage) rather than the data itself. If a partition is lost, the transformations need to be reapplied on just that partition to recover it. This is far more efficient than replicating datasets across nodes for fault tolerance, and is claimed to make Spark up to 100 times faster than Hadoop MapReduce for such workloads.
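The short PySpark sketch below illustrates the RDD model just described: each transformation only records lineage, and nothing is computed until an action is called. The input path and the local master setting are placeholders.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-demo")

# Each step builds a new RDD and records its lineage; no data is read or shuffled yet.
lines = sc.textFile("hdfs:///data/raw/events.log")
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# cache() keeps the result in memory for reuse across actions,
# which is where iterative workloads gain over MapReduce.
counts.cache()

# Actions trigger the actual computation. If a partition is lost,
# Spark recomputes just that partition from the recorded lineage.
print(counts.count())
print(counts.take(5))

sc.stop()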
Spark also provides a unified framework for batch processing, stream data processing and interactive data mining, and includes APIs in Java, Scala and Python. It provides an interactive shell for faster querying, a machine learning library (MLlib), GraphX (an API for graph data processing), SparkSQL (a declarative query language) and SparkStreaming (a streaming API for stream data processing).
SparkStreaming is a system for processing event streams in real-time. It treats streaming as the processing of datasets in micro-batches: the incoming stream is divided into batches of a configured number of seconds, and these batches are fed into the underlying Spark system and processed in the same way as in the Spark batch programming paradigm. This makes it possible to achieve the very low latencies needed for stream processing, and at the same time to integrate batch processing with real-time stream processing.
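A minimal SparkStreaming sketch of this micro-batch model is given below; it assumes a plain-text source on a local socket, and the host, port and two-second batch interval are arbitrary choices.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "microbatch-demo")
ssc = StreamingContext(sc, 2)  # group the incoming stream into 2-second batches

# Each micro-batch of lines is processed with the usual Spark operations.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # start receiving and processing batches
ssc.awaitTermination()  # run until stopped externally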
Apache Storm (https://storm.apache.org/) is a system for processing continuous streams of data in real-time. It is highly scalable, fault tolerant, and offers guaranteed message processing so that no events are lost. While Hadoop provides the framework for batch processing of data, Storm does the same for streaming event data.
It provides directed acyclic graph (DAG) processing for defining the data processing pipeline, or topology, using a notion of spouts (input data sources) and bolts (processing units). Streams are the unbounded sequences of tuples that flow through these processing pipelines.
A Storm cluster consists of three components:
Nimbus runs on the master node and is responsible for distributing work amongst the worker processes.
Supervisor daemons run on the worker nodes; they listen for the tasks assigned and manage the worker processes, starting and stopping them as needed to get the work done.
Zookeeper handles the coordination between Nimbus and the Supervisors, and maintains the state needed for fault tolerance.
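Although Storm itself is written in Java and Clojure, topologies can also be written in Python. The sketch below uses the third-party streamparse package (an assumption, not part of Storm proper); the spout, bolt and field names are illustrative, and the packaging and submission steps that streamparse needs are omitted.

from streamparse import Spout, Bolt, Topology

class SentenceSpout(Spout):
    outputs = ["sentence"]

    def next_tuple(self):
        # A real spout would pull from a queue such as Kafka or RabbitMQ.
        self.emit(["the quick brown fox jumps over the lazy dog"])

class SplitWordsBolt(Bolt):
    outputs = ["word"]

    def process(self, tup):
        # Each incoming tuple carries one sentence; emit one tuple per word.
        for word in tup.values[0].split():
            self.emit([word])

class WordTopology(Topology):
    # Wire the DAG: the bolt subscribes to the spout's stream.
    sentences = SentenceSpout.spec()
    words = SplitWordsBolt.spec(inputs=[sentences])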
Higher level languages for analytics and querying
As cluster programming frameworks evolved to solve Big Data processing problems, another problem started to emerge as more and more real-life use cases were attempted. Programming with these computing frameworks became increasingly complex and difficult to maintain. Skill scalability was another concern, as most of the people available with domain expertise were familiar with skills such as SQL and scripting rather than low-level programming APIs. As a result, higher level programming abstractions for the cluster computing frameworks began to emerge, abstracting away the low-level programming APIs. Some of these frameworks are discussed in this section.
Hive (https://hive.apache.org/) and Pig (https://pig.apache.org/) are higher level language implementations for MapReduce. The language interfaces internally generate MapReduce programs from the queries written in the high level languages, thereby abstracting the underlying nitty-gritty of MapReduce and HDFS.
While Pig implements Pig Latin, a procedural-style language interface, Hive provides the Hive Query Language (HQL), a declarative, SQL-like language interface.
Pig lends itself well to writing data processing pipelines for iterative processing scenarios. Hive, with its declarative SQL-like language, is more usable for ad hoc data querying, explorative analytics and BI.
BlinkDB (http://blinkdb.org/) is a recent entrant into the Big Data processing ecosystem. It provides a platform for interactive query processing that supports approximate queries, trading a small, bounded loss of accuracy for much faster response times by running queries on samples of the data.
Cluster resource management
As Big Data processing matured, newer computing frameworks beyond MapReduce started to evolve. Also, HDFS was widely accepted for Big Data storage, and it did not make sense to replicate data separately for each framework. The Hadoop community therefore worked on overhauling the platform to take it beyond MapReduce. The result was Hadoop 2.0, which separated resource management from application management. The resource management system was named Yarn.
Yarn is again a master-slave architecture, with the ResourceManager acting as a master that manages the resource assignments to the different applications on the cluster. The slave component, called the NodeManager, runs on every node in the cluster and is responsible for launching the compute containers needed by the applications.
The ApplicationMaster is the framework-specific entity. It is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to submit and monitor the application tasks.
This decoupling allowed other frameworks to work alongside MapReduce, accessing and sharing data on the same cluster, thereby helping to improve cluster utilisation.
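On a running cluster, the state maintained by the ResourceManager can be inspected with the yarn command line client; the sketch below simply shells out to it from Python and assumes a configured Hadoop client on the PATH.

import subprocess

# List the NodeManagers registered with the ResourceManager,
# then the applications currently running on the cluster.
subprocess.run(["yarn", "node", "-list"], check=True)
subprocess.run(["yarn", "application", "-list", "-appStates", "RUNNING"], check=True)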
Apache Mesos (http://mesos.apache.org/) is a generic cluster resource management framework that can manage every resource in the data centre. Mesos differs from Yarn in the way its scheduling works. Mesos implements a two-level scheduling mechanism, where the master makes resource offers to the framework schedulers, and the frameworks decide whether to accept or decline them. This model makes Mesos very scalable and generic, and allows frameworks to meet specific goals, such as data locality, really well.
Mesos has a master/slave architecture, with the Mesos master running on one of the nodes and shadowed by several standby masters that can take over in case of a failure. The master manages the slave processes on the cluster nodes and the frameworks that run tasks on the nodes. A framework running on Mesos has two components: a scheduler that registers with the master, and a framework executor that launches on the Mesos slave nodes. The slave nodes report to the master about the resources they have available to offer. The Mesos master looks up the allocation policies and offers the resources to the frameworks accordingly. A framework, based on its goal and the tasks that need to be run, accepts the offer completely or partially, or can even decline it. It sends back a response with its acceptance and the tasks to be run, if any. The Mesos master forwards the tasks to the corresponding slaves, which allocate the offered resources to the executor, and the executor in turn launches the tasks.
Machine learning libraries
Big Data would not be worth the effort if it didn't provide business value at the end. Machine learning enables systems to learn from large amounts of data and to apply that learning to predict outcomes on unseen input datasets. Machine learning systems have enabled several real-world use cases, such as targeted ad campaigns, recommendation engines, next-best-offer/action scenarios, self-learning autonomous systems, etc. We will look at a few of the frameworks in this space.
Apache Mahout (http://mahout.apache.org/) aims to provide a scalable machine learning platform, with several algorithms implemented out-of-the-box and a framework for implementing custom algorithms as well. Although Apache Mahout was one of the earliest ML libraries, it was originally written for the MapReduce programming paradigm. However, MapReduce is not well suited to the iterative nature of machine learning algorithms, and so Mahout did not find great success there. After Spark started gaining momentum, Mahout has been ported to run on Apache Spark, and its MapReduce-based implementations have been discontinued.
Spark MLlib (https://spark.apache.org/mllib/) is a scalable machine learning library written on top of Spark and available as an extension of the Spark Core execution engine. Being a native extension of Spark Core gives MLlib an advantage for iterative workloads. It provides implementations of several ML algorithms for problems such as classification, regression, collaborative filtering, clustering, decomposition, etc.
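To give a flavour of the MLlib API, the sketch below trains a k-means clustering model on a handful of points using the RDD-based pyspark.mllib package; the data, the number of clusters and the iteration count are arbitrary values chosen for illustration.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-kmeans-demo")

# Toy two-dimensional points forming two loose groups.
points = sc.parallelize([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [9.0, 9.1], [9.2, 8.9], [8.8, 9.3],
])

# Train k-means with k=2; MLlib distributes the iterations across the cluster.
model = KMeans.train(points, k=2, maxIterations=20)

print(model.clusterCenters)       # the learned centroids
print(model.predict([0.1, 0.2]))  # cluster index for a new point

sc.stop()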
PredictionIO (http://prediction.io/) is a scalable machine learning server that provides a framework for faster prototyping and productionising of machine learning applications. It is built on Apache Spark and leverages Spark MLlib to provide implementation templates of several machine learning algorithms. It provides an interface to expose the trained prediction model as a service through an event-server-based architecture, and also provides a means to persist trained models in a distributed environment. The events generated are collected in real-time and can be used to retrain the model as a batch job. A client application can query the service over REST APIs and get the predicted results back in a JSON response.
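To illustrate the REST interaction described above, the snippet below posts a query to a deployed PredictionIO engine using the requests library; the host, port and the query and response fields depend on the engine template, so the names used here are placeholder assumptions.

import json
import requests

# A deployed engine typically serves queries at /queries.json on port 8000
# (the request fields below are template-specific assumptions).
url = "http://localhost:8000/queries.json"
query = {"user": "u-42", "num": 5}  # e.g. top-5 recommendations for a user

response = requests.post(url, json=query, timeout=10)
response.raise_for_status()

print(json.dumps(response.json(), indent=2))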
By: Subhash Bylaiah and Dr B. Thangaraju
Subhash Bylaiah is a senior technologist with over 15 years of experience in database systems and application development. His current assignment is with Wipro Technologies. He can be reached at [email protected].
Dr B. Thangaraju is an open source software (OSS) evangelist, who works at Talent Transformation, Wipro Technologies, Bangalore. He can be reached at [email protected].