Big Data Processing
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently. In short, Big Data is still data, but of enormous size.
Following are some examples of Big Data.
The New York Stock Exchange, for example, generates about one terabyte of new trade data per day.
Social Media
Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments, and so on.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. However, we now foresee issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes. Looking at these figures, one can easily understand why the name Big Data is used and imagine the challenges involved in its storage and processing.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, and so on. Organizations today have a wealth of data available to them but, unfortunately, often do not know how to derive value from it because the data is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file, such as the personal records below (a small parsing sketch follows the records).
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
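As a minimal sketch (not part of the original material), the records above can be parsed into structured rows with a few lines of Python; the wrapping <records> root element is added here only so that the fragment is well-formed XML:

import xml.etree.ElementTree as ET

# The <rec> elements shown above, wrapped in a <records> root added here
# only so that the document is well-formed XML.
xml_text = """<records>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):
    # Each <rec> becomes a structured (name, sex, age) row.
    print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))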
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction history files, and so on. OLTP systems are built to work with structured data, where data is stored in relations (tables).
Big Data can be described by the following characteristics:
Volume
Variety
Velocity
Variability
(i) Volume – The name Big Data itself is related to a size that is enormous. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic that needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential of the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, and so on. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
The ability to process Big Data brings in multiple benefits, such as the following.
Access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.
Traditional customer feedback systems are getting replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and natural language processing technologies are being
used to read and evaluate consumer responses.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.
Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown
correlations, market trends, and customer preferences. Big Data analytics provides various
advantages—it can be used for better decision making, preventing fraudulent activities, among other
things.
In today’s world, Big Data analytics is fueling everything we do online—in every industry.
Take the music streaming platform Spotify for example. The company has nearly 96 million users
that generate a tremendous amount of data every day. Through this information, the cloud-based
platform automatically generates suggested songs—through a smart recommendation engine—
based on likes, shares, search history, and more. What enables this is the techniques, tools, and
frameworks that are a result of Big Data analytics.
If you are a Spotify user, you have probably come across the top recommendations section, which is based on your likes, search history, and other signals. This is powered by a recommendation engine that collects data and then filters it using algorithms; this is what Spotify does.
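As a rough illustration of the idea (a toy sketch, not Spotify's actual engine; all user and song names below are invented), a very simple collaborative filter recommends songs liked by users whose tastes overlap with yours:

# Toy collaborative filtering: recommend songs liked by similar users.
# All user and song names are invented for illustration.
likes = {
    "alice": {"song_a", "song_b", "song_c"},
    "bob":   {"song_b", "song_c", "song_d"},
    "carol": {"song_a", "song_e"},
}

def recommend(user, likes):
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)          # similarity = number of shared likes
        for song in theirs - mine:            # songs the user has not heard yet
            scores[song] = scores.get(song, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice", likes))   # ['song_d', 'song_e']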
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify fraudulent activities and discrepancies. The organization leverages it to narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed forces
across the globe, uses Big Data analytics to analyze how efficient the engine designs are and if there
is any need for improvements.
3. Quicker and Better Decision Making
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the company
leverages it to decide if a particular location would be suitable for a new outlet or not. They will
analyze several different factors, such as population, demographics, accessibility of the location, and
more.
4. Improved Customer Experience
Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. They monitor
tweets to find out their customers’ experience regarding their journeys, delays, and so on. The
airline identifies negative tweets and does what’s necessary to remedy the situation. By publicly
addressing these issues and offering solutions, it helps the airline build good customer relations.
Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.
Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.
Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets
are integrated.
Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover
useful information.
Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where
the final results of the analysis are made available to business stakeholders who will take
action.
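A minimal Python sketch of stages 3 to 6 of this lifecycle (the records and field names are invented for illustration):

# Hypothetical raw records; None marks a corrupt field.
raw = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": None},   # corrupt record
    {"region": "north", "sales": "80"},   # needs transformation
]

# Stage 3 - data filtering: drop corrupt records.
filtered = [r for r in raw if r["sales"] is not None]

# Stage 4 - data extraction/transformation: coerce into a compatible form.
extracted = [{"region": r["region"], "sales": int(r["sales"])} for r in filtered]

# Stage 5 - data aggregation: combine records sharing the same field.
totals = {}
for r in extracted:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

# Stage 6 - data analysis: a simple statistic over the aggregated data.
print(totals)                         # {'north': 200}
print(max(totals, key=totals.get))    # best-performing region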
Types of Big Data Analytics
1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps in creating reports, like
a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization across its
office and lab space. Using descriptive analytics, Dow was able to identify underutilized space. This
space consolidation helped the company save nearly US $4 million annually.
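A minimal sketch of descriptive analytics, summarizing invented past revenue figures into a readable report:

# Invented monthly revenue figures, in millions of dollars.
revenue = {"Jan": 1.2, "Feb": 1.5, "Mar": 1.1, "Apr": 1.8}

total = sum(revenue.values())
average = total / len(revenue)
best = max(revenue, key=revenue.get)

# Descriptive summary: what happened in the past period.
print(f"Total revenue:   ${total:.1f}M")
print(f"Average / month: ${average:.1f}M")
print(f"Best month:      {best}")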
2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques like drill-down, data mining, and data recovery are all examples. Organizations use diagnostic analytics because it provides in-depth insight into a particular problem.
Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form didn’t
load correctly, the shipping fee is too high, or there are not enough payment options available. This
is where you can use diagnostic analytics to find the reason.
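A toy diagnostic drill-down over hypothetical checkout logs, counting how often each abandonment reason appears so that the dominant cause can be investigated first:

from collections import Counter

# Hypothetical abandoned-cart log entries with the reason recorded at checkout.
abandoned = [
    {"cart_id": 1, "reason": "high_shipping_fee"},
    {"cart_id": 2, "reason": "form_error"},
    {"cart_id": 3, "reason": "high_shipping_fee"},
    {"cart_id": 4, "reason": "no_payment_option"},
]

reasons = Counter(entry["reason"] for entry in abandoned)
print(reasons.most_common())   # drill down into the most frequent cause first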
3. Predictive Analytics
This type of analytics looks into the historical and present data to make predictions of the future.
Predictive analytics uses data mining, AI, and machine learning to analyze current data and make
predictions about the future. It works on predicting customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they have to take to protect their clients
against fraudulent transactions. Using predictive analytics, the company uses all the historical
payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
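A minimal predictive sketch, assuming invented monthly sales figures: it fits a straight-line trend to the history and extrapolates one month ahead (real predictive models use far richer features and algorithms):

# Invented monthly sales history: (month index, units sold).
history = [(1, 100), (2, 110), (3, 125), (4, 140)]

n = len(history)
sx = sum(x for x, _ in history)
sy = sum(y for _, y in history)
sxx = sum(x * x for x, _ in history)
sxy = sum(x * y for x, y in history)

# Ordinary least-squares fit of y = a*x + b.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

next_month = 5
print(f"Predicted sales for month {next_month}: {a * next_month + b:.0f}")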
4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.
Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of analytics is
used to build an algorithm that will automatically adjust the flight fares based on numerous factors,
including customer demand, weather, destination, holiday seasons, and oil prices.
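A toy prescriptive rule, with invented thresholds and factors, that adjusts a base fare from a few of the signals mentioned above; a real system would optimize over many more inputs:

def adjust_fare(base_fare, demand_ratio, is_holiday, oil_price_per_barrel):
    """Prescribe a fare using simple, invented business rules."""
    fare = base_fare
    fare *= 1.0 + 0.5 * max(0.0, demand_ratio - 1.0)   # surge when demand exceeds capacity
    if is_holiday:
        fare *= 1.15                                    # holiday-season premium
    if oil_price_per_barrel > 90:
        fare *= 1.05                                    # fuel surcharge
    return round(fare, 2)

print(adjust_fare(base_fare=200.0, demand_ratio=1.2, is_holiday=True,
                  oil_price_per_barrel=95))             # 265.65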
Spark, for example, is a widely used engine for real-time processing and analyzing large amounts of data.
Here are some of the sectors where Big Data is actively used:
Ecommerce - Predicting customer trends and optimizing prices are a few of the ways e-
commerce uses Big Data analytics
Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in
improved sales
Education - Used to develop new and improve existing courses based on market
requirements
Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict
how likely they are to have health issues
Media and entertainment - Used to understand the demand of shows, movies, songs, and
more to deliver a personalized recommendation list to its users
Banking - Customer income and spending patterns help to predict the likelihood of choosing
various banking offers, like loans and credit cards
Telecommunications - Used to forecast network capacity and improve customer experience
Government - Big Data analytics helps governments in law enforcement, among other
things.
What is Hadoop?
Apache Hadoop is an open-source framework that is used to efficiently store and process large
datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to
store and process the data, Hadoop allows clustering multiple computers to analyse massive
datasets in parallel more quickly.
Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or
low-end hardware. HDFS provides better data throughput than traditional file systems, in
addition to high fault tolerance and native support of large datasets.
Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and
resource usage. It schedules jobs and tasks.
MapReduce – A framework that helps programs do the parallel computation on data. The
map task takes input data and converts it into a dataset that can be computed in key value
pairs. The output of the map task is consumed by reduce tasks to aggregate output and
provide the desired result.
Hadoop Common – Provides common Java libraries that can be used across all modules.
Hadoop History
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began working on the Apache Nutch project. The goal of Apache Nutch was to build a search-engine system that could index one billion pages. After a lot of research, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. They also realized that their project architecture would not be capable of handling billions of web pages. So they started looking for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
In 2003, they came across a paper published by Google that described the architecture of Google's distributed file system, GFS (Google File System), for storing large data sets. They realized that this paper could solve their problem of storing the very large files generated by web crawling and indexing. But this paper was only half of the solution.
In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. For Doug Cutting and Mike Cafarella, this paper was the other half of the solution for the Nutch project. Both techniques (GFS and MapReduce) were described only on paper; Google had not released its implementations as open source. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and MapReduce) as open source in the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:
(a) Nutch would not achieve its potential until it ran reliably on larger clusters.
(b) That looked impossible with just two people (Doug Cutting and Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started looking for a company interested in investing in their efforts. He found Yahoo!, which had a large team of engineers eager to work on the project.
So, in 2006, Doug Cutting joined Yahoo along with the Nutch project. With Yahoo's help, he wanted to provide the world with an open-source, reliable, scalable computing framework. At Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He chose the name Hadoop because it was the name of a yellow toy elephant owned by his son, and it was easy to pronounce and unique.) He then wanted to make Hadoop work well on thousands of nodes, so, with GFS and MapReduce as the basis, he started to work on Hadoop.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
In January of 2008, Yahoo released Hadoop as an open-source project to ASF (Apache Software
Foundation). And in July of 2008, Apache Software Foundation successfully tested a 4000-node
cluster with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other industries.
Apache Hadoop was born to enhance the usage of big data and solve its major issues. The web was generating loads of information on a daily basis, and it was becoming very difficult to manage the data of around one billion pages of content. Google devised a revolutionary methodology for processing data, popularly known as MapReduce, and a year later published a white paper on the MapReduce framework. Doug Cutting and Mike Cafarella, inspired by that white paper, created Hadoop to apply these concepts to an open-source software framework that supported the Nutch search-engine project. Considering the original case study, Hadoop was designed with a much simpler storage infrastructure.
Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest strength is scalability: it scales seamlessly from a single node to thousands of nodes without any issue.
Big Data spans many different domains: with Hadoop we can manage data from videos, text, transactional records, sensor information, statistical data, social media conversations, search engine queries, e-commerce data, financial information, weather data, news updates, forum discussions, executive reports, and so on.
Doug Cutting and his team developed an open-source project known as HADOOP, which allows you to handle very large amounts of data. Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel and complete statistical analysis can be accomplished on large amounts of data.
It is a Java-based framework. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. It supports large collections of data sets in a distributed computing environment.
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution and allows developers to write Map-Reduce programs in different programming languages such as Ruby, Perl, Python, C++, and so on. We can use any language that can read from standard input (STDIN), such as keyboard input, and write to standard output (STDOUT). The Hadoop framework is completely written in Java, but programs for Hadoop do not necessarily need to be coded in Java. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.
How Hadoop Streaming Works
The flow at the core of Hadoop Streaming is a basic MapReduce job. It has an Input Reader, which is responsible for reading the input data and producing a list of key-value pairs. We can read data in .csv format, in a delimited format, from a database table, or as image data (.jpg, .png), audio data, and so on. The only requirement for reading all these types of data is that we have to create a particular input format for that data. The input reader contains the complete logic about the data it is reading. For example, if we want to read an image, we have to specify the logic in the input reader so that it can read the image data and generate key-value pairs for it.
If we are reading image data, we can generate a key-value pair for each pixel, where the key is the location of the pixel and the value is its color value (0-255 for a colored image). This list of key-value pairs is fed to the Map phase, and the Mapper works on each key-value pair and generates intermediate key-value pairs, which are fed to the Reducer after shuffling and sorting. The final output produced by the Reducer is then written to HDFS. This is how a simple MapReduce job works.
Now let's see how we can use different languages like Python, C++, and Ruby with Hadoop for execution. We can run such an arbitrary language by running it as a separate process. For that, we create our external mapper and run it as an external, separate process. These external map processes are not part of the basic MapReduce flow. The external mapper takes input from STDIN and produces output on STDOUT. As key-value pairs are passed to the internal mapper, the internal mapper process sends them via STDIN to the external mapper, where we have written our code in some other language such as Python. The external mapper processes these key-value pairs, generates intermediate key-value pairs on STDOUT, and sends them back to the internal mapper.
The Reducer works similarly. Once the intermediate key-value pairs have been processed through the shuffle and sort phase, they are fed to the internal reducer, which sends them via STDIN to the external reducer process running separately. The output generated by the external reducer is gathered via STDOUT, and finally the output is stored in our HDFS.
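As a concrete sketch of such external mapper and reducer processes, here is the classic word-count pair written in Python. The file names mapper.py and reducer.py are chosen here for illustration; they would be passed to the Hadoop streaming jar via the -mapper, -reducer, and -file options listed below.

# ---- mapper.py : reads raw text lines from STDIN, emits "word<TAB>1" on STDOUT ----
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# ---- reducer.py : reads sorted "word<TAB>count" pairs from STDIN, sums per word ----
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

# Local simulation of the streaming flow (shell):
#   cat input.txt | python3 mapper.py | sort | python3 reducer.py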
Some Hadoop Streaming Commands
Option          Description
-input          DFS input path for the Map step
-output         DFS output path for the Reduce step
-mapper         Executable or script to run as the mapper
-reducer        Executable or script to run as the reducer
-file           Ships the mapper/reducer script to the compute nodes
-combiner       Executable to run as the combiner on the map output
This is how Hadoop Streaming, which is available in Hadoop by default, works. We simply use this feature by supplying our own external mappers and reducers. Hadoop Streaming is a powerful feature: anyone can write their code in the language of their choice.
Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform, or suite, which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that, together, form the Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too that are
part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop; because everything revolves around data, its synthesis and analysis become easier.
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
HDFS consists of two core components:
1. Name Node
2. Data Node
The Name Node is the prime node; it contains metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes run on commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and the hardware, and thus works at the heart of the system.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps manage resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair based results, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
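A minimal in-memory sketch of the Map()/Reduce() idea in Python; the grouping step here stands in for Hadoop's shuffle and sort, and the records are invented (a classic max-temperature-per-year example):

from collections import defaultdict

# Invented input records: "year,temperature" lines.
records = ["1990,31", "1990,35", "1991,28", "1991,33", "1991,30"]

# Map(): organize each record into a (key, value) pair.
mapped = []
for rec in records:
    year, temp = rec.split(",")
    mapped.append((year, int(temp)))

# Shuffle & sort: group values by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce(): aggregate each group into a smaller set of tuples.
reduced = {year: max(temps) for year, temps in grouped.items()}
print(reduced)   # {'1990': 35, '1991': 33}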
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just
the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of
the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable as it allows real-time processing and batch processing both. Also, all the
SQL datatypes are supported by Hive thus, making the query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers, works on establishing data storage permissions and the connection, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
Mahout provides a library of scalable machine-learning algorithms (such as clustering, classification, and collaborative filtering) that can run on top of Hadoop.
Apache Spark:
It is a platform that handles process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
It uses in-memory resources, which makes it faster than the earlier disk-based MapReduce processing in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing; hence both are used in most companies, often side by side.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus capable of handling almost anything within a Hadoop database. It provides capabilities similar to Google's BigTable and is therefore able to work on Big Data sets effectively.
When we need to search for or retrieve a few small occurrences within a huge database, the request must be processed within a very short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing and looking up this limited data.
Other Components: Apart from all of these, there are some other components too that carry out a
huge task in order to make Hadoop capable of processing large datasets. They are as follows:
Solr, Lucene: These two services perform the task of searching and indexing with the help of Java libraries. Lucene is based on Java and also provides a spell-check mechanism. Solr is built on top of, and driven by, Lucene.
Oozie: Oozie simply performs the task of a scheduler: it schedules jobs and binds them together as a single unit. There are two kinds of jobs, Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
IBM, a US-based computer hardware and software manufacturer, implemented a Big Data strategy in which the company offered solutions to store, manage, and analyze the huge amounts of data generated daily and equipped large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services would help its clients become more competitive and drive growth.
Issues:
· Understand the concept of Big Data and its importance to large, medium, and small companies
in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the various issues and
challenges associated with this.
· Explore ways in which IBM’s Big Data strategy could be improved further.
Introduction to Infosphere:
InfoSphere Information Server provides a single platform for data integration and governance.
The components in the suite combine to create a unified foundation for enterprise information
architectures, capable of scaling to meet any information volume requirements.
You can use the suite to deliver business results faster while maintaining data quality and integrity
throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate to understand the
meaning, structure, and content of information across a wide variety of sources.
By using InfoSphere Information Server, your business can access and use information in new ways
to drive innovation, increase operational efficiency, and lower risk.
BigInsights:
BigInsights is a software platform for discovering, analyzing, and visualizing data from disparate
sources.
The flexible platform is built on an Apache Hadoop open-source framework that runs in parallel on
commonly available, low-cost hardware.
Big Sheets:
BigSheets is a browser-based analytic tool, included in the InfoSphere BigInsights Console, that you use to break large amounts of unstructured data into consumable, situation-specific business contexts.
These deep insights help you to filter and manipulate data from sheets even further.