Unit I - Big Data Programming

INTRODUCTION TO BIG DATA AND HADOOP
UNIT I
Types of Digital Data, Introduction to Big Data, Big Data Analytics, History of Hadoop,
Apache Hadoop, Analysing Data with Unix tools, Analysing Data with Hadoop, Hadoop
Streaming, Hadoop Echo System, IBM Big Data Strategy, Introduction to Infosphere Big
Insights and Big Sheets.
Types of Digital Data

Three types of big data exist:
▪ Structured Data
▪ Unstructured Data
▪ Semi-Structured Data
While these three words are theoretically relevant to all levels ofanalytics, they are critical in
the context of big data. Understanding where raw data originates and how it must be
processed prior to analysis becomes even more critical when dealing with large amounts of
big data. Because there is so muchof it, information extraction must be efficient to justify the
effort.
The data's structure dictates how to work with it and what insights it may provide. Before
data can be evaluated, it mustundergo an extract, transform, and load (ETL) process. It is a literal
term: data is gathered, structured so that an application can access it, and then saved for
later use. Each data structure requires a unique ETL method.
Let us define what they signify and how they relate to big dataanalytics.
Structured Data
The most straightforward kind of data to deal with is structureddata. It is highly ordered, with
parameters defining its size.
Consider spreadsheets; each item of data is organised into rows and columns. Certain variables
specify specific components that are readily discoverable.
It is comprised of all our quantitative data:
▪ Age
▪ Billing
▪ Contact
▪ Address
▪ Expenses
▪ Numbers of debit/credit cards
Because structured data is already composed of concrete numbers, it makes data collection
and sorting considerably moreuncomplicated for software.
Structured data is organised according to schemas; road maps todata points effectively. These
schemas define the location andmeaning of each item.
A payroll database will include employee identifying information,pay rates, hours worked,
and the way money is distributed, among other things. Each of these dimensions will be
defined bythe schema for the application that uses it. The software will not have to sift
through data to ascertain its meaning; it can immediately gather and analyse it.
Unstructured Data
Not all data is as neatly packaged and organised with use instructions as structured data.
The general view is that no morethan 20% of all data is organised.
Thus, what constitutes the remaining four-fifths of all availableinformation? Due to the
lack of structure, naturally, this is unstructured data.
We may deduce why it comprises such a large portion of the current data library. Almost
all activities performed on acomputer create unstructured data. Nobody is transcribing their
phone conversations or categorising each tet they send.
While organised data saves time throughout an analytical process, the time and effort
required to make unstructured data readable are inconvenient.
The ETL method is straightforward for structured data. It is cleaned and checked before
the information is loaded into a database during the converting step. Hover, with
unstructureddata, this second step becomes far more challenging.
To get anything approaching valuable data, the dataset must beinterpretable. Hover, the work
might be much more beneficialthan the effortless alternative to unstructured data processing.
Inathletics, as they say, we get what we put in.
Semi-structured data:
Semi-structured data straddles the structured and unstructured worlds. Typically, this
converts to unstructured data withassociated metadata. This may be intrinsic data acquired
during the collection process, such as time, location, device ID stamp, email address, or a
semantic tag added to the data subsequently.
Assume we capture a photograph of our pet using our phone. It routinely accounts for the
date and time the photograph was shot. The GPS coordinates now of capture and our device's
ID. Ouraccount information is associated with the file using a b-basedstorage service, such
as iCloud.
When we send an email, the time it was sent, the email addressesto and from, the internet
protocol address of the device from which the email was sent, and other bits of
information are associated with the email's content.
In both cases, the actual content (i.e., the pixels that comprise the picture and the text that
comprise the email) is unstructured. However, some components enable the data to be
categorised according to qualities.
Big Data
What is Big Data?
Big Data refers to large amounts of massive data yet increasesexponentially in size over
time. Data is so extensive and complicated that no usual data management methods can
effectively store or handle it. Big data is likewise data, but it isenormous.
Examples of Big Data.
Discovering consumer shopping habits

Finding new customer leads
Fuel optimisation tools for the transportation industry
Live road mapping for autonomous vehicles
Monitoring health conditions through data from wearables Personalised health plans
for cancer patients
Personalised marketing
Predictive inventory ordering
Real-time data monitoring and cybersecurity protocols
Streamlined media streaming
User demand prediction for ridesharing companies
Big Data gives us unprecedented insights and opportunities, but it also raises concerns and
questions that must be addressed:
Data privacy: The Big Data we now generate contains a lot of information about our
personal lives, much of which we have aright to keep private
Data security: Even if we decide we are happy for someone to have our data for a purpose,
can we trust them to keep it safe?
Data discrimination: When everything is known, will it become acceptable to discriminate

against people based on data we haveon their lives? We already use credit scoring to decide
who can borrow money, and insurance is heavily data-driven.
Data quality: Not enough emphasis on quality and contextualrelevance. The trend with
technology is collecting more raw datacloser to the end user. The danger is data in raw format has
qualityissues. Reducing the gap between the end user and raw dataincreases issues in data
quality.
Facing up to these challenges is an important part of Big Data, andthey must be addressed by
organisations who want to take advantage of data. Failure to do so can leave businesses
vulnerable, not just in terms of their reputation, but also legally and financially.
Big Data Characteristics
Three attributes stand out as defining Big Data characteristics:

1. Volume: Huge volume of data: Rather than thousands or millions of rows, Big Data
can be billions of rows and millions ofcolumns.
2. Variety: Complexity of data types and structures: Big Datareflects the variety of new
data sources, formats, and structures,including digital traces being left on the web and other
digitalrepositories for subsequent analysis.
3. Velocity: Speed of new data creation and growth: Big Data candescribe high velocity
data, with rapid data ingestion and nearreal time analysis.
Fig. 1: Big data V's
This can be extended to add a fourth and fifth V’s

4. Veracity: It is equivalent to quality. We have all the data, but could we be missing
something? Are the data “clean” and accurate? Do they really have something to offer?
5. Value: There is another V to take into account when looking at big data: Value having
access to big data is no good unless we canturn it into value. Companies are starting to generate
amazing value from their big data.
Big Data Analytics
Data Analytics is the process of examining raw data (data sets) with the
purpose of drawingconclusions about that information, increasingly with the aid of
specialized systems and software.
Data Analytics involves applying an algorithmic or mechanical process to derive
insights. For example, running through a number of data sets to look for meaningful
correlations between each other.
It is used in a number of industries to allow the organizations and companies to
make betterdecisions as well as verify and disprove existing theories or models.
The focus of Data Analytics lies in inference, which is the process of deriving
conclusions that are solely based on what the researcher already knows.
Data analytics initiatives can help businesses increase revenues, improve
operational efficiency, optimize marketing campaigns and customer service efforts,
respond more quickly to emergingmarket trends and gain a competitive edge -- all
with the ultimate goal of boosting business performance.
Types of Analytics:
There are 4 types of analytics. Here, we start with the simplest one and go down to
more sophisticated. As it happens, the more complex an analysis is, the more value it
brings.
1. Descriptive Analytics
2. Diagnostic Analysis
3. Perspective Analytics
4. Predictive Analytics
1. Descriptive analytics:
• The simplest way to define descriptive analytics is that, it answers the question “What has
happened?”
• This type of analytics, analyses the data coming in real-time and historical data for insights on
how toapproach the future.
• The main objective of descriptive analytics is to find out the reasons behind precious success or
failure in the past.
• The ‘Past’ here, refers to any particular time in which an event had occurred and this could be a
month
ago or even just a minute ago.
• The vast majority of big data analytics used by organizations falls into the category of
descriptive analytics. 90% of organizations today use descriptive analytics which is the most
basic form of analytics.
• Descriptive analytics juggles raw data from multiple data sources to give valuable insights into
the past. However, these findings simply signal that something is wrong or right, without
explaining why. For this reason, highly data-driven companies do not content themselves with
descriptive analytics only, and prefer combining it with other types of data analytics.
• Eg:A manufacturer was able to decide on focus product categories based on the analysis of
revenue, monthly revenue per product group, income by product group, total quality of metal
parts produced per month.
2. Diagnostic analytics:
• At this stage, historical data can be measured against other data to answer the question of why
something happened.
• Companies go for diagnostic analytics, as it gives a deep insight into a particular problem. At
the same time, a company should have detailed information at their disposal; otherwise data
collection may turn out to be individual for every issue and time-consuming.
• Eg: Let’s take another look at the examples from different industries: a healthcare provider
compares patients’ response to a promotional campaign in different regions; a retailer drills the
sales down to subcategories.
3. Predictive analytics:
• Predictive analytics tells what is likely to be happen. It uses the findings of descriptive and
diagnostic analytics to detect tendencies, clusters and exceptions, and to predict future trends,
which makes it avaluable tool for forecasting.
• Despite numerous advantages that predictive analytics brings, it is essential to understand that
forecasting is just an estimate, the accuracy of which highly depends on data quality and
stability of the situation, so it requires a careful treatment and continuous optimization.
• Eg: A management team can weigh the risks of investing in their company’s expansion based
on cash flow analysis and forecasting. Organizations like Walmart, Amazon and other retailers
leverage predictive analytics to identify trends in sales based on purchase patterns of
customers, forecasting customer behavior, forecasting inventory levels, predicting what
products customers are likely to purchase together so that they can offer personalized
recommendations, predicting the amount of sales at the end of the quarter or year.
4. Prescriptive analytics
• The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate
a future problem or take full advantage of a promising trend.
• Prescriptive analytics is a combination of data, mathematical models and various business rules.
• The data for prescriptive analytics can be both internal (within the organization) and external
(like social media data).
• Besides, prescriptive analytics uses sophisticated tools and technologies, like machine
learning, business rules and algorithms, which make it sophisticated to implement and manage.
That is why, before deciding to adopt prescriptive analytics, a company should compare
required efforts vs. an expected added value.
• Prescriptive analytics are comparatively complex in nature and many companies are not yet
using them in day-to-day business activities, as it becomes difficult to manage. Large scale
organizations use prescriptive analytics for scheduling the inventory in the supply chain,
optimizing production, etc. to optimize customer experience.
• An example of prescriptive analytics: a multinational company was able to identify

opportunities for repeat purchases based on customer analytics and sales history.
*Descriptive and diagnostic analytics help you construct a narrative of the past while
predictive and prescriptive analytics help you envision a possible future.
Need for Big Data Analytics:
The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and
unearthed information that could be used for future decisions, today that business can identify
insights for immediate decisions. The ability to work faster – and stay agile – gives organizations
a competitive edge they didn’t have before.
Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher
profits and happier customers in the following ways:
1. Cost reduction: Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data – plus they can
identify more efficient ways of doing business.
2. Faster, better decision making: With the speed of Hadoop and in-memory analytics,
combined with the ability to analyze new sources of data, businesses are able to analyze
information immediately – and makedecisions based on what they’ve learned.
3. New products and services: With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with
big data analytics, more companies are creating new products to meet customers’ needs.
4. End Users Can Visualize Data: While the business intelligence software market is relatively
mature, a bigdata initiative is going to require next-level data visualization tools, which present
BI data in easy-to-read charts, graphs and slideshows. Due to the vast quantities of data being
examined, these applications must be able to offer processing engines that let end users query
and manipulate information quickly—even inreal time in some cases.
History of Hadoop
Hadoop is an Apache Software Foundation-managed open- source framework developed in
Java for storing and analysingmassive information on commodity hardware clusters. There are
primarily two issues with big data. The first is to store such amassive quantity of data, and the
second is to process it. Thus,Hadoop serves as a solution to the issue of big data, namely the
storage and processing of large amounts of data with specificadditional capabilities. Hadoop
is composed chiefly of HadoopDistributed File System (HDFS) and Yet Another Resource
Negotiator (YARN).
Hadoop's Historical Background
Hadoop was originated in 2002 and founded by Doug Cutting andMike Cafarella as part of their
work on the Apache Nutch project. The Apache Nutch project was tasked with developing a
searchengine system capable of indexing one billion documents. Afterdoing extensive study on
Nutch, they determined that such asystem would cost roughly half a million dollars in hardware
anda monthly operating cost of approximately $30 000, which is rather costly. As a result, they
discovered that their project designwould not cope with the billions of online pages. As a result, they
sought a practical solution to minimise the implementation cost and store and process massive
datasets.
In 2003, they discovered a document describing the design ofGoogle's distributed file system,
GFS (Google File System), whichGoogle released to store massive data collections. They now see
that this research can resolve their storage of huge files created by web crawling and indexing
operations. However, this researchprovided just a partial answer to their difficulty. In 2004, Google
produced another article on MapReduce technology used to handle such massive datasets. For
Doug Cutting and Mike Cafarella, this report was another half-solution to their Nutch project.
Both approaches (GFS and MapReduce) were previouslyonly available as white papers at Google.
Google did not use anyof these approaches. Doug Cutting recognised through his work on Apache
Lucene (a free and open-source information retrieval software library that Doug Cutting first
wrote in Java in 1999)that open-source is an excellent approach to sharing technology with a
broader audience. As a result, he began working with Mike Cafarella on open-source
implementations of Google's algorithms(GFS & MapReduce) in the Apache Nutch project.
Cutting discovered in 2005 that Nutch is confined to clusters of between 20 and 40 nodes. He
quickly saw two issues:
Nutch would not reach its full potential until it could run stably onmore prominent clusters (b), which
seemed unachievable with just two workers (Doug Cutting & Mike Cafarella). The engineering
work in the Nutch project was far more than he anticipated. As aresult, he began looking for work
with a firm willing to invest in their efforts. Moreover, he discovered that Yahoo! has a sizable
engineering staff ready to work on this project.
Thus, Doug Cutting joined Yahoo in 2006 and the Nutch project.With the assistance of Yahoo,
he wanted to present the world with an open-source, dependable, and scalable computing
architecture. Thus, the first separate the distributed computing components of Nutch and
establishes a new project at Yahoo.Hadoop (He provided the name Hadoop since it was the name
of ayellow toy elephant that Doug Cutting's kid had. because it was simple to say and was a one-
of-a-kind term.) Now he desired to optimise Hadoop's performance on hundreds of nodes. As a
result,he began working on Hadoop using GFS and MapReduce.
Yahoo began utilising Hadoop in 2007 after successfully testing iton a 1000-node cluster. In January
2008, Yahoo donated Hadoopto the Apache Software Foundation as an open-source project
(Apache Software Foundation). Additionally, in July 2008, the Apache Software Foundation
successfully tested Hadoop on a4000-node cluster. Hadoop was successfully tested in 2009 for
sorting a PB (PetaByte) of data in less than 17 hours for processingbillions of queries and indexing
millions of web pages. Moreover, Doug Cutting Departed Yahoo to join Cloudera to take on the task
of bringing Hadoop to new sectors.
Fig. 2: Hadoop History
▪ Apache Hadoop version 1.0 was published by the ApacheSoftware Foundation in December
2011.
▪ Additionally, Version 2.0.6 was released in August 2013.
▪ Furthermore, as of December 2017, we have ApacheHadoop version 3.0.
Apache Hadoop
Apache Hadoop is a free and open-source platform for storing and

processing massive information ranging in size from gigabytes topetabytes. Rather than storing
and processing data on a singlecolossal computer, Hadoop enables clustering many computers to
analyse enormous datasets in parallel.
Four significant modules of Hadoop:
HDFS — A distributed file system that works on commodity or low- end hardware. HDFS
outperforms conventional file systems indata performance, fault tolerance, and native support for
massivedatasets.
YARN — manages and monitors cluster nodes and resource utilisation. It automates the
scheduling of jobs and tasks.
MapReduce — A framework that enables parallel computing on data by programmes. The map
job turns the input data into a dataset calculated in key-value pairs. It is reducing tasks that consume
the output of the map task in order to aggregate it andproduce the desired result.
Hadoop Common — Provides a set of shared Java libraries utilisedby all modules.
How Hadoop Operates
Hadoop simplifies the process of using all the data storage capacity available in cluster
computers and executing distributedalgorithms against massive volumes of data. Hadoop offers
the
foundation for the development of additional services and applications.
Applications that gather data in various forms may upload data tothe Hadoop cluster by connecting
to the NameNode through anAPI function. The NameNode maintains the directory structure of
each file and the location of "chunks" for each file, which is duplicated among DataNodes. To
launch a job that queries thedata, supply a MapReduce job consisting of several maps and
reduce jobs executing on the data stored in HDFS across the DataNodes. Each node executes
map tasks against the specified input files, while reducers execute to aggregate and arrange the
final output.
Due to Hadoop's flexibility, the ecosystem has evolved tremendously over the years. Today,
the Hadoop ecosystem comprises a variety of tools and applications that aid in the collection,
storage, processing, analysis, and management of large amounts of data. Several of the most
prominent uses include the following:
Spark — An open-source distributed processing technology oftenused to handle large amounts

of data. Apache Spark providesgeneral batch processing, streaming analytics, machine learning,
graph databases, and ad hoc queries through in-memory cachingand efficient execution.
Presto — A distributed SQL query engine geared for low-latency, ad-hoc data processing. It
adheres to the ANSI SQL standard, which includes sophisticated searches, aggregations, joins, and
window functions. Presto can handle data from various sources,
Analyzing data with UNIX

Through the use of Unix tools: Software developers can quickly explore and modify code, data,
and tests. IT professionals can scrutinize log files, network traces, performance figures,
filesystems and the behavior of processes. Data analysts can extract, transform, filter, process,
load, and summarize huge data sets.
Data Transformation. These programs are useful for changing the format of data files, for
transforming data, and for filtering unwanted data. In addition, one program i9 useful for
monitoring the progress of the data transformations.
Data Validation. These include programs for checking the number of lines and columns in data
files, their types (e.g., alphanumeric, integer), and their ranges.
Descriptive Statistics. These procedures include both numerical statistics, and simple graphical
displays. There are procedures for single distributions, paired data, and multivariate cases.
Inferential Statistics. These include multivariate linear regression and analysis of variance. Some
simple inferential statistics are also incorporated into the descriptive statistics programs, but are
used less often.
Table of Programs Described

Pre-Processing Programs
ABUT abut files
DM conditional transformations of data
IO control/monitor file input/output
TRANSPOSE transpose matrix type file
VALIDATA verify data file consistency
Analysis Programs
ANOVA anova with repeated measures
BIPLOT bivariate plotting + summary statistics
CORR linear correlation + summary statistics
DESC statistics, frequency tables, histograms
PAIR bivariate statistics + scatterplots
REGRESS multivariate linear regression
To understand how to work with Unix, data — Weather Dataset is used.

Weather sensors gather information consistently in numerous areas over the globe and assemble an
enormous volume of log information, which is a decent possibility for investigation with
MapReduce in light of the fact that is required to process every one of the information, and the
information is record-oriented and semi-organized.
The information utilized is from the National Climatic Data Center or NCDC. The data is in line-
oriented ASCII format & and the line is a record. Data files are organized by date and weather
station.
Structure of NCDC record
Structure Of Weather data set. (Credits: Hadoop The Definitive Guide, Third Edition by Tom
White)So now we’ll find out the highest recorded global temperature in the dataset (for each year)
using Unix. The classic tool for processing line-oriented data is awk.
Small script to find the maximum temperature for each year in NCDC data
To understand
what’s actually happening in the above script let’s break down the script and its functionality:
1. The script begins with the shebang #!/usr/bin/env bash, which specifies the interpreter to be
used for executing the script (in this case, Bash).
2. The script uses a for loop to iterate over the files in the all/ directory. Each file corresponds
to a year's weather records.
3. Within the loop, echo is used to print the year by extracting it from the filename using
the basename command.
4. The gunzip -c $year command decompresses the file and outputs its contents to the standard
output.
5. The output of gunzip is then piped (|) to the awk command for further processing.
6. The awk script inside the curly braces {} performs the data extraction and analysis. It extracts
two fields from each line of the data: the air temperature and the quality code.
7. The extracted air temperature is converted into an integer by adding 0 (temp = substr($0, 88,
5) + 0).
8. Conditions are checked to determine if the temperature is valid and the quality code indicates
a reliable reading. Specifically, it checks if the temperature is not equal to 9999 (which
represents a missing value) and if the quality code matches the pattern [01459].
9. If the temperature passes the validity check and is greater than the current maximum
temperature (temp > max), the max variable is updated with the new maximum value.
10. After processing all the lines in the file, the END block is executed, and it prints the
maximum temperature found (print max).
11. The loop continues for each file in the all/ directory, resulting in the maximum temperature
for each year being printed.
The output of running the script will display the year and its corresponding maximum
temperature, such as below:
→ The temperature values in the source file are scaled by a factor of 10, so a value of 317
corresponds to a maximum temperature of 31.7°C for the year 1901.
→ This script serves as a baseline for performance comparison and demonstrates how Unix tools
like awk can be utilized for data analysis tasks without relying on Hadoop or other distributed
computing frameworks.
→ The script’s author mentions that the complete run for the entire century took 42 minutes on a
single EC2 High-CPU Extra Large Instance.
Why we should parallelize the above work?

· To speed up processing we need to run parts of the program in parallel.
· The challenges and considerations when attempting to speed up the processing of data by running
parts of the program in parallel are:
1) Dividing the work into equal-size pieces: When processing data in parallel, dividing the work
into equal-size pieces is not always straightforward. File size for different years varies.
2) Combining the results from independent processes: combining the results from independent
processes may need further processing. In this case, the result for each year is independent of other
years and may be combined by concatenating all the results and sorting by year. This gets very
complicated.
3) Processing capacity of a single machine: Even when parallelizing the processing, the overall
performance is still limited by the processing capacity of a single machine. If the best time
achieved is 20 minutes using the available processors, further improvement may not be possible on
that machine alone. Additionally, some datasets may grow beyond the capacity of a single machine,
requiring the use of multiple machines for processing.
So, though it’s feasible to parallelize the processing, in practice it’s messy. Using a framework
like Hadoop to take care of these issues is a great help.
Hadoop is an open-source software framework that facilitates distributed storage and processing of
large data sets across clusters of computers using simple programming models. In this article, we
will discuss how Hadoop can be used to analyze data.
Data Analysis with Hadoop

Hadoop provides several tools that can be used for data analysis, including Hadoop Distributed File
System (HDFS), MapReduce, Pig, and Hive.
1.Hadoop Distributed File System (HDFS):

HDFS is a distributed file system that is designed to store and managelarge data sets reliably and
efficiently. It is built to run on commodity hardware, making it cost-effective for data storage.
2.MapReduce:
MapReduce is a programming model for processing large data sets with a parallel, distributed
algorithm on a cluster. It consists of two phases:the map phase and the reduce phase. In the map
phase, the input data is split into smaller chunks and processed by different nodes in the
cluster. In the reduce phase, the results of the map phase are combined
to produce the final output.
3.Pig:
Pig is a high-level platform for creating MapReduce programs used for analyzing large data sets. It
provides a high-level language called Pig Latin for expressing data analysis programs.
4.Hive:
Hive is a data warehousing framework built on top of Hadoop. It provides a SQL-like language
called HiveQL for querying and analyzing data stored in Hadoop.
Steps for Data Analysis with Hadoop:
1.Data Ingestion:
The first step in data analysis with Hadoop is to ingest data into HDFS. The data can be ingested
using various tools, including Sqoop, Flume, and Kafka.
2.Data Processing:
Once the data is ingested into HDFS, it can be processed using various tools, including MapReduce,
Pig, and Hive.
3.Data Analysis:
After the data is processed, it can be analyzed using various tools, including HiveQL and Pig Latin.
4.Data Visualization:
The final step in data analysis with Hadoop is data visualization. Various tools, including Tableau
and Excel, can be used to visualize the results of data analysis.
Conclusion:
Hadoop provides a powerful framework for analyzing large data sets. With tools like HDFS,
MapReduce, Pig, and Hive, users can ingest, process, and analyze large data sets in a distributed
environment. The framework is cost-effective and scalable, making it a popular choice for
organizations dealing with large volumes of data.
Analysing Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express our query as
a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of
machines.
MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: the map function and the reduce
function. The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of the line
from the beginning of the file, but as we have no need for this, we ignore it.
To visualize the way the map works, consider the following sample lines of input data (some
unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004…9999999N9+00001+99999999999…
0043011990999991950051512004…9999999N9+00221+99999999999…
0043011990999991950051518004…9999999N9–00111+99999999999…
0043012650999991949032412004…0500001N9+01111+99999999999…
0043012650999991949032418004…0500001N9+00781+99999999999…
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004…9999999N9+00001+99999999999…) (106,
0043011990999991950051512004…9999999N9+00221+99999999999…) (212,
0043011990999991950051518004…9999999N9–00111+99999999999…) (318,
0043012650999991949032412004…0500001N9+01111+99999999999…) (424,
0043012650999991949032418004…0500001N9+00781+99999999999…)
The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature (indicated in bold text), and emits them as
its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to
the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the
example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do
now is iterate through the list and pick up the maximum reading:
(1949, 111)(1950, 22)
This is the final output: the maximum global temperature recorded in each year.
HADOOP ECOSYSTEM
➢ Hadoop Ecosystem is neither a programming language nor a service; it is a platform or

framework which solves big data problems. You can consider it as a suite which encompasses a
number of services (ingesting, storing, analyzing and maintaining) inside it. Let us discuss and
get a brief idea about how the services work individually and in collaboration.
➢ The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable
home forbig data activity that reflects your specific needs and tastes.
➢ The Hadoop ecosystem includes both official Apache open source projects and a wide range
of commercial tools and solutions.
➢ Below are the Hadoop components, that together form a Hadoop ecosystem,
✓ HDFS -> Hadoop Distributed File System
✓ YARN -> Yet Another Resource Negotiator
✓ MapReduce -> Data processing using programming
✓ Spark -> In-memory Data Processing
✓ PIG, HIVE-> Data Processing Services using Query (SQL-like)
✓ HBase -> NoSQL Database
✓ Mahout, Spark MLlib -> Machine Learning
✓ Apache Drill -> SQL on Hadoop
✓ Zookeeper -> Managing Cluster
✓ Oozie -> Job Scheduling
✓ Flume, Sqoop -> Data Ingesting Services
✓ Solr&Lucene -> Searching & Indexing
Apache open
source Hadoop ecosystem elements:
Spark, Pig, and Hive are three of the best-known Apache Hadoop projects. Each is used to
create applications to process Hadoop data.
• Spark: Apache Spark is a framework for real time data analytics in a distributed computing
environment. It executes in-memory computations to increase speed of data processing over
Map- Reduce.
• Hive: Facebook created HIVE for people who are fluent with SQL. Basically, HIVE is a data
warehousing component which performs reading, writing and managing large data sets in a
distributed environment using SQL-like interface. The query language of Hive is called Hive
Query Language (HQL), which is very similar like SQL. HIVE + SQL = HQL. It provides
tools for ETL operations and brings some SQL-like capabilities to the environment.
• Pig: Pig is a procedural language for developing parallel processing applications for large data
sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce, and
automatically generates MapReduce functions. Pig includes Pig Latin, which is a scripting
language. Pig translates Pig Latin scripts into MapReduce, which can then run on YARN and
process data in the HDFS cluster.
• HBase: HBase is a scalable, distributed, NoSQL database that sits atop the HFDS. It was
designed to store structured data in tables that could have billions of rows and millions of
columns. It has been deployed to power historical searches through large data sets, especially
when the desired data is contained within a large amount of unimportant or irrelevant data (also
known as sparse data sets).
• Oozie: Oozie is the workflow scheduler that was developed as part of the Apache Hadoop
project. It manages how workflows start and execute, and also controls the execution path.
Oozie is a server- based Java web application that uses workflow definitions written in hPDL,
which is an XML Process Definition Language similar to JBOSS JBPM jPDL.
• Sqoop: Sqoop is bi-directional data injection tool. Think of Sqoop as a front-end loader for
big data. Sqoop is a command-line interface that facilitates moving bulk data from Hadoop
into relational databases and other structured data stores. Using Sqoop replaces the need to
develop scripts to export and import data. One common use case is to move data from an
enterprise data warehouse to a Hadoop cluster for ETL processing. Performing ETL on the
commodity Hadoop cluster is resource efficient, while Sqoop provides a practical transfer
method.
• Ambari – A web-based tool for provisioning, managing, and monitoring Apache Hadoop
clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog,
HBase, ZooKeeper, Oozie, Pig, and Sqoop.
• Flume – A distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of streaming event data.
• Mahout – A scalable machine learning and data mining library.
• Zookeeper – A high-performance coordination service for distributed applications.
The ecosystem elements described above are all open source Apache Hadoop projects. There
arenumerous commercial solutions that use or support the open source Hadoop projects.
Hadoop Distributions
➢ Hadoop is an open-source, catch-all technology solution with incredible scalability,
low cost storage systems and fast paced big data analytics with economical server costs.
➢ Hadoop Vendor distributions overcome the drawbacks and issues with the open source
edition of Hadoop. These distributions have added functionalities that focus on:
• Support:
Most of the Hadoop vendors provide technical guidance and assistance that makes it easy for
customers to adopt Hadoop for enterprise level tasks and mission critical applications.
• Reliability:
Hadoop vendors promptly act in response whenever a bug is detected. With the intent to
make commercial solutions more stable, patches and fixes are deployed immediately.
• Completeness:
Hadoop vendors couple their distributions with various other add-on tools which help
customerscustomize the Hadoop application to address their specific tasks.
• Fault Tolerant:
Since the data has a default replication factor of three, it is highly available and fault-tolerant.
Here is a list of top Hadoop Vendors who play a key role in big data market growth
➢ Amazon Elastic MapReduce
➢ Cloudera CDH Hadoop Distribution
➢ Hortonworks Data Platform (HDP)
➢ MapR Hadoop Distribution
➢ IBM Open Platform (IBM Infosphere Big insights)
➢ Microsoft Azure's HDInsight -Cloud based Hadoop Distribution
Advantages of Hadoop:
The increase in the requirement of computing resources has made Hadoop a viable and
extensively used programming framework. Modern day organizations can learn Hadoop and
leverage their knowhow of managing processing power of their businesses.
1. Scalable: Hadoop is a highly scalable storage platform, because it can stores and
distribute very large data sets across hundreds of inexpensive servers that operate in parallel.
Unlike traditional relational database systems (RDBMS) that can’t scale to process large
amounts of data, Hadoop enables businesses to run applications on thousands of nodes
involving many thousands of terabytes of data.
2. Cost effective: Hadoop also offers a cost effective storage solution for businesses’
exploding data sets. The problem with traditional relational database management systems is
that it is extremely cost prohibitive to scale to such a degree in order to process such massive
volumes of data. In an effort to reduce costs, many companies in the past would have had to
down-sample data and classify it based on certain assumptions as to which data was the most
valuable. The raw data would be deleted, as it would be too cost-prohibitive to keep. While
this approach may have worked in the short term, this meant that when business priorities
changed, the complete raw data set was not available, as it was too expensive tostore.
3. Flexible: Hadoop enables businesses to easily access new data sources and tap into
different types of data (both structured and unstructured) to generate value from that data. This
means businesses can use Hadoop to derive valuable business insights from data sources such
as social media, email conversations. Hadoop can be used for a wide variety of purposes, such
as log processing, recommendation systems, data warehousing, market campaign analysis and
fraud detection.
4. Speed of Processing: Hadoop’s unique storage method is based on a distributed

file system that basically ‘maps’ data wherever it is located on a cluster. The tools for data
processing are often on the same servers where the data is located, resulting in much faster
data processing. If you’re dealing with large volumes of unstructured data, Hadoop is able to
efficiently process terabytes of data in just minutes, and petabytes in hours.
Resilient to failure: A key advantage of using Hadoop is its fault tolerance. When data is sent to
an individual node, that data is also replicated to other nodes in the cluster, which means that in
the event of failure, there is another copy available for use
IBM Big Data Strategy

IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data
strategy.
Where the company offered solutions to store, manage, and analyze the huge amounts of data
generated daily and equipped large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services would help its clients
become more competitive and drive growth.
Issues :
· Understand the concept of Big Data and its importance to large, medium, and small
companies in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the various issues and
challenges associated with this.
· Analyze the Big Data strategy of IBM.
· Explore ways in which IBM’s Big Data strategy could be improved further.
Introduction to InfoSphere
InfoSphere Information Server provides a single platform for data integration and governance.
The components in the suite combine to create a unified foundation for enterprise information
architectures, capable of scaling to meet any information volume requirements.
You can use the suite to deliver business results faster while maintaining data quality and integrity
throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate to understand the
meaning, structure, and content of information across a wide variety of sources.
By using InfoSphere Information Server, your business can access and use information in new
ways to drive innovation, increase operational efficiency, and lower risk.
BigInsights :
BigInsights is a software platform for discovering, analyzing, and visualizing data from disparate
sources.The flexible platform is built on an Apache Hadoop open-source framework that runs in
parallel on commonly available, low-cost hardware.
Big Sheets :
BigSheets is a browser-based analytic tool included in the InfoSphere BigInsights Console that
you use to break large amounts of unstructured data into consumable, situation-specific business
contexts.
These deep insights help you to filter and manipulate data from sheets even further.

Unit I - Big Data Programming

Uploaded by

Copyright:

Available Formats

Unit I - Big Data Programming

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit I - Big Data Programming

Uploaded by

Copyright:

Available Formats

INTRODUCTION TO BIG DATA AND HADOOP

Types of Digital Data

It is comprised of all our quantitative data:

What is Big Data?

Examples of Big Data.

Discovering consumer shopping habits

Data discrimination: When everything is known, will it become acceptable to discriminate

Three attributes stand out as defining Big Data characteristics:

Fig. 1: Big data V's

This can be extended to add a fourth and fifth V’s

Big Data Analytics

• An example of prescriptive analytics: a multinational company was able to identify

Need for Big Data Analytics:

Fig. 2: Hadoop History

Apache Hadoop is a free and open-source platform for storing and

Four significant modules of Hadoop:

How Hadoop Operates

Spark — An open-source distributed processing technology oftenused to handle large amounts

Analyzing data with UNIX

Table of Programs Described

To understand how to work with Unix, data — Weather Dataset is used.

Why we should parallelize the above work?

Data Analysis with Hadoop

1.Hadoop Distributed File System (HDFS):

Steps for Data Analysis with Hadoop:

Analysing Data with Hadoop

➢ Hadoop Ecosystem is neither a programming language nor a service; it is a platform or

4. Speed of Processing: Hadoop’s unique storage method is based on a distributed

IBM Big Data Strategy

You might also like