UNIT 1
Semi-structured Data
Semi-structured data is the third type of big data. It pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. This concludes the types of data; let us now discuss the characteristics of big data.
Characteristics of Big Data
Big Data can be defined by one or more of three characteristics:
1. Volume
2. Variety
3. Velocity
Big data is defined as "high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
1) Variety
Variety of Big Data refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more. Variety is one of the important characteristics of big data.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
3) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge ‘volumes’ of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This concludes the characteristics of big data.
Advantages of Big Data:
➨Big data analysis derives innovative solutions. Big data analysis helps in understanding and
targeting customers. It helps in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health with availability of record of patients.
➨It helps in financial trading, sports, polling, security/law enforcement etc.
➨Anyone can access vast amounts of information via surveys and get answers to any query.
➨Additions to the data are made every second.
➨A single platform can carry virtually unlimited information.
Data Storage and processing:- Data processing is the part of data management that enables the creation of valid, useful information from the collected data. Data processing includes classification, computation, coding and updating. Data storage refers to keeping data in the most suitable format and on the best available medium.
Data analysis:- Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and to make decisions based upon that analysis.
Data Analysis Tools:-
Statistical Analysis
Statistical Analysis answers "What happened?" by using past data in the form of dashboards. Statistical Analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses a complete set of data or a sample of data. There are two categories of this type of analysis: Descriptive Analysis and Inferential Analysis.
Descriptive Analysis
Descriptive analysis analyses complete data or a sample of summarized numerical data. It shows the mean and standard deviation for continuous data, and percentages and frequencies for categorical data.
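As a small illustration of descriptive analysis, the Python sketch below uses only the standard library; the sample values are hypothetical. It computes the mean and standard deviation of a continuous variable, and the frequency and percentage of a categorical variable.
# Descriptive analysis sketch (hypothetical sample data).
import statistics
from collections import Counter

# Continuous data: mean and standard deviation.
ages = [23, 35, 31, 44, 28, 39, 52, 30]
print("mean age:", statistics.mean(ages))
print("standard deviation:", statistics.stdev(ages))

# Categorical data: frequency and percentage.
genres = ["drama", "comedy", "drama", "action", "comedy", "drama"]
counts = Counter(genres)
for genre, freq in counts.items():
    print(genre, freq, f"{100 * freq / len(genres):.1f}%")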
Inferential Analysis
Inferential analysis analyses a sample drawn from the complete data. With this type of analysis, you can reach different conclusions from the same data by selecting different samples.
Diagnostic Analysis
Diagnostic Analysis answers "Why did it happen?" by finding the causes behind the insights found in Statistical Analysis. This analysis is useful for identifying behavior patterns in data. If a new problem arises in your business process, you can look into this analysis to find similar patterns of that problem, and there is a chance that similar prescriptions can be applied to the new problem.
Predictive Analysis
Predictive Analysis shows "what is likely to happen" by using previous data. The simplest example: if last year I bought two dresses based on my savings, and this year my salary doubles, then I can buy four dresses. Of course it is not that easy, because you have to consider other circumstances, such as the chance that clothing prices increase this year, or that instead of dresses you want to buy a new bike, or you need to buy a house.
So this analysis makes predictions about future outcomes based on current or past data. Forecasting is just an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it.
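As a toy illustration of this idea (the yearly spending figures below are made up), the Python sketch fits a simple linear trend to past data and extrapolates it one year ahead. Real predictive models use many more variables, as the text notes.
# Minimal predictive sketch: least-squares linear trend (hypothetical data).
def linear_forecast(years, values):
    """Fit y = a + b*x by least squares and predict the next year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values)) / \
        sum((x - mean_x) ** 2 for x in years)
    a = mean_y - b * mean_x
    next_year = max(years) + 1
    return next_year, a + b * next_year

# Hypothetical yearly spending on clothes.
years = [2018, 2019, 2020, 2021]
spend = [200, 240, 260, 310]
print(linear_forecast(years, spend))   # prints (2022, 340.0)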
Prescriptive Analysis
Prescriptive Analysis combines the insights from all the previous analyses to determine which action to take for a current problem or decision. Most data-driven companies utilize Prescriptive Analysis because predictive and descriptive analysis alone are not enough to improve performance based on data. Based on current situations and problems, they analyze the data and make decisions.
NoSQL Database:
NoSQL databases (aka "not only SQL") are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph.
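A minimal, database-agnostic sketch of how the same record might look in each of these data models, using plain Python structures purely for illustration (no specific NoSQL product or API is assumed):
# One customer record expressed in different NoSQL data models (conceptual only).

# Document model: a self-contained, possibly nested record (as in a document store).
document = {
    "_id": "cust-42",
    "name": "Asha",
    "orders": [{"item": "phone", "qty": 1}, {"item": "case", "qty": 2}],
}

# Key-value model: an opaque value looked up by a key.
key_value = {"cust-42": '{"name": "Asha"}'}

# Wide-column model: rows identified by a key, each holding a flexible set of columns.
wide_column = {"cust-42": {"profile:name": "Asha", "orders:count": 3}}

# Graph model: nodes and labelled edges (relationships).
nodes = {"cust-42": {"name": "Asha"}, "prod-7": {"name": "phone"}}
edges = [("cust-42", "BOUGHT", "prod-7")]

print(document["orders"][0]["item"], edges[0][1])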
1). Walmart
Walmart leverages Big Data and Data Mining to create personalized product recommendations for its customers. With the help of these two emerging technologies, Walmart can uncover valuable patterns showing the most frequently bought products, the most popular products, and even the most popular product bundles (products that complement each other and are usually purchased together). Based on these insights, Walmart creates attractive and customized recommendations for individual users. By effectively implementing Data Mining techniques, the retail giant has substantially increased its conversion rates and improved its customer service. Furthermore, Walmart uses Hadoop and NoSQL technologies to allow customers to access real-time data accumulated from disparate sources.
2). American Express
The credit card giant leverages enormous volumes of customer data to identify indicators that
could depict user loyalty. It also uses Big Data to build advanced predictive models for analyzing
historical transactions along with 115 different variables to predict potential customer churn.
Thanks to Big Data solutions and tools, American Express can identify 24% of the accounts that
are highly likely to close in the upcoming four to five months.
3). Uber
Uber is one of the major cab service providers in the world. It leverages customer data to track
and identify the most popular and most used services by the users. Once this data is collected,
Uber uses data analytics to analyze the usage patterns of customers and determine which services
should be given more emphasis and importance. Apart from this, Uber uses Big Data in another
unique way. Uber closely studies the demand and supply of its services and changes the cab fares accordingly. This is the surge pricing mechanism, which works something like this: when you are in a hurry and have to book a cab from a crowded location, Uber will charge you double the normal amount.
4). Netflix
Netflix is one of the most popular on-demand online video content streaming platforms used by
people around the world. Netflix is a major proponent of the recommendation engine. It collects
customer data to understand the specific needs, preferences, and taste patterns of users. Then it
uses this data to predict what individual users will like and create personalized content
recommendation lists for them. Today, Netflix has become so vast that it is even creating unique
content for users. Data is the secret ingredient that fuels both its recommendation engines and
new content decisions. The most pivotal data points used by Netflix include titles that users
watch, user ratings, genres preferred, and how often users stop the playback, to name a few.
Hadoop, Hive, and Pig are the three core components of the data structure used by Netflix.
2. Talent Gap in Big Data: It is difficult to follow media and analyst coverage of technology without being bombarded with content touting the value of big data analysis and the corresponding reliance on a wide range of disruptive technologies, such as NoSQL data management frameworks, in-memory analytics, and the broad Hadoop ecosystem. The reality is that there is a lack of skills available in the market for big data technologies. The typical expert has gained experience through tool implementation and its use as a programming model, apart from the broader big data management aspects.
3. Getting Data into the Big Data Structure: It might seem obvious that the intent of big data management involves analyzing and processing large amounts of data. Many people have raised expectations about analyzing huge data sets on a big data platform, yet they may not be aware of the complexity behind the transmission, access, and delivery of data and information from a wide range of sources and then loading these data into a big data platform. The intricate aspects of data transmission, access and loading are only part of the challenge. The requirement to navigate extraction and transformation is not limited to conventional relational data sets.
4. Syncing Across Data Sources: Once you import data into big data platforms, you may also realize that data copies migrated from a wide range of sources at different rates and on different schedules can rapidly get out of synchronization with the originating systems. This implies that the data coming from one source may be out of date compared to the data coming from another source. Synchronization also concerns the commonality of data definitions, concepts, metadata and the like. In traditional data management and data warehouses, the sequences of data transformation, extraction and migration all give rise to situations in which there is a risk of data becoming unsynchronized.
5. Extracting Information from the Data in Big Data Integration: The most practical use cases for big data involve data availability, augmenting existing data storage, and providing end users with access through business intelligence tools for the purpose of data discovery. These business intelligence tools must be able to connect to the different big data platforms and provide transparency to data consumers, eliminating the need for custom coding. At the same time, as the number of data consumers grows, there is a need to support an increasing collection of simultaneous user accesses. This demand may also spike at any time in reaction to different aspects of business process cycles. Ensuring right-time data availability to data consumers also becomes a challenge in big data integration.
6. Miscellaneous Challenges: Other challenges may occur while integrating big data. These include the integration of data, skill availability, solution cost, the volume of data, the rate of transformation of data, and the veracity and validity of data. Key among them is the ability to merge data that is not similar in source or structure, and to do so at a reasonable cost and in a reasonable time. It is also a challenge to process large amounts of data at a reasonable speed so that information is available to data consumers when they need it.
Sources of Big Data
1) Social data comes from the Likes, Tweets and Retweets, Comments, Video Uploads, and general media that are uploaded and shared via the world’s favorite social media platforms. This kind of
data provides invaluable insights into consumer behavior and sentiment and can be enormously
influential in marketing analytics. The public web is another good source of social data, and tools
like Google Trends can be used to good effect to increase the volume of big data.
2) Machine data is defined as information which is generated by industrial equipment, sensors
that are installed in machinery, and even web logs which track user behavior. This type of data is
expected to grow exponentially as the internet of things grows ever more pervasive and expands
around the world. Sensors in medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet of Things will deliver data of high velocity, value, volume and variety in the very near future.
3) Transactional data is generated from all the daily transactions that take place both online and offline. Invoices, payment orders, storage records, delivery receipts – all are characterized as transactional data. Yet such data alone is almost meaningless, and most organizations struggle to make sense of the data they are generating and how it can be put to good use.
Generation rate: traditional data is generated per hour or per day, whereas big data is generated far more rapidly (almost every second).
Examples of insights and questions that big data can help answer include:
• Activity and growth patterns
• Credit card fraud
• Whether someone is having an affair
• Designing roads to reflect traffic patterns and activity in different areas
• The probability of a heart attack or stroke
• Identifying process failures and security breaches
• Who you will vote for
• Whether you are an alcoholic
• How money is spent and how much money you make
• The outbreak of a virus
• Whether you are likely to commit a crime
• Driving patterns in a city
• Purchase patterns and products you are likely to buy
• Products in your home
• What you do for relaxation
• A good place to put a store or business
• How you use a website
• Brand loyalty and why people switch brands
Data Challenges
They say that necessity is the mother of all invention. That definitely holds true for data. Banks,
governments, insurance firms, manufacturing companies, health institutions, and retail
companies all realized the issues of working with these large volumes of data. Yet, it was the
Internet companies that were forced to solve it. Organizations such as Google, Yahoo!,
Facebook, and eBay were ingesting massive volumes of data that were increasing in size and
velocity every day, and to stay in business they had to solve this data problem. Google wanted to
be able to rank the Internet. It knew the data volume was large and would grow larger every day.
It went to the traditional database and storage vendors and saw that the costs of using their software licenses and storage technology were so prohibitive that they could not even be considered.
So Google realized it needed a new technology and a new way of addressing the data challenges.
Data problem
Google realized that if it wanted to be able to rank the Internet, it had to design a new
way of solving the problem. It started with looking at what was needed:
• Inexpensive storage that could store massive amounts of data cost effectively
• To scale cost effectively as the data volume continued to increase
• To analyze these large data volumes very fast
• To be able to correlate semi-structured and unstructured data with existing structured data
• To work with unstructured data that had many forms that could change frequently; for example, data structures from organizations such as Twitter can change regularly
Google also identified the problems:
• The traditional storage vendor solutions were too expensive.
• When processing very large volumes of data at the level of hundreds of terabytes and petabytes, technologies based on “shared block-level storage” were too slow and couldn’t scale cost effectively. Relational databases and data warehouses were not designed for the new level of scale of data ingestion, storage, and processing that was required. Today’s data scale requires a high-performance supercomputer platform that can scale at cost.
• The processing model of relational databases, which read data in 8K and 16K increments and then loaded it into memory to be accessed by software programs, was too inefficient for working with large volumes of data.
• The traditional relational database and data warehouse software licenses were too expensive for the scale of data Google needed.
• The architecture and processing models of relational databases and data warehouses were designed to handle transactions for a world that existed 30 to 40 years ago. These architectures and processing models were not designed to process the semi-structured and unstructured data coming from social media, machine sensors, GPS coordinates, and RFID. Solutions that address these challenges were so expensive that organizations wanted another choice.
• Business data latency needed to be reduced. Business data latency is the differential between the time when data is stored and the time when the data can be analyzed to solve business problems.
Google needed a large single data repository to store all the data. Walk into any large
organization and it typically has thousands of relational databases along with a number of
different data warehouse and business analysis solutions. All these data platforms stored
their data in their own independent silos. The data needed to be correlated and analyzed
with different datasets to maximize business value. Moving data across data silos is
expensive, requires lots of resources, and significantly slows down the time to business
insight.
To meet these challenges, Google needed:
• A data platform that could handle large volumes of data and be linearly scalable at cost and performance.
• A highly parallel processing model that was highly distributed to access and compute the data very fast.
• A data repository that could break down the silos and store structured, semi-structured, and unstructured data, making it easy to correlate and analyze the data together.
The original Google papers are still recommended reading because they lay down the foundation for the processing and storage model of Hadoop. These articles are also insightful because they define the business drivers and technical challenges Google wanted to solve.
The Necessity and Environment for Solving the Data Problem
The environment that solved the problem turned out to be Silicon Valley in California,
and the culture was open source. In Silicon Valley, a number of Internet companies had
to solve the same problem to stay in business, but they needed to be able to share and
exchange ideas with other smart people who could add the additional components.
Silicon Valley is unique in that it has a large number of startup and Internet companies
that by their nature are innovative, believe in open source, and have a large amount of
cross-pollination in a very condensed area. Open source is a culture of exchanging ideas
and writing software from individuals and companies around the world. Larger
proprietary companies might have hundreds or thousands of engineers and customers, but
open source has tens of thousands to millions of individuals who can write software and
download and test software.
Individuals from Google, Yahoo!, and the open source community created a solution for
the data problem called Hadoop. Hadoop was created for a very important reason—
survival. The Internet companies needed to solve this data problem to stay in business
and be able to grow.
Data Mining
Data mining is the process of discovering insights within a database. The aim of this is to
provide predictions and make decisions based on the data currently held.
Data Analysis
Once all the data has been collected it needs to be analyzed to look for interesting
patterns and trends. A good data analyst will spot something out of the ordinary, or
something that hasn’t been reported by anyone else.
Data Visualization
Perhaps the most important step is the visualization of the data. This is the part that takes all the work done previously and outputs a visualization that ideally anyone can understand. This can be done using libraries such as Plot.ly and d3.js, or software such as Tableau.
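As a small illustration (assuming the matplotlib library is installed; the monthly figures are made up), the Python sketch below turns a summarized result into a simple chart:
# Minimal visualization sketch (hypothetical data; requires matplotlib).
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures produced by an earlier analysis step.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 172, 190]

plt.bar(months, sales)                 # draw a simple bar chart
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.savefig("monthly_sales.png")       # write the chart to an image file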
Google File System (GFS)
In a system built from large numbers of commodity components, failures can occur at all times. Due to this, error detection and continuous monitoring become essential. The goal is to build software that helps in reducing the impact of these failures.
GFS mainly supports huge files. This is natural given the amount of data Google encounters and handles on a regular basis.
Most files are mutated by appending new data rather than overwriting data. Random
writes within a file do not exist. This helps in optimization as well as atomicity
guarantees.
The flexibility of the system is increased by co-designing the applications and the file
system API.
GFS supports the usual file operations, which include create, delete, read, write, open and close. Along with these, GFS has two additional operations: snapshot and record append.
Traditional append enables the writer to add or “append” to the end of the file. However,
this becomes complicated when two or more users want to append at the same time i.e.
concurrently. Normally, when such a situation arises, only one of the two append
operations is picked. However, for a system like the one Google uses, this can be time-
consuming as concurrent appends are encountered quite often. Let us take the example of
a user who searches the word “Universe.” There would be several web crawlers working
together on a file adding resources. Concurrent operations are bound to happen. In this
case, the results from multiple clients are merged together.
Snapshot makes a copy of a file or directory tree almost immediately while minimizing any interruptions to ongoing mutations. This is done to quickly create copies of a huge dataset or to checkpoint the current state so that future changes can be rolled back. The paper further explains how this works:
We use standard copy-on-write techniques to implement snapshots. When the master
receives a snapshot request, it first revokes any outstanding leases on the chunks in the
files it is about to snapshot. This ensures that any subsequent writes to these chunks will
require interaction with the master to find the leaseholder. This will give the master an
opportunity to create a new copy of the chunk first.
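The Python sketch below is only a conceptual illustration of the copy-on-write idea described above, not Google's implementation: a snapshot copies the file-to-chunk references, and a chunk is physically duplicated only when it is written to after the snapshot.
# Conceptual copy-on-write snapshot (illustration only, not the real GFS design).
chunks = {"c1": b"hello", "c2": b"world"}          # chunk id -> chunk data
files = {"/log": ["c1", "c2"]}                     # file path -> list of chunk ids

def snapshot(path):
    """A snapshot just copies the chunk references, not the chunk data."""
    return list(files[path])

def write(path, index, data):
    """Before mutating a shared chunk, create a private copy (copy-on-write)."""
    old_id = files[path][index]
    new_id = old_id + "'"
    chunks[new_id] = data                          # new physical chunk with new data
    files[path][index] = new_id                    # live file now points to the copy

snap = snapshot("/log")          # snapshot of /log taken here
write("/log", 0, b"HELLO")       # a later write triggers the copy
print([chunks[c] for c in files["/log"]])   # [b'HELLO', b'world'] (current file)
print([chunks[c] for c in snap])            # [b'hello', b'world'] (snapshot view)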
Hadoop Distributed File System (HDFS)
HDFS replicates each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. As a result, the data on nodes that crash can be found elsewhere within a cluster. This ensures that processing can continue while data is recovered.
HDFS uses master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single Name Node that managed file system operations and
supporting Data Nodes that managed data storage on individual compute nodes. The
HDFS elements combine to support applications with large data sets.
This master node "data chunking" architecture takes as its design guide elements from the Google File System (GFS), a proprietary file system outlined in Google technical papers, as well as IBM's General Parallel File System (GPFS), a format that boosts I/O by striping blocks of data over multiple disks, writing blocks in parallel. While HDFS is not Portable Operating System Interface (POSIX) model-compliant, it echoes POSIX design style in some aspects.
HDFS architecture centers on commanding Name Nodes that hold metadata and Data
Nodes that store information in blocks. Working at the heart of Hadoop, HDFS can
replicate data at great scale.
Why use HDFS?
The Hadoop Distributed File System arose at Yahoo as a part of that company's ad
serving and search engine requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were accessed by a growing number of users, who were creating more and more data. Facebook, eBay, LinkedIn and Twitter
are among the web companies that used HDFS to underpin big data analytics to address
these same requirements.
But the file system found use beyond that. HDFS was used by The New York Times as
part of large-scale image conversions, Media6Degrees for log processing and machine
learning, LiveBet for log storage and odds analysis, Joost for session analysis and Fox
Audience Network for log analysis and data mining. HDFS is also at the core of many
open source data warehouse alternatives, sometimes called data lakes.
Because HDFS is typically deployed as part of very large-scale implementations, support
for low-cost commodity hardware is a particularly useful feature. Such systems, running
web search and related applications, for example, can range into the hundreds
of petabytes and thousands of nodes. They must be especially resilient, as server failures
are common at such scale.
HDFS and Hadoop history
In 2006, Hadoop's originators ceded their work on HDFS and Map Reduce to the Apache
Software Foundation project. The software was widely adopted in big data analytics
projects in a range of industries. In 2012, HDFS and Hadoop became available in Version
1.0.
The basic HDFS standard has been continuously updated since its inception.
With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was
added, and Map Reduce and HDFS were effectively decoupled. Thereafter, diverse data
processing frameworks and file systems were supported by Hadoop. While Map Reduce
was often replaced by Apache Spark, HDFS continued to be a prevalent file system for
Hadoop.
After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available
in December 2017, with HDFS enhancements supporting additional Name Nodes, erasure
coding facilities and greater data compression. At the same time, advances in HDFS
tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer performance
testing tools, have expanded to enable development of ever larger HDFS
implementations.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed to use low-cost hardware.
HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the name node and data node help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Features of HDFS:
1) Reliability:- The Hadoop file system provides highly reliable data storage. It can store hundreds of petabytes of data. Data is stored in blocks, which are placed on nodes in racks within clusters; there can be any number of clusters, so data is reliably stored in the blocks. Replicas of these blocks are also created on different machines in the cluster for fault tolerance. Hence, data is quickly available to users without any loss.
2) Fault Tolerance:- Fault tolerance is how the system handles unfavorable situations. The Hadoop file system is highly fault tolerant because of its block-based design. The data in HDFS is divided into blocks, and multiple copies of each block are created on different machines. This replication is configurable and is done to avoid loss of data. If one block in a cluster goes down, the client can access the data from another machine holding a copy of that block.
HDFS has different racks on which replicas of blocks of data are created, so in case a machine fails, the user can access the data from a replica on a different rack on another slave node.
3) High Availability:- The Hadoop file system has high availability. The block architecture is designed to provide high availability of data. Block replication provides data availability when a machine fails. Whenever a client wants to access data, they can easily retrieve the information from the nearest node present in the cluster. At the time of a machine failure, data can be accessed from the replicated blocks present in another rack on another slave of the cluster.
Replication:- This feature is a unique and essential feature of the Hadoop file system. It was added to resolve data loss issues which occur due to hardware failure, crashing of nodes, etc. HDFS keeps creating replicas of blocks on different machines and regularly maintains the replication. The default replication factor is three, i.e., there are three copies of each block in the cluster.
Scalability:- The Hadoop file system is highly scalable. As the data grows, resource requirements such as CPU, memory, and disk in the cluster also increase; when the data volume is high, the number of machines in the cluster is increased as well.
Distributed Storage:- HDFS is a distributed file system. It stores files in the form of blocks of fixed size, and these blocks are stored across a cluster of several machines. HDFS follows a Master-Slave architecture in which the slave nodes (also called the Data Nodes) form the cluster, which is managed by the master node (also called the Name Node).
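A minimal Python sketch of the idea just described: splitting data into fixed-size blocks and placing replicated copies on distinct data nodes. The tiny block size and node names are illustrative only (HDFS uses block sizes such as 128 MB), and the round-robin placement is a simplification of HDFS's rack-aware placement.
# Illustrative only: split data into fixed-size blocks and assign replicas to nodes.
BLOCK_SIZE = 4            # bytes, tiny on purpose for the demo
REPLICATION = 3           # copies of each block (matches the HDFS default of three)
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Round-robin placement of each block's replicas on distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
print(len(blocks), "blocks")
print(place_replicas(len(blocks)))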
HDFS ARCHITECTURE
As mentioned earlier, HDFS follows a Master-Slave architecture in which the Master node is called the Name Node and the Slave node is called the Data Node. Name Node and Data Node(s) are the building blocks of HDFS.
There is exactly one Name Node and a number of Data Nodes. The Data Nodes contain the blocks of files in a distributed manner. The Name node has the responsibility of managing the blocks of files and the allocation/deallocation of memory for the file blocks.
Master/Name Node:- The Name node stores the metadata of the whole file system, which contains
information about where each block of file and its replica is stored, the number of blocks of data,
the access rights for different users of the file system for a particular file, date of creation, date of
modification, etc. All the Data nodes send a Heartbeat message to the Name node at a fixed
interval to indicate that they are alive. Also, a block report is sent to the Name node which
contains all the information about the file blocks on that particular Data node.
Edit Logs: It stores all the current changes made to the file system along with the
file, block, and data node on which the file block is stored.
The Name node is also responsible for maintaining the replication factor of the
block of files. Also, in case a data node fails, the Name node removes it from the
cluster, handles the reallocation of resources and redirects the traffic to another
data node.
Slave/Data Node:- Data node stores the data in the form of blocks of files. All the read-write
operations on files are performed on the data nodes and managed by the name node. All the data
nodes send a heartbeat message to the name node to indicate their health. The default interval for
that is set to 3 seconds, but it can be modified according to the need.
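The sketch below is a simplified, stand-alone simulation of this heartbeat idea (it is not Hadoop code): data nodes report in on a fixed interval, and the name node marks any node it has not heard from recently as dead. The 3-second default is shortened here so the example runs quickly.
# Simplified heartbeat bookkeeping (simulation only, not Hadoop's implementation).
import time

HEARTBEAT_INTERVAL = 0.5   # seconds (the HDFS default is 3 s, configurable)
DEAD_AFTER = 1.5           # mark a node dead if it has been silent this long

last_seen = {}             # data node name -> time of its last heartbeat

def heartbeat(node):
    """Record that a data node just reported in."""
    last_seen[node] = time.time()

def dead_nodes(now):
    """Nodes whose last heartbeat is older than the allowed silence."""
    return [n for n, t in last_seen.items() if now - t > DEAD_AFTER]

heartbeat("datanode1")
heartbeat("datanode2")
for _ in range(4):                 # datanode1 keeps reporting, datanode2 goes silent
    time.sleep(HEARTBEAT_INTERVAL)
    heartbeat("datanode1")
print("dead:", dead_nodes(time.time()))   # expected: ['datanode2']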
Name node
The name node is the commodity hardware that contains the GNU/Linux operating
system and the name node software. It is a software that can be run on commodity
hardware. The system having the name node acts as the master server and it does the
following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Data node
The data node is a commodity hardware having the GNU/Linux operating system and
data node software. For every node (Commodity hardware/System) in a cluster, there will
be a data node. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the name node.
Job Tracker
Job Tracker process runs on a separate node and not usually on a Data Node.
Job Tracker is an essential Daemon for Map Reduce execution in MRv1. It is replaced by
Resource Manager/Application Master in MRv2.
Job Tracker receives the requests for Map Reduce execution from the client.
Job Tracker talks to the Name Node to determine the location of the data.
Job Tracker finds the best Task Tracker nodes to execute tasks based on data locality (proximity of the data) and the available slots to execute a task on a given node (see the sketch after this list).
Job Tracker monitors the individual Task Trackers and submits the overall status of the job back to the client.
The Job Tracker process is critical to the Hadoop cluster in terms of Map Reduce execution.
When the Job Tracker is down, HDFS will still be functional, but Map Reduce execution cannot be started and the existing Map Reduce jobs will be halted.
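A small Python sketch of the data-locality idea described above (purely illustrative; the names and structures are not taken from Hadoop's code): prefer a Task Tracker on a node that already holds a replica of the block and has a free slot, otherwise fall back to any tracker with a free slot.
# Illustrative task assignment by data locality (not Hadoop's actual scheduler).
def pick_task_tracker(block_replicas, free_slots):
    """block_replicas: nodes holding the data; free_slots: node -> free task slots."""
    for node in block_replicas:                 # 1) prefer a node with a local replica
        if free_slots.get(node, 0) > 0:
            return node
    for node, slots in free_slots.items():      # 2) otherwise any node with a free slot
        if slots > 0:
            return node
    return None                                 # no capacity available right now

replicas = ["node2", "node4"]
slots = {"node1": 2, "node2": 0, "node3": 1, "node4": 1}
print(pick_task_tracker(replicas, slots))       # node4: local replica and a free slot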
Task Tracker
Task Tracker runs on Data Nodes, usually on all Data Nodes.
Task Tracker is replaced by Node Manager in MRv2.
Mapper and Reducer tasks are executed on Data Nodes administered by Task Trackers.
Task Trackers will be assigned Mapper and Reducer tasks to execute by the Job Tracker.
Task Tracker will be in constant communication with the Job Tracker signaling the
progress of the task in execution.
Task Tracker failure is not considered fatal. When a Task Tracker becomes unresponsive, the Job Tracker will assign the task it was executing to another node.
2) Pseudo-Distributed Mode (Single-Node Cluster)
• Mainly used for testing purpose
• Replication Factor will be ONE for Block
• Changes in configuration files will be required for all three files: mapred-site.xml, core-site.xml, hdfs-site.xml
3) Fully-Distributed Mode (Multi-Node Cluster)
This is the production mode of Hadoop where multiple nodes will be running. Here data
will be distributed across several nodes and processing will be done on each node.
Master and Slave services will be running on separate nodes in fully-distributed Hadoop mode.
• Production phase of Hadoop
• Separate nodes for master and slave daemons
• Data is distributed and processed across multiple nodes
In Hadoop development, each Hadoop mode has its own benefits and drawbacks. The fully distributed mode is certainly the one Hadoop is mainly known for, but there is no point in engaging those resources during the testing or debugging phase, so the standalone and pseudo-distributed Hadoop modes also have their own significance.
The elements of the app's XML configuration file are listed below.
1. widget: The app reverse domain value that we specified when creating the app.
2. name: The name of the app that we specified when creating the app.
3. description: Description for the app.
4. author: Author of the app.
5. content: The app's starting page. It is placed inside the www directory.
6. plug-in: The plug-ins that are currently installed.
7. access: Used to control access to external domains. The default origin value is set to *, which means that access is allowed to any domain. This value will not allow some specific URLs to be opened, to protect information.
8. allow-intent: Allows specific URLs to ask the app to open. For example, <allow-intent href = "tel:*" /> will allow tel: links to open the dialer.
9. platform: The platforms for building the app.
What is an XML file used for?
An XML file is an extensible markup language file, and it is used to structure data for
storage and transport. In an XML file, there are both tags and text. The tags provide the
structure to the data. The text in the file that you wish to store is surrounded by these tags,
which adhere to specific syntax guidelines.
XML code
Extensible Markup Language (XML) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable. ...
The design goals of XML emphasize simplicity, generality, and usability across the
Internet.
Latest version: 1.1 (Second Edition); September 29, 2006
Related standards: XML Schema
Domain: Data serialization
XML Example 1
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
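To show how the tags in the note example above provide structure that a program can read, here is a short Python sketch using the standard library's xml.etree.ElementTree parser (the XML text is embedded directly in the script for simplicity):
# Parse the note example and print each tag with the text it wraps.
import xml.etree.ElementTree as ET

note_xml = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

note = ET.fromstring(note_xml)     # parse the XML text into an element tree
for child in note:                 # each child tag holds one piece of the stored data
    print(child.tag, "->", child.text)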