UNIT 1
Semi-structured Data
Semi-structured data is the third type of big data. It pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. This concludes the types of data; let us now discuss the characteristics of big data.
Characteristics of Big Data
Big Data can be defined by one or more of three characteristics:
1. Volume
2. Variety
3. Velocity
Big data is defined as "high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
1) Variety
Variety of Big Data refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more. Variety is one of the important characteristics of big data.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
3) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge ‘volumes’ of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This concludes the characteristics of big data.
Advantages of Big Data:
➨Big data analysis derives innovative solutions. Big data analysis helps in understanding and
targeting customers. It helps in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health with availability of record of patients.
➨It helps in financial trading, sports, polling, security/law enforcement etc.
➨Anyone can access vast amounts of information via surveys and get answers to any query.
➨Additions to the data are made every second.
➨A single platform can carry virtually unlimited information.
Data Storage and processing:- Data processing is the part of data management that enables the creation of valid, useful information from the collected data. Data processing includes classification, computation, coding and updating. Data storage refers to keeping data in the most suitable format and on the best available medium.
Data analysis:- Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and to make decisions based upon that analysis.
Data Analysis Tools:-
Statistical Analysis
Statistical Analysis answers "What happened?" by using past data in the form of dashboards. Statistical Analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses a complete set of data or a sample of data. There are two categories of this type of analysis: Descriptive Analysis and Inferential Analysis.
Descriptive Analysis
Descriptive analysis analyses complete data or a sample of summarized numerical data. It shows the mean and standard deviation for continuous data, and percentages and frequencies for categorical data.
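As a small illustration of descriptive analysis, the Python sketch below uses only the standard library; the sample values are hypothetical. It computes the mean and standard deviation of a continuous variable, and the frequency and percentage of a categorical variable.
# Descriptive analysis sketch (hypothetical sample data).
import statistics
from collections import Counter

# Continuous data: mean and standard deviation.
ages = [23, 35, 31, 44, 28, 39, 52, 30]
print("mean age:", statistics.mean(ages))
print("standard deviation:", statistics.stdev(ages))

# Categorical data: frequency and percentage.
genres = ["drama", "comedy", "drama", "action", "comedy", "drama"]
counts = Counter(genres)
for genre, freq in counts.items():
    print(genre, freq, f"{100 * freq / len(genres):.1f}%")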
Inferential Analysis
Inferential analysis analyses a sample drawn from the complete data. With this type of analysis, you can reach different conclusions from the same data by selecting different samples.
Diagnostic Analysis
Diagnostic Analysis answers "Why did it happen?" by finding the causes behind the insights found in Statistical Analysis. This analysis is useful for identifying behavior patterns in data. If a new problem arises in your business process, you can look into this analysis to find similar patterns of that problem, and there is a chance that similar prescriptions can be applied to the new problem.
Predictive Analysis
Predictive Analysis shows "what is likely to happen" by using previous data. The simplest example: if last year I bought two dresses based on my savings, and this year my salary doubles, then I can buy four dresses. Of course it is not that easy, because you have to consider other circumstances, such as the chance that clothing prices increase this year, or that instead of dresses you want to buy a new bike, or you need to buy a house.
So this analysis makes predictions about future outcomes based on current or past data. Forecasting is just an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it.
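As a toy illustration of this idea (the yearly spending figures below are made up), the Python sketch fits a simple linear trend to past data and extrapolates it one year ahead. Real predictive models use many more variables, as the text notes.
# Minimal predictive sketch: least-squares linear trend (hypothetical data).
def linear_forecast(years, values):
    """Fit y = a + b*x by least squares and predict the next year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values)) / \
        sum((x - mean_x) ** 2 for x in years)
    a = mean_y - b * mean_x
    next_year = max(years) + 1
    return next_year, a + b * next_year

# Hypothetical yearly spending on clothes.
years = [2018, 2019, 2020, 2021]
spend = [200, 240, 260, 310]
print(linear_forecast(years, spend))   # prints (2022, 340.0)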
Prescriptive Analysis
Prescriptive Analysis combines the insights from all the previous analyses to determine which action to take for a current problem or decision. Most data-driven companies utilize Prescriptive Analysis because predictive and descriptive analysis alone are not enough to improve performance based on data. Based on current situations and problems, they analyze the data and make decisions.
NoSQL Database:
NoSQL databases (aka "not only SQL") are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph.
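A minimal, database-agnostic sketch of how the same record might look in each of these data models, using plain Python structures purely for illustration (no specific NoSQL product or API is assumed):
# One customer record expressed in different NoSQL data models (conceptual only).

# Document model: a self-contained, possibly nested record (as in a document store).
document = {
    "_id": "cust-42",
    "name": "Asha",
    "orders": [{"item": "phone", "qty": 1}, {"item": "case", "qty": 2}],
}

# Key-value model: an opaque value looked up by a key.
key_value = {"cust-42": '{"name": "Asha"}'}

# Wide-column model: rows identified by a key, each holding a flexible set of columns.
wide_column = {"cust-42": {"profile:name": "Asha", "orders:count": 3}}

# Graph model: nodes and labelled edges (relationships).
nodes = {"cust-42": {"name": "Asha"}, "prod-7": {"name": "phone"}}
edges = [("cust-42", "BOUGHT", "prod-7")]

print(document["orders"][0]["item"], edges[0][1])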
1). Walmart
Walmart leverages Big Data and Data Mining to create personalized product recommendations for its customers. With the help of these two emerging technologies, Walmart can uncover valuable patterns showing the most frequently bought products, the most popular products, and even the most popular product bundles (products that complement each other and are usually purchased together). Based on these insights, Walmart creates attractive and customized recommendations for individual users. By effectively implementing Data Mining techniques, the retail giant has substantially increased its conversion rates and improved its customer service. Furthermore, Walmart uses Hadoop and NoSQL technologies to allow customers to access real-time data accumulated from disparate sources.
2). American Express
The credit card giant leverages enormous volumes of customer data to identify indicators that
could depict user loyalty. It also uses Big Data to build advanced predictive models for analyzing
historical transactions along with 115 different variables to predict potential customer churn.
Thanks to Big Data solutions and tools, American Express can identify 24% of the accounts that
are highly likely to close in the upcoming four to five months.
3). Uber
Uber is one of the major cab service providers in the world. It leverages customer data to track
and identify the most popular and most used services by the users. Once this data is collected,
Uber uses data analytics to analyze the usage patterns of customers and determine which services
should be given more emphasis and importance. Apart from this, Uber uses Big Data in another
unique way. Uber closely studies the demand and supply of its services and changes the cab fares accordingly. This is the surge pricing mechanism, which works something like this: when you are in a hurry and have to book a cab from a crowded location, Uber will charge you double the normal amount.
4). Netflix
Netflix is one of the most popular on-demand online video content streaming platforms used by
people around the world. Netflix is a major proponent of the recommendation engine. It collects
customer data to understand the specific needs, preferences, and taste patterns of users. Then it
uses this data to predict what individual users will like and create personalized content
recommendation lists for them. Today, Netflix has become so vast that it is even creating unique
content for users. Data is the secret ingredient that fuels both its recommendation engines and
new content decisions. The most pivotal data points used by Netflix include titles that users
watch, user ratings, genres preferred, and how often users stop the playback, to name a few.
Hadoop, Hive, and Pig are the three core components of the data structure used by Netflix.
2. Talent Gap in Big Data: It is difficult to follow media and analyst coverage of technology without being bombarded with content touting the value of big data analysis and the corresponding reliance on a wide range of disruptive technologies, such as NoSQL data management frameworks, in-memory analytics, and the broad Hadoop ecosystem. The reality is that there is a lack of skills available in the market for big data technologies. The typical expert has gained experience through tool implementation and its use as a programming model, apart from the broader big data management aspects.
3. Getting Data into the Big Data Structure: It might seem obvious that the intent of big data management involves analyzing and processing large amounts of data. Many people have raised expectations about analyzing huge data sets on a big data platform, yet they may not be aware of the complexity behind the transmission, access, and delivery of data and information from a wide range of sources and then loading these data into a big data platform. The intricate aspects of data transmission, access and loading are only part of the challenge. The requirement to navigate extraction and transformation is not limited to conventional relational data sets.
4. Syncing Across Data Sources: Once you import data into big data platforms, you may also realize that data copies migrated from a wide range of sources at different rates and on different schedules can rapidly get out of synchronization with the originating systems. This implies that the data coming from one source may be out of date compared to the data coming from another source. Synchronization also concerns the commonality of data definitions, concepts, metadata and the like. In traditional data management and data warehouses, the sequences of data transformation, extraction and migration all give rise to situations in which there is a risk of data becoming unsynchronized.
5. Extracting Information from the Data in Big Data Integration: The most practical use cases for big data involve data availability, augmenting existing data storage, and providing end users with access through business intelligence tools for the purpose of data discovery. These business intelligence tools must be able to connect to the different big data platforms and provide transparency to data consumers, eliminating the need for custom coding. At the same time, as the number of data consumers grows, there is a need to support an increasing collection of simultaneous user accesses. This demand may also spike at any time in reaction to different aspects of business process cycles. Ensuring right-time data availability to data consumers also becomes a challenge in big data integration.
6. Miscellaneous Challenges: Other challenges may occur while integrating big data. These include the integration of data, skill availability, solution cost, the volume of data, the rate of transformation of data, and the veracity and validity of data. Key among them is the ability to merge data that is not similar in source or structure, and to do so at a reasonable cost and in a reasonable time. It is also a challenge to process large amounts of data at a reasonable speed so that information is available to data consumers when they need it.
Sources of Big Data
1) Social data comes from the Likes, Tweets and Retweets, Comments, Video Uploads, and general media that are uploaded and shared via the world’s favorite social media platforms. This kind of
data provides invaluable insights into consumer behavior and sentiment and can be enormously
influential in marketing analytics. The public web is another good source of social data, and tools
like Google Trends can be used to good effect to increase the volume of big data.
2) Machine data is defined as information which is generated by industrial equipment, sensors
that are installed in machinery, and even web logs which track user behavior. This type of data is
expected to grow exponentially as the internet of things grows ever more pervasive and expands
around the world. Sensors in medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet of Things will deliver data of high velocity, value, volume and variety in the very near future.
3) Transactional data is generated from all the daily transactions that take place both online and offline. Invoices, payment orders, storage records, delivery receipts – all are characterized as transactional data. Yet such data alone is almost meaningless, and most organizations struggle to make sense of the data they are generating and how it can be put to good use.
Generation rate: traditional data is generated per hour or per day, whereas big data is generated far more rapidly (almost every second).
Examples of insights and questions that big data can help answer include:
• Activity and growth patterns
• Credit card fraud
• Whether someone is having an affair
• Designing roads to reflect traffic patterns and activity in different areas
• The probability of a heart attack or stroke
• Identifying process failures and security breaches
• Who you will vote for
• Whether you are an alcoholic
• How money is spent and how much money you make
• The outbreak of a virus
• Whether you are likely to commit a crime
• Driving patterns in a city
• Purchase patterns and products you are likely to buy
• Products in your home
• What you do for relaxation
• A good place to put a store or business
• How you use a website
• Brand loyalty and why people switch brands
Data Challenges
They say that necessity is the mother of all invention. That definitely holds true for data. Banks,
governments, insurance firms, manufacturing companies, health institutions, and retail
companies all realized the issues of working with these large volumes of data. Yet, it was the
Internet companies that were forced to solve it. Organizations such as Google, Yahoo!,
Facebook, and eBay were ingesting massive volumes of data that were increasing in size and
velocity every day, and to stay in business they had to solve this data problem. Google wanted to
be able to rank the Internet. It knew the data volume was large and would grow larger every day.
It went to the traditional database and storage vendors and saw that the costs of using their software licenses and storage technology were so prohibitive that they could not even be considered.
So Google realized it needed a new technology and a new way of addressing the data challenges.
Data problem
Google realized that if it wanted to be able to rank the Internet, it had to design a new
way of solving the problem. It started with looking at what was needed:
• Inexpensive storage that could store massive amounts of data cost effectively
• To scale cost effectively as the data volume continued to increase
• To analyze these large data volumes very fast
• To be able to correlate semi-structured and unstructured data with existing structured data
• To work with unstructured data that had many forms that could change frequently; for example, data structures from organizations such as Twitter can change regularly
Google also identified the problems:
• The traditional storage vendor solutions were too expensive.
• When processing very large volumes of data at the level of hundreds of terabytes and petabytes, technologies based on “shared block-level storage” were too slow and couldn’t scale cost effectively. Relational databases and data warehouses were not designed for the new level of scale of data ingestion, storage, and processing that was required. Today’s data scale requires a high-performance supercomputer platform that can scale at cost.
• The processing model of relational databases, which read data in 8K and 16K increments and then loaded it into memory to be accessed by software programs, was too inefficient for working with large volumes of data.
• The traditional relational database and data warehouse software licenses were too expensive for the scale of data Google needed.
• The architecture and processing models of relational databases and data warehouses were designed to handle transactions for a world that existed 30 to 40 years ago. These architectures and processing models were not designed to process the semi-structured and unstructured data coming from social media, machine sensors, GPS coordinates, and RFID. Solutions that address these challenges were so expensive that organizations wanted another choice.
• Business data latency needed to be reduced. Business data latency is the differential between the time when data is stored and the time when the data can be analyzed to solve business problems.
Google needed a large single data repository to store all the data. Walk into any large
organization and it typically has thousands of relational databases along with a number of
different data warehouse and business analysis solutions. All these data platforms stored
their data in their own independent silos. The data needed to be correlated and analyzed
with different datasets to maximize business value. Moving data across data silos is
expensive, requires lots of resources, and significantly slows down the time to business
insight.
To meet these challenges, Google needed:
• A data platform that could handle large volumes of data and be linearly scalable at cost and performance.
• A highly parallel processing model that was highly distributed to access and compute the data very fast.
• A data repository that could break down the silos and store structured, semi-structured, and unstructured data, making it easy to correlate and analyze the data together.
The original Google papers are still recommended reading because they lay down the foundation for the processing and storage model of Hadoop. These articles are also insightful because they define the business drivers and technical challenges Google wanted to solve.
The Necessity and Environment for Solving the Data Problem
The environment that solved the problem turned out to be Silicon Valley in California,
and the culture was open source. In Silicon Valley, a number of Internet companies had
to solve the same problem to stay in business, but they needed to be able to share and
exchange ideas with other smart people who could add the additional components.
Silicon Valley is unique in that it has a large number of startup and Internet companies
that by their nature are innovative, believe in open source, and have a large amount of
cross-pollination in a very condensed area. Open source is a culture of exchanging ideas
and writing software from individuals and companies around the world. Larger
proprietary companies might have hundreds or thousands of engineers and customers, but
open source has tens of thousands to millions of individuals who can write software and
download and test software.
Individuals from Google, Yahoo!, and the open source community created a solution for
the data problem called Hadoop. Hadoop was created for a very important reason—
survival. The Internet companies needed to solve this data problem to stay in business
and be able to grow.
Data Mining
Data mining is the process of discovering insights within a database. The aim of this is to
provide predictions and make decisions based on the data currently held.
Data Analysis
Once all the data has been collected it needs to be analyzed to look for interesting
patterns and trends. A good data analyst will spot something out of the ordinary, or
something that hasn’t been reported by anyone else.
Data Visualization
Perhaps the most important step is the visualization of the data. This is the part that takes all the work done previously and outputs a visualization that ideally anyone can understand. This can be done using libraries such as Plot.ly and d3.js, or software such as Tableau.
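As a small illustration (assuming the matplotlib library is installed; the monthly figures are made up), the Python sketch below turns a summarized result into a simple chart:
# Minimal visualization sketch (hypothetical data; requires matplotlib).
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures produced by an earlier analysis step.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 172, 190]

plt.bar(months, sales)                 # draw a simple bar chart
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.savefig("monthly_sales.png")       # write the chart to an image file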
Google File System (GFS)
In a system built from large numbers of commodity components, failures can occur at all times. Due to this, error detection and continuous monitoring become essential. The goal is to build software that helps in reducing the impact of these failures.
GFS mainly supports huge files. This is natural given the amount of data Google encounters and handles on a regular basis.
Most files are mutated by appending new data rather than overwriting data. Random
writes within a file do not exist. This helps in optimization as well as atomicity
guarantees.
The flexibility of the system is increased by co-designing the applications and the file
system API.
GFS supports the usual file operations, which include create, delete, read, write, open and close. Along with these, GFS has two additional operations: snapshot and record append.
Traditional append enables the writer to add or “append” to the end of the file. However,
this becomes complicated when two or more users want to append at the same time i.e.
concurrently. Normally, when such a situation arises, only one of the two append
operations is picked. However, for a system like the one Google uses, this can be time-
consuming as concurrent appends are encountered quite often. Let us take the example of
a user who searches the word “Universe.” There would be several web crawlers working
together on a file adding resources. Concurrent operations are bound to happen. In this
case, the results from multiple clients are merged together.
Snapshot makes a copy of a file or directory tree almost immediately while minimizing any interruptions to ongoing mutations. This is done to quickly create copies of a huge dataset or to checkpoint the current state so that future changes can be rolled back. The paper further explains how this works:
We use standard copy-on-write techniques to implement snapshots. When the master
receives a snapshot request, it first revokes any outstanding leases on the chunks in the
files it is about to snapshot. This ensures that any subsequent writes to these chunks will
require interaction with the master to find the leaseholder. This will give the master an
opportunity to create a new copy of the chunk first.
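The Python sketch below is only a conceptual illustration of the copy-on-write idea described above, not Google's implementation: a snapshot copies the file-to-chunk references, and a chunk is physically duplicated only when it is written to after the snapshot.
# Conceptual copy-on-write snapshot (illustration only, not the real GFS design).
chunks = {"c1": b"hello", "c2": b"world"}          # chunk id -> chunk data
files = {"/log": ["c1", "c2"]}                     # file path -> list of chunk ids

def snapshot(path):
    """A snapshot just copies the chunk references, not the chunk data."""
    return list(files[path])

def write(path, index, data):
    """Before mutating a shared chunk, create a private copy (copy-on-write)."""
    old_id = files[path][index]
    new_id = old_id + "'"
    chunks[new_id] = data                          # new physical chunk with new data
    files[path][index] = new_id                    # live file now points to the copy

snap = snapshot("/log")          # snapshot of /log taken here
write("/log", 0, b"HELLO")       # a later write triggers the copy
print([chunks[c] for c in files["/log"]])   # [b'HELLO', b'world'] (current file)
print([chunks[c] for c in snap])            # [b'hello', b'world'] (snapshot view)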
Hadoop Distributed File System (HDFS)
HDFS replicates each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others. As a result, the data on nodes that crash can be found elsewhere within a cluster. This ensures that processing can continue while data is recovered.
HDFS uses master/slave architecture. In its initial incarnation, each Hadoop
cluster consisted of a single Name Node that managed file system operations and
supporting Data Nodes that managed data storage on individual compute nodes. The
HDFS elements combine to support applications with large data sets.
This master node "data chunking" architecture takes as its design guide elements from the Google File System (GFS), a proprietary file system outlined in Google technical papers, as well as IBM's General Parallel File System (GPFS), a format that boosts I/O by striping blocks of data over multiple disks, writing blocks in parallel. While HDFS is not Portable Operating System Interface (POSIX) model-compliant, it echoes POSIX design style in some aspects.
HDFS architecture centers on commanding Name Nodes that hold metadata and Data
Nodes that store information in blocks. Working at the heart of Hadoop, HDFS can
replicate data at great scale.
Why use HDFS?
The Hadoop Distributed File System arose at Yahoo as a part of that company's ad
serving and search engine requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were accessed by a growing number of users, who were creating more and more data. Facebook, eBay, LinkedIn and Twitter
are among the web companies that used HDFS to underpin big data analytics to address
these same requirements.
But the file system found use beyond that. HDFS was used by The New York Times as
part of large-scale image conversions, Media6Degrees for log processing and machine
learning, LiveBet for log storage and odds analysis, Joost for session analysis and Fox
Audience Network for log analysis and data mining. HDFS is also at the core of many
open source data warehouse alternatives, sometimes called data lakes.
Because HDFS is typically deployed as part of very large-scale implementations, support
for low-cost commodity hardware is a particularly useful feature. Such systems, running
web search and related applications, for example, can range into the hundreds
of petabytes and thousands of nodes. They must be especially resilient, as server failures
are common at such scale.
HDFS and Hadoop history
In 2006, Hadoop's originators ceded their work on HDFS and Map Reduce to the Apache
Software Foundation project. The software was widely adopted in big data analytics
projects in a range of industries. In 2012, HDFS and Hadoop became available in Version
1.0.
The basic HDFS standard has been continuously updated since its inception.
With Version 2.0 of Hadoop in 2013, a general-purpose YARN resource manager was
added, and Map Reduce and HDFS were effectively decoupled. Thereafter, diverse data
processing frameworks and file systems were supported by Hadoop. While Map Reduce
was often replaced by Apache Spark, HDFS continued to be a prevalent file system for
Hadoop.
After four alpha releases and one beta, Apache Hadoop 3.0.0 became generally available
in December 2017, with HDFS enhancements supporting additional Name Nodes, erasure
coding facilities and greater data compression. At the same time, advances in HDFS
tooling, such as LinkedIn's open source Dr. Elephant and Dynamometer performance
testing tools, have expanded to enable development of ever larger HDFS
implementations.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed to use low-cost hardware.
HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the name node and data node help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Features of HDFS:
1) Reliability:- The Hadoop file system provides highly reliable data storage. It can store hundreds of petabytes of data. Data is stored in blocks, which are placed on nodes in racks within clusters; there can be any number of clusters, so data is reliably stored in the blocks. Replicas of these blocks are also created on different machines in the cluster for fault tolerance. Hence, data is quickly available to users without any loss.
2) Fault Tolerance:- Fault tolerance is how the system handles unfavorable situations. The Hadoop file system is highly fault tolerant because of its block-based design. The data in HDFS is divided into blocks, and multiple copies of each block are created on different machines. This replication is configurable and is done to avoid loss of data. If one block in a cluster goes down, the client can access the data from another machine holding a copy of that block.
HDFS has different racks on which replicas of blocks of data are created, so in case a machine fails, the user can access the data from a replica on a different rack on another slave node.
3) High Availability:- The Hadoop file system has high availability. The block architecture is designed to provide high availability of data. Block replication provides data availability when a machine fails. Whenever a client wants to access data, they can easily retrieve the information from the nearest node present in the cluster. At the time of a machine failure, data can be accessed from the replicated blocks present in another rack on another slave of the cluster.
Replication:- This feature is a unique and essential feature of the Hadoop file system. It was added to resolve data loss issues which occur due to hardware failure, crashing of nodes, etc. HDFS keeps creating replicas of blocks on different machines and regularly maintains the replication. The default replication factor is three, i.e., there are three copies of each block in the cluster.
Scalability:- The Hadoop file system is highly scalable. As the data grows, resource requirements such as CPU, memory, and disk in the cluster also increase; when the data volume is high, the number of machines in the cluster is increased as well.
Distributed Storage:- HDFS is a distributed file system. It stores files in the form of blocks of fixed size, and these blocks are stored across a cluster of several machines. HDFS follows a Master-Slave architecture in which the slave nodes (also called the Data Nodes) form the cluster, which is managed by the master node (also called the Name Node).
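A minimal Python sketch of the idea just described: splitting data into fixed-size blocks and placing replicated copies on distinct data nodes. The tiny block size and node names are illustrative only (HDFS uses block sizes such as 128 MB), and the round-robin placement is a simplification of HDFS's rack-aware placement.
# Illustrative only: split data into fixed-size blocks and assign replicas to nodes.
BLOCK_SIZE = 4            # bytes, tiny on purpose for the demo
REPLICATION = 3           # copies of each block (matches the HDFS default of three)
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Round-robin placement of each block's replicas on distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
print(len(blocks), "blocks")
print(place_replicas(len(blocks)))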
HDFS ARCHITECTURE
As mentioned earlier, HDFS follows a Master-Slave architecture in which the Master node is called the Name Node and the Slave node is called the Data Node. Name Node and Data Node(s) are the building blocks of HDFS.
There is exactly one Name Node and a number of Data Nodes. The Data Nodes contain the blocks of files in a distributed manner. The Name node has the responsibility of managing the blocks of files and the allocation/deallocation of memory for the file blocks.
Master/Name Node:- The Name node stores the metadata of the whole file system, which contains
information about where each block of file and its replica is stored, the number of blocks of data,
the access rights for different users of the file system for a particular file, date of creation, date of
modification, etc. All the Data nodes send a Heartbeat message to the Name node at a fixed
interval to indicate that they are alive. Also, a block report is sent to the Name node which
contains all the information about the file blocks on that particular Data node.
Edit Logs: It stores all the current changes made to the file system along with the
file, block, and data node on which the file block is stored.
The Name node is also responsible for maintaining the replication factor of the
block of files. Also, in case a data node fails, the Name node removes it from the
cluster, handles the reallocation of resources and redirects the traffic to another
data node.
Slave/Data Node:- Data node stores the data in the form of blocks of files. All the read-write
operations on files are performed on the data nodes and managed by the name node. All the data
nodes send a heartbeat message to the name node to indicate their health. The default interval for
that is set to 3 seconds, but it can be modified according to the need.
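The sketch below is a simplified, stand-alone simulation of this heartbeat idea (it is not Hadoop code): data nodes report in on a fixed interval, and the name node marks any node it has not heard from recently as dead. The 3-second default is shortened here so the example runs quickly.
# Simplified heartbeat bookkeeping (simulation only, not Hadoop's implementation).
import time

HEARTBEAT_INTERVAL = 0.5   # seconds (the HDFS default is 3 s, configurable)
DEAD_AFTER = 1.5           # mark a node dead if it has been silent this long

last_seen = {}             # data node name -> time of its last heartbeat

def heartbeat(node):
    """Record that a data node just reported in."""
    last_seen[node] = time.time()

def dead_nodes(now):
    """Nodes whose last heartbeat is older than the allowed silence."""
    return [n for n, t in last_seen.items() if now - t > DEAD_AFTER]

heartbeat("datanode1")
heartbeat("datanode2")
for _ in range(4):                 # datanode1 keeps reporting, datanode2 goes silent
    time.sleep(HEARTBEAT_INTERVAL)
    heartbeat("datanode1")
print("dead:", dead_nodes(time.time()))   # expected: ['datanode2']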
Name node
The name node is the commodity hardware that contains the GNU/Linux operating
system and the name node software. It is a software that can be run on commodity
hardware. The system having the name node acts as the master server and it does the
following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Data node
The data node is a commodity hardware having the GNU/Linux operating system and
data node software. For every node (Commodity hardware/System) in a cluster, there will
be a data node. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the name node.
Job Tracker
Job Tracker process runs on a separate node and not usually on a Data Node.
Job Tracker is an essential Daemon for Map Reduce execution in MRv1. It is replaced by
Resource Manager/Application Master in MRv2.
Job Tracker receives the requests for Map Reduce execution from the client.
Job Tracker talks to the Name Node to determine the location of the data.
Job Tracker finds the best Task Tracker nodes to execute tasks based on data locality (proximity of the data) and the available slots to execute a task on a given node (see the sketch after this list).
Job Tracker monitors the individual Task Trackers and submits the overall status of the job back to the client.
The Job Tracker process is critical to the Hadoop cluster in terms of Map Reduce execution.
When the Job Tracker is down, HDFS will still be functional, but Map Reduce execution cannot be started and the existing Map Reduce jobs will be halted.
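A small Python sketch of the data-locality idea described above (purely illustrative; the names and structures are not taken from Hadoop's code): prefer a Task Tracker on a node that already holds a replica of the block and has a free slot, otherwise fall back to any tracker with a free slot.
# Illustrative task assignment by data locality (not Hadoop's actual scheduler).
def pick_task_tracker(block_replicas, free_slots):
    """block_replicas: nodes holding the data; free_slots: node -> free task slots."""
    for node in block_replicas:                 # 1) prefer a node with a local replica
        if free_slots.get(node, 0) > 0:
            return node
    for node, slots in free_slots.items():      # 2) otherwise any node with a free slot
        if slots > 0:
            return node
    return None                                 # no capacity available right now

replicas = ["node2", "node4"]
slots = {"node1": 2, "node2": 0, "node3": 1, "node4": 1}
print(pick_task_tracker(replicas, slots))       # node4: local replica and a free slot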
Task Tracker
Task Tracker runs on Data Nodes, usually on all Data Nodes.
Task Tracker is replaced by Node Manager in MRv2.
Mapper and Reducer tasks are executed on Data Nodes administered by Task Trackers.
Task Trackers will be assigned Mapper and Reducer tasks to execute by the Job Tracker.
Task Tracker will be in constant communication with the Job Tracker signaling the
progress of the task in execution.
Task Tracker failure is not considered fatal. When a Task Tracker becomes unresponsive, the Job Tracker will assign the task it was executing to another node.
2) Pseudo-Distributed Mode (Single-Node Cluster)
• Mainly used for testing purpose
• Replication Factor will be ONE for Block
• Changes in configuration files will be required for all three files: mapred-site.xml, core-site.xml, hdfs-site.xml
3) Fully-Distributed Mode (Multi-Node Cluster)
This is the production mode of Hadoop where multiple nodes will be running. Here data
will be distributed across several nodes and processing will be done on each node.
Master and Slave services will be running on separate nodes in fully-distributed Hadoop mode.
• Production phase of Hadoop
• Separate nodes for master and slave daemons
• Data is distributed and processed across multiple nodes
In Hadoop development, each Hadoop mode has its own benefits and drawbacks. The fully distributed mode is certainly the one Hadoop is mainly known for, but there is no point in engaging those resources during the testing or debugging phase, so the standalone and pseudo-distributed Hadoop modes also have their own significance.
The elements of the app's XML configuration file are listed below.
1. widget: The app reverse domain value that we specified when creating the app.
2. name: The name of the app that we specified when creating the app.
3. description: Description for the app.
4. author: Author of the app.
5. content: The app's starting page. It is placed inside the www directory.
6. plug-in: The plug-ins that are currently installed.
7. access: Used to control access to external domains. The default origin value is set to *, which means that access is allowed to any domain. This value will not allow some specific URLs to be opened, to protect information.
8. allow-intent: Allows specific URLs to ask the app to open. For example, <allow-intent href = "tel:*" /> will allow tel: links to open the dialer.
9. platform: The platforms for building the app.
What is an XML file used for?
An XML file is an extensible markup language file, and it is used to structure data for
storage and transport. In an XML file, there are both tags and text. The tags provide the
structure to the data. The text in the file that you wish to store is surrounded by these tags,
which adhere to specific syntax guidelines.
XML code
Extensible Markup Language (XML) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable. ...
The design goals of XML emphasize simplicity, generality, and usability across the
Internet.
Latest version: 1.1 (Second Edition); September 29, 2006
Related standards: XML Schema
Domain: Data serialization
XML Example 1
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
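To show how the tags in the note example above provide structure that a program can read, here is a short Python sketch using the standard library's xml.etree.ElementTree parser (the XML text is embedded directly in the script for simplicity):
# Parse the note example and print each tag with the text it wraps.
import xml.etree.ElementTree as ET

note_xml = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

note = ET.fromstring(note_xml)     # parse the XML text into an element tree
for child in note:                 # each child tag holds one piece of the stored data
    print(child.tag, "->", child.text)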