Introduction To Big Data Analytics


DR. MALA KALRA
ASSISTANT PROFESSOR
NITTTR, CHANDIGARH
Contents
• Evolution of Big Data
• Definition of Big Data
• 5 V’s of Big Data
• What is Big Data Analytics
• Big Data Analytics Use Cases
• Big Data as a Boon
• Big Data Analytics Tools
Evolution of Big Data
Advancement in Technology

Telephone → Smartphone

Car → Smart Car


Internet of Things

Source: https://gigazine.net/gsc_news/en/20170412-iot-market-2023
Social Media

Source: Edureka
Other Factors

• Retail
• Customer Transactions (Walmart handles more than 1 million
customer transactions every hour)
• Search history details

• Banking and Finance


• Healthcare
• Research Activities (Decoding the human genome)
• CERN’s Large Hadron Collider (LHC) generates 15 PB of data a year.
What is Big Data?
What is Big Data?
As per Wikipedia
“Big data is a term for data sets that are so large or
complex that traditional data processing
applications are inadequate to deal with them.”
5V’s of Big Data
5V’s of Big Data
Volume

•Data quantity

Velocity

•Data Speed

Variety

•Data Types

Value

•Meaningful Insights

Veracity

•Quality of Data
Volume
Volume refers to the total amount of data.

•Today, Facebook ingests 500 terabytes of new data every day.

•A Boeing 737 generates 240 terabytes of flight data during a single flight across the US.

• Smartphones, the data they create and consume, and sensors embedded in everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information.
Velocity
Velocity: Measure of how fast the data is coming in.
Data generated in one minute in the digital world
Variety
Variety

• Structured Data
• Data which can be stored in a database table with rows and columns.
• Has relational keys and can easily be mapped into pre-designed fields.
• e.g. relational databases

• Semi-Structured Data
• Partially organised data
• e.g. XML, CSV files

• Unstructured Data
• Data that cannot be organized in a pre-defined manner.
• e.g. email messages, audio files, video files
• Represents around 80% of data.
• Growing faster than the other types; exploiting it could help in business decisions.
Veracity
• Veracity is the quality or trustworthiness of the data.
• Is the Data collected from Twitter posts with hash tags, abbreviations, typos, etc.,
reliable and accurate?
• Do you trust the data that you have collected?
• Is this data credible enough to collect insights from?
• Should we be basing our business decisions on the insights gathered from this
data?
• When processing big data sets, it is important to check the validity of the data before processing.
Value
• Refers to the worth of the data being extracted.
• Bulk data with no value is of no use to the company.
• Data needs to be converted into something valuable to achieve business gains.
• By calculating the total cost of processing big data and comparing it with the
ROI that the business insights are expected to generate, companies can
effectively decide whether or not big data analytics will actually add any value
to their business.
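The cost-versus-ROI comparison above boils down to simple arithmetic. A minimal sketch follows; all figures and names are hypothetical, chosen only to illustrate the decision rule:

```python
# Hypothetical figures: decide whether big data analytics adds value
# by comparing total processing cost against the expected ROI.

def adds_value(infrastructure_cost, staffing_cost, expected_roi):
    """Return True if expected returns exceed the total cost."""
    total_cost = infrastructure_cost + staffing_cost
    return expected_roi > total_cost

# e.g. $120k infrastructure + $200k staffing vs. $450k expected gains
print(adds_value(120_000, 200_000, 450_000))  # True: 450k > 320k
```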
Challenges with Big Data
Storing exponentially growing huge data sets

Integrating disparate data sources

Generating insights in a timely manner

Data Governance

Security Issues
What is Big Data Analytics?

• Examining large and different types of data to uncover hidden patterns, correlations and other insights.
Big Data Analytics Use Cases
Amazon’s “360-degree view”
• Amazon uses Big Data gathered from customers to build and fine-tune its recommendation
engine.
• what you buy
• your reviews/feedback
• any personal details
• your shipping address (to guess your income level based on where you live)
• Browsing behavior

• The more Amazon knows about you, the better it can predict what you want to buy.
• May make recommendations based on what other customers with similar profiles bought.
• Also helps in retaining their customers
Amazon –Improving user experience
• Analysing the clicks of every visitor on their website aids them in
• understanding their site-navigation behaviour
• paths the user took to buy the product
• paths that led them to leave the site and more

• All this information helps Amazon to improve the user experience, thereby improving their sales and marketing.
Social Media Analysis and Response
• Companies monitor what people are saying about their products or
services in social media
• Collect and analyze the posts on Facebook, Twitter, Instagram etc.
• Helps in improving their products
• Improve customer satisfaction and retain customers
IIoT - Preventive Maintenance and Support
• Factories and other facilities that use expensive equipment are deploying sensors to monitor that equipment and transmit relevant data over the Internet.

• Big data solutions analyze that information — often in real time — to detect
when a problem is about to occur.
• Perform preventive maintenance that may help prevent accidents or costly line
shutdowns.
Healthcare
• Big Data in healthcare pertains to the massive amounts of healthcare data gathered from
multiple sources such as
• wearable devices like fitness trackers, smartwatches, sensors, etc.
• Biometric data such as X-Rays images, MRI scans
• medical documents
• EHR (electronic health records) which includes demographics, medical history, allergies, laboratory
test results etc.
• Helps doctors or medical specialists create databases.
• Vital for the prediction of inherited diseases. For instance, the patients at risk of developing a
specific disease (e.g. diabetes) can benefit from preventive care.
• May make suggestions for each patient according to the information collected from other patients.
Healthcare (Contd..)
• In hospitals, Clinical Decision Support (CDS) software analyzes medical data on the
spot, providing health practitioners with advice as they diagnose patients and write
prescriptions.
• Real-Time Alerting
• Wearables collect patients’ health data continuously and provide doctors with information
and changes.
• If something is wrong, an alert is automatically sent to the doctor or another specialist.
• The doctor can then contact the patient without further delay and give all the necessary instructions.
Insurance Fraud
• Insurers analyze their internal data to gain insight into potentially fraudulent claims, such as
• call center notes and voice recordings,
• social media data
• third party details on people’s bills, wages, bankruptcies, criminal records, and address changes

• For example, while a claimant may declare their car was damaged by flooding, their social
media feed may indicate weather conditions were sunny on the day of the supposed incident.
• Insurers can supplement this data with text analytics technology that can detect minor
discrepancies hidden in a claimant’s case report.
• Fraudsters tend to alter their story over time, making this a powerful tool in detecting criminal
activity.
Big Data as a Boon
Efficient utilization of big data is the key to business growth

Cost Savings

Faster and Better Decision Making

Understand the market conditions

Next Generation products

Evaluation of customer satisfaction


Big Data Analytics Tools
Apache Hadoop

• Hadoop is an open-source software framework built to handle very large data sets.

HDFS

• HDFS is the storage component of Hadoop.
• Stores data by splitting files into blocks of pre-determined size
• Blocks are stored across a cluster of nodes
• Follows a Master/Slave architecture, where a cluster comprises a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes).
HDFS Core Components

• NameNode
• Secondary NameNode
• DataNodes


NameNode
• Master node in the HDFS Architecture
• Maintains and manages the blocks present on the DataNodes
• highly available server that controls access to files by clients.
• records the metadata of all the files stored in the cluster, e.g. location of blocks stored, size
of files etc.
• Metadata is maintained using two files:
• FsImage: a snapshot of the entire file system namespace since the NameNode started (stored on disk).
• EditLogs: records all modifications made since the last checkpoint, e.g. in the last 1 hr.
NameNode
• Responsible for maintaining the replication factor
• Receives a Heartbeat (default every 3 sec) and a block report from all the DataNodes to ensure DataNodes are live
• Responsible for choosing new DataNodes in case of DataNode failure
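The heartbeat check above can be sketched as bookkeeping of last-seen timestamps. The 3-second interval comes from the slide; the 30-second dead-node timeout and the class and node names below are hypothetical simplifications (real HDFS waits on the order of minutes before declaring a DataNode dead):

```python
import time

HEARTBEAT_INTERVAL = 3  # seconds, HDFS default per the slide above
DEAD_TIMEOUT = 30       # hypothetical; real HDFS waits far longer

class NameNodeMonitor:
    """Tracks the last heartbeat received from each DataNode."""
    def __init__(self):
        self.last_heartbeat = {}

    def receive_heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def live_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items() if now - t <= DEAD_TIMEOUT]

    def dead_nodes(self, now=None):
        # Nodes whose last heartbeat is too old; their blocks must be
        # re-replicated onto newly chosen DataNodes.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_heartbeat.items() if now - t > DEAD_TIMEOUT]

monitor = NameNodeMonitor()
monitor.receive_heartbeat("dn1", now=100)  # 40 s ago: considered dead
monitor.receive_heartbeat("dn2", now=138)  # 2 s ago: live
print(monitor.live_nodes(now=140))  # ['dn2']
print(monitor.dead_nodes(now=140))  # ['dn1']
```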
DataNodes
• slave nodes in HDFS
• Stores actual data
• commodity hardware - non-expensive system which is not of high quality or
high-availability
• Serve read and write requests from clients
Secondary NameNode
• Secondary NameNode works concurrently with the primary NameNode as a helper daemon
• Responsible for performing checkpointing.
• combining the EditLogs with FsImage from the NameNode.
• downloads the EditLogs from the NameNode at regular intervals and applies to FsImage.
• The new FsImage is copied back to the NameNode, which is used whenever the NameNode is
started the next time.
• Performs regular checkpoints (default 1 hr) and hence is also called the CheckpointNode.
• During checkpointing, the latest changes are recorded in a new EditLogs file
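Checkpointing amounts to replaying the edit log onto the metadata snapshot. A minimal sketch, assuming a dictionary-based FsImage and a list-based EditLogs (the real formats are binary, and the operation names here are hypothetical):

```python
# Sketch: the Secondary NameNode merges EditLogs (recent operations)
# into FsImage (a snapshot of file metadata) to produce a new FsImage.

def checkpoint(fsimage, editlog):
    """Apply each logged operation to a copy of the snapshot."""
    new_image = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image  # copied back to the NameNode; EditLogs starts fresh

fsimage = {"/logs/a.txt": {"size": 128}}
editlog = [("create", "/logs/b.txt", {"size": 64}),
           ("delete", "/logs/a.txt", None)]
print(checkpoint(fsimage, editlog))  # {'/logs/b.txt': {'size': 64}}
```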
Secondary NameNode and Checkpointing
HDFS Data Blocks

A 406 MB file is split into four blocks:

Block 1: 128 MB | Block 2: 128 MB | Block 3: 128 MB | Block 4: 22 MB
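The 406 MB example can be reproduced with a few lines of arithmetic; 128 MB is the HDFS default block size:

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full-size except possibly the last one.
        blocks.append(min(BLOCK_SIZE_MB, remaining))
        remaining -= BLOCK_SIZE_MB
    return blocks

print(split_into_blocks(406))  # [128, 128, 128, 22]
```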
Replication Management

• HDFS provides reliability by replicating blocks to different nodes.

• The default replication factor is 3, which is configurable.
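Replication can be sketched as assigning each block to several distinct DataNodes. The round-robin placement below is a deliberate simplification (real HDFS placement is rack-aware), and the block and node names are hypothetical:

```python
REPLICATION_FACTOR = 3  # HDFS default, configurable

def place_replicas(blocks, datanodes, factor=REPLICATION_FACTOR):
    """Assign each block to `factor` distinct DataNodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        # Start each block at a different node so copies spread out.
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(factor)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(["blk1", "blk2"], nodes))
# blk1 -> ['dn1', 'dn2', 'dn3'], blk2 -> ['dn2', 'dn3', 'dn4']
```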
MapReduce
• MapReduce is the processing engine of Hadoop.
• Hadoop processes data by delivering code to nodes to process in parallel.
• The map function transforms each piece of data into key-value pairs
• The pairs are then sorted by key, and a reduce function merges the values for each key into a single output.
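The map-sort-reduce flow above can be sketched with the classic word-count example; plain Python stands in here for a real Hadoop job:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key-value pair for each word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Sort pairs by key, then sum the values for each key."""
    counts = defaultdict(int)
    for word, one in sorted(pairs):  # sorting groups equal keys together
        counts[word] += one
    return dict(counts)

lines = ["big data big insights", "big value"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 1, 'insights': 1, 'value': 1}
```

In a real cluster each mapper runs on the node holding its block of input, and the sorted pairs are shuffled across the network to the reducers.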
Working of MapReduce
MapReduce - Parallel Processing
• The job is divided among multiple nodes and each node works on a part of the job simultaneously.

• As the data is processed by multiple machines in parallel instead of a single machine, the time taken to process the data is reduced tremendously.
Big data analytics tools
• Hive
• Hive was designed to automatically translate SQL-like
queries into MapReduce jobs on Hadoop—all through the
use of a language called HiveQL.
• Spark
• Spark is frequently used as an alternative to Hadoop’s MapReduce because it is able to analyze data up to 100 times faster for certain applications.
• Common use cases for Apache Spark include streaming
data, machine learning and interactive analysis.
Big data analytics tools
• NoSQL databases
• RDBMSs could no longer meet the new data management needs of Big Data.
• Not Only SQL (NoSQL) allows storing, retrieving and
analyzing massively large volumes of disparate and complex
data at lightning fast speeds.
