Introduction To Big Data Analytics
DR. MALA KALRA
ASSISTANT PROFESSOR
NITTTR, CHANDIGARH
Contents
• Evolution of Big Data
• Definition of Big Data
• 5 V’s of Big Data
• What is Big Data Analytics
• Big Data Analytics Use Cases
• Big Data as a Boon
• Big Data Analytics Tools
Evolution of Big Data
Advancement in Technology
Telephone Car
Source: https://gigazine.net/gsc_news/en/20170412-iot-market-2023
Social Media
Source: Edureka
Other Factors
• Retail
• Customer Transactions (Walmart handles more than 1 million
customer transactions every hour)
• Search history details
5 V’s of Big Data
• Volume: data quantity
• Velocity: data speed
• Variety: data types
• Value: meaningful insights
• Veracity: quality of data
Volume
Volume refers to the total amount of data.
• Smartphones and the data they create and consume, together with sensors
embedded into everyday objects, will soon result in billions of
new, constantly updated data feeds containing environmental,
location, and other information.
Velocity: Measure of how fast the data is coming in.
Data generated in one minute in the digital world
Variety
• Structured Data
• Data that can be stored in a database table with rows and columns.
• Has a relational key and can be easily mapped into pre-designed fields.
• E.g. relational databases
• Unstructured Data
• Data that cannot be organized in a pre-defined manner.
• E.g. email messages, audio files, video files
• Represents around 80% of data.
• Growing faster than other types; exploiting it could help in business decisions.
Veracity
• Veracity is the quality or trustworthiness of the data.
• Is data collected from Twitter posts, with their hashtags, abbreviations, and typos,
reliable and accurate?
• Do you trust the data that you have collected?
• Is this data credible enough to collect insights from?
• Should we be basing our business decisions on the insights gathered from this
data?
• When processing big data sets, it is important to check the validity of the data
before processing begins.
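A validity check of this kind can be sketched in a few lines of Python. This is only an illustration: the record fields (`user_id`, `text`) and the two rules applied are assumptions made for the example, not a standard schema.

```python
def is_valid(record):
    """Return True if the record has a user ID and non-blank text (assumed rules)."""
    text = record.get("text")
    return bool(record.get("user_id")) and isinstance(text, str) and text.strip() != ""


def split_valid(records):
    """Separate records that pass the check from those that fail it."""
    valid = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return valid, rejected


# Hypothetical social media posts, some of them unusable.
posts = [
    {"user_id": "u1", "text": "Loving the new phone! #happy"},
    {"user_id": "", "text": "no user attached"},
    {"user_id": "u2", "text": "   "},
    {"user_id": "u3", "text": None},
]

valid, rejected = split_valid(posts)
print(len(valid), "valid,", len(rejected), "rejected")
```

Only the records that pass would be handed to the analytics step; the rejected ones can be logged and reviewed.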
Value
• Refers to the worth of the data being extracted.
• A large volume of data that carries no value is of no use to the company.
• Data needs to be converted into something valuable to achieve business gains.
• By calculating the total cost of processing big data and comparing it with the
ROI that the business insights are expected to generate, companies can
effectively decide whether or not big data analytics will actually add any value
to their business.
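The cost-versus-ROI comparison can be written out as a small calculation. The sketch below uses the common definition ROI = (gain - cost) / cost; every figure in it is hypothetical and chosen only to show the arithmetic.

```python
def roi(expected_gain, total_cost):
    """Return on investment as a fraction: (gain - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (expected_gain - total_cost) / total_cost


# Hypothetical yearly figures for a big data project.
storage_cost = 200_000
tooling_cost = 50_000
staff_cost = 100_000
total_cost = storage_cost + tooling_cost + staff_cost

expected_gain = 420_000  # assumed value of the resulting business insights

project_roi = roi(expected_gain, total_cost)
print(f"Total cost: {total_cost}, ROI: {project_roi:.0%}")
```

A positive ROI suggests the analytics effort may pay for itself; a negative one suggests that it will not under the stated assumptions.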
Challenges with Big Data
Storing exponentially growing huge data sets
Data Governance
Security Issues
What is Big Data Analytics?
• The more Amazon knows about you, the better it can predict what you want to buy.
• It can make recommendations based on what other customers with similar profiles bought.
• This also helps Amazon retain its customers.
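One simple way to recommend items from customers with similar profiles is to compare purchase histories. The sketch below uses Jaccard similarity on sets of purchased items; the customers, the items, and the choice of similarity measure are all assumptions for illustration and do not describe Amazon's actual system.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases):
    """Suggest items bought by the most similar other customer."""
    mine = purchases[target]
    others = [c for c in purchases if c != target]
    if not others:
        return set()
    best = max(others, key=lambda c: jaccard(mine, purchases[c]))
    return purchases[best] - mine


# Hypothetical purchase histories.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
}

suggestions = recommend("alice", purchases)
print(suggestions)
```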
Amazon: Improving user experience
• Analysing the clicks of every visitor on its website helps Amazon understand
• their site-navigation behaviour
• the paths users took to buy a product
• the paths that led them to leave the site, and more
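The path analysis described above can be sketched by counting page sequences per session. In this example the session data is made up, and a session is assumed to end in a purchase when its last page is `checkout`.

```python
from collections import Counter


def path_outcomes(sessions):
    """Count page paths for sessions that ended in a purchase and for those that did not."""
    bought, left = Counter(), Counter()
    for pages in sessions:
        path = " > ".join(pages)
        # Assumption: the last page "checkout" marks a completed purchase.
        if pages and pages[-1] == "checkout":
            bought[path] += 1
        else:
            left[path] += 1
    return bought, left


# Hypothetical click sessions, one list of pages per visitor.
sessions = [
    ["home", "search", "product", "checkout"],
    ["home", "product", "checkout"],
    ["home", "search", "product", "checkout"],
    ["home", "search"],
    ["home", "product", "reviews"],
]

bought, left = path_outcomes(sessions)
top_buy_path, top_buy_count = bought.most_common(1)[0]
print("Most common buying path:", top_buy_path, "x", top_buy_count)
print("Paths that ended without a purchase:", list(left))
```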
• Big data solutions analyze that information, often in real time, to detect
when a problem is about to occur.
• This enables preventive maintenance that may help prevent accidents or costly line
shutdowns.
Healthcare
• Big Data in healthcare refers to the massive amounts of healthcare data gathered from
multiple sources such as
• wearable devices like fitness trackers, smartwatches, sensors, etc.
• biometric data such as X-ray images and MRI scans
• medical documents
• EHRs (electronic health records), which include demographics, medical history, allergies,
laboratory test results, etc.
• Helps doctors or medical specialists create databases.
• Vital for the prediction of inherited diseases. For instance, the patients at risk of developing a
specific disease (e.g. diabetes) can benefit from preventive care.
• May make suggestions for each patient based on the information collected from other
patients.
Healthcare (Contd.)
• In hospitals, Clinical Decision Support (CDS) software analyzes medical data on the
spot, providing health practitioners with advice as they diagnose patients and write
prescriptions.
• Real-Time Alerting
• Wearables collect patients’ health data continuously and keep doctors informed of
any changes.
• If something is wrong, an alert is automatically sent to the doctor or another
specialist.
• The doctor can then contact the patient without delay and give all the necessary
instructions.
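Real-time alerting of this kind often comes down to checking each new reading against a limit. The sketch below is a minimal illustration: the heart-rate limits and the readings are hypothetical and are not clinical guidance.

```python
def check_reading(heart_rate, low=40, high=120):
    """Return an alert message if the reading is outside [low, high], else None.

    The default limits are assumed values chosen for this example only.
    """
    if heart_rate < low or heart_rate > high:
        return f"ALERT: heart rate {heart_rate} bpm outside {low}-{high} bpm"
    return None


# Hypothetical stream of readings from a wearable device.
readings = [72, 75, 130, 68, 35]

alerts = [msg for msg in map(check_reading, readings) if msg is not None]
for msg in alerts:
    print(msg)
```

In a real deployment the alert would be sent to the doctor instead of printed.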
Insurance Fraud
• To gain insight into potentially fraudulent claims, insurers analyze internal and external data such as
• call center notes and voice recordings
• social media data
• third-party details on people’s bills, wages, bankruptcies, criminal records, and address changes
• For example, while a claimant may declare their car was damaged by flooding, their social
media feed may indicate weather conditions were sunny on the day of the supposed incident.
• Insurers can supplement this data with text analytics technology that can detect minor
discrepancies hidden in a claimant’s case report.
• Fraudsters tend to alter their story over time, making this a powerful tool in detecting criminal
activity.
Efficient
utilization of
Cost Savings
• Hadoop is open-source software built to handle very large data sets.
NameNode
HDFS
406 MB
• HDFS provides reliability by replicating blocks to different nodes.
• The default replication factor is 3, which is configurable.
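Block splitting and replication can be illustrated with a short simulation. It uses the 406 MB figure from the HDFS slide and assumes a 128 MB block size (the default `dfs.blocksize` in Hadoop 2 and later). The node names and the round-robin placement are simplifications for the example; real HDFS placement is rack-aware.

```python
def split_into_blocks(file_mb, block_mb=128):
    """Split a file of file_mb megabytes into block sizes of at most block_mb."""
    blocks = []
    remaining = file_mb
    while remaining > 0:
        size = min(block_mb, remaining)
        blocks.append(size)
        remaining -= size
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {
        b: [nodes[(b + i) % len(nodes)] for i in range(replication)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(406)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

print("Block sizes (MB):", blocks)
for block_id, holders in placement.items():
    print(f"block {block_id} -> {holders}")
```

Because every block lives on three different nodes, the loss of a single node does not make any block unavailable.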
MapReduce
• MapReduce is the processing engine of Hadoop.
• Hadoop processes data by delivering code to nodes to process in parallel.
• The map function transforms each piece of data into key-value pairs.
• The pairs are then sorted by key, and a reduce function merges the values
for each key into a single output.
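The map, sort, and reduce steps can be sketched as a word count in plain Python. This is a single-process illustration of the idea, not Hadoop's API; the function names and the input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one line of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: merge all values for one key into a single output."""
    return key, sum(values)


lines = ["big data big insights", "data drives decisions"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```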
Working of MapReduce
MapReduce - Parallel Processing
• The job is divided among multiple nodes, and each
node works on its part of the job simultaneously.
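The idea of dividing a job among workers can be imitated on one machine. In the sketch below, threads from Python's `concurrent.futures` stand in for cluster nodes; this is an assumption made for illustration, since a real Hadoop job runs on separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Work done by one 'node': count the words in its share of the lines."""
    return Counter(word for line in chunk for word in line.split())


def parallel_word_count(lines, workers=2):
    """Split the lines among workers, count in parallel, then merge the results."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = [
    "big data needs parallel processing",
    "parallel processing splits the job",
    "each node handles part of the job",
    "results are merged at the end",
]

total = parallel_word_count(lines)
print(total.most_common(3))
```

The merged result is the same as counting all the lines in one place; only the work is spread out.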