
MODULE-1

Introduction to BigData Platform


Contents
 Challenges of Conventional Systems
 Intelligent data analysis
 Nature of Data
 Analytic Processes and Tools
 Analysis vs Reporting
 Modern Data Analytic Tools
 Statistical Concepts: Sampling Distributions, Re-Sampling, Statistical Inference, Prediction Error

Challenges of Conventional Systems

The main challenges of conventional systems in handling big data are:

 Scalability
 Speed
 Storage
 Data Integration
 Security

Scalability

 A common problem with conventional systems is that they can't scale.
 As the amount of data increases, so does the time it
takes to process and store it.
 This can cause bottlenecks and system crashes, which
are not ideal for businesses looking to make quick
decisions based on their data.

 Conventional systems also lack flexibility in terms of how they handle new types of information.

Speed

 Speed is a critical component of any data processing system. Speed is important because it allows you to:
o Process and analyse your data faster, which
means you can make better-informed decisions
about how to proceed with your business.

Storage

 The amount of data being created and stored is growing exponentially, with estimates that it would reach 44 zettabytes by 2020. That's a lot of storage space!
 The problem with conventional systems is that they
don't scale well as you add more data.

Data Integration

 Data integration is one of the biggest challenges, as it requires a lot of time and effort to combine different sources into a single database.
 This is especially true when you're trying to integrate
data from multiple sources with different schemas and
formats.

Security

 Traditional databases are designed to be accessed by trusted users within an organization, but this makes it difficult to ensure that only authorized people have access to sensitive information.
 Conventional systems are not equipped for big data.
They were designed for a different era, when the
volume of information was much smaller and more
manageable. Now that we're dealing with huge
amounts of data, conventional systems are struggling
to keep up. Conventional systems are also expensive
and time-consuming to maintain; they require
constant maintenance and upgrades in order to meet
new demands from users who want faster access
speeds and more features than ever before.

What is BigData?

 Data that is very large in size is called Big Data.


 Normally we work with data of size in MB (Word documents, Excel sheets) or at most GB (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data.
 It is a combination of structured, semi-structured and
unstructured data collected by organizations that can be
mined for information and used in machine
learning projects, predictive modeling and other advanced
analytics applications.
 Big data is often used in business, science and
government.

The Importance of Big Data

 Companies depend on big data to improve customer service, marketing, sales, team management, and many other routine operations.
 They rely on big data to innovate pioneering products and
solutions.
 Big data is the key to making informed and data-driven
decisions that can deliver tangible results.

Big Data Characteristics

There are five V's of Big Data that explain its characteristics.

5 V’s of Big Data


o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

 The name Big Data itself is related to an enormous size.

 Big Data is a vast ‘volume’ of data generated from many
sources daily, such as business processes, machines,
social media platforms, networks, human
interactions, and many more.

Variety

 Big Data can be structured, unstructured, or semi-structured data that is collected from different sources.

 In the past, data was collected only from databases and spreadsheets, but these days data comes in a wide array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

Structured data: Data that follows a fixed schema, with all the required columns defined, and is in a tabular form. Structured data is stored in relational database management systems.

Semi-structured data: Data whose schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. Unlike fully structured data, it does not fit neatly into the relations (tables) that OLTP (Online Transaction Processing) systems are built to work with.

Unstructured data: Files with no predefined structure, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of such data available, but they do not know how to derive value from it since the data is raw.

Veracity

 Veracity means how reliable the data is; there are many ways to filter or translate the data.

 Veracity is the process of being able to handle and manage data efficiently.

 Veracity refers to the degree of accuracy in data sets and how trustworthy they are. Raw data collected from various sources can cause data quality issues that may be difficult to pinpoint.

Value

 Value is an essential characteristic of big data.

 It is not just about the data that we process or store.
 It is about the valuable and reliable data that we store, process, and also analyze.

Velocity

 Velocity refers to the speed at which data is created, often in real time.
 It covers the speed of incoming data sets, their rate of change, and bursts of activity.

 Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

NOTE

Two more V’s have emerged over the past few years: value and veracity. Data has intrinsic value, but it is of no use until that value is discovered. Equally important: how truthful is your data, and how much can you rely on it?

The Primary Sources of Big Data:

A significant part of big data is generated from three primary sources:

 Machine data.
 Social data.
 Transactional data.

1. Machine Data

 Machine data is automatically generated, either as a response to a specific event or on a fixed schedule.
 It means all the information is developed from multiple
sources such as smart sensors, SIEM logs, medical devices
and wearables, road cameras, IoT devices, satellites,
desktops, mobile phones, industrial machinery, etc.

2. Social Data

 It is derived from social media platforms through tweets, retweets, likes, video uploads, and comments shared on Facebook, Instagram, Twitter, YouTube, LinkedIn, etc.
 The extensive data generated through social media platforms and online channels offers qualitative and quantitative insights into each crucial facet of brand-customer interaction.
 Social media data spreads like wildfire and reaches an extensive audience base. It provides important insights regarding customer behaviour and customer sentiment about products and services.
 This is why brands capitalising on social media
channels can build a strong connection with their online
demographic.

3. Transactional Data

 Transactional data is information gathered via online and offline transactions at different points of sale.
 The data includes vital details like transaction time,
location, products purchased, product prices, payment
methods, discounts/coupons used, and other relevant
quantifiable information related to transactions.

The sources of transactional data include:

 Payment orders
 Invoices
 Storage records and
 E-receipts

Example- What Does Facebook Do with Its Big Data?

Facebook collects vast volumes of user data (in the range of petabytes, or 1 million gigabytes) in the form of comments, likes, interests, friends, and demographics. Facebook uses this information in a variety of ways:

 To create personalized and relevant news feeds and sponsored ads

 For photo tag suggestions

 Flashbacks of photos and posts with the most
engagement

 Safety check-ins during crises or disasters

Having looked at this case study and understood its nuances, let us now look at some of the challenges of Big Data.

Big Data challenges are:

The main concerns are:

 How do you store and manage such huge volumes of data efficiently?
 How do you process and extract valuable information from huge amounts of data within a given timeframe?

1. Data-related challenges
 Not enough relevant data
 No signal in the data
 Unavailability of data
2. Missing the objectives
 Who are my best customers?
 How do my best customers shop?
 Understanding hidden patterns, e.g. through habits
 Actionable ideas
 What are the business benefits?
3. Adoption challenges
 Low engagement
 End users are not using the system

1. Managing massive amounts of data

 It's in the name: big data is big. Most companies are increasing the amount of data they collect daily. Eventually, the storage capacity a traditional data center can provide will be inadequate, which worries many business leaders.

2. Integrating data from multiple sources

 The data itself presents another challenge to businesses. There is a lot of it, but it is also diverse because it can come from a variety of different sources.

3. Ensuring data quality

 Analytics and machine learning processes that depend on big data to run also depend on clean, accurate data to generate valid insights and predictions.
 If the data is corrupted or incomplete, the results may not
be what you expect. But as the sources, types, and
quantity of data increase, it can be hard to determine if
the data has the quality you need for accurate insights.

4. Keeping data secure

Many companies handle data that is sensitive, such as:

 Company data that competitors could use to take a bigger market share of the industry
 Financial data that could give hackers access to
accounts

 Personal user information of customers that could be
used for identity theft
 If a business handles sensitive data, it will become a
target of hackers. To protect this data from attack,
businesses often hire cybersecurity professionals who
keep up to date on security best practices and techniques
to secure their systems.

5. Selecting the right big data tools

 Big data software comes in many varieties, and their capabilities often overlap. How do you make sure you are choosing the right big data tools?
 Often, the best option is to hire a consultant who can
determine which tools will fit best with what your
business wants to do with big data.

Intelligent data analysis

 Intelligent data analysis (IDA) reveals implicit, previously unknown and potentially valuable information or knowledge from large amounts of data.
 Intelligent data analysis is also a kind of decision support
process.
 Based mainly on artificial intelligence, machine learning, pattern recognition, statistics, database and visualization technology, IDA automatically extracts useful information, necessary knowledge and interesting models from large amounts of online data in order to help decision makers make the right choices.

Analysis vs Reporting

 Analytics is the method of examining and analyzing summarized data to make business decisions; reporting is an action that puts together all the needed information and data in an organized way.
 Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics; identifying business events, gathering the required information, and organizing, summarizing, and presenting existing data are all part of reporting.
 The purpose of analytics is to draw conclusions based on data; the purpose of reporting is to organize the data into meaningful information.
 Analytics is used by data analysts, scientists, and business people to make effective decisions; reporting is provided to the appropriate business leaders so that they can perform effectively and efficiently within a firm.
Statistical Concepts

Role of Statistics in BigData Analytics

 Statistics is a fundamental tool of data scientists, who are expected to gather and analyze large amounts of structured and unstructured data and report on their findings.

Descriptive Statistics

It is used to describe the basic features of data and provides a summary of a given data set, which can represent either the entire population or a sample of the population. It is derived from the calculations below (a short example computing each group of measures follows its list):

1. Central Tendency

 Mean: It is the central value, commonly known as the arithmetic average.
 Mode: It refers to the value that appears most often in a data
set.
 Median: It is the middle value of the ordered set that divides
it in exactly half.
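For example, a minimal Python sketch of these measures (using the standard statistics module and a made-up list of values, purely for illustration):

    import statistics

    # A small, made-up sample of daily transaction counts
    data = [12, 15, 12, 18, 20, 12, 25, 15, 30]

    print("Mean:", statistics.mean(data))      # arithmetic average
    print("Median:", statistics.median(data))  # middle value of the sorted data
    print("Mode:", statistics.mode(data))      # most frequently occurring value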

2. Variability

Variability includes the following parameters:

 Standard Deviation: It is a statistic that calculates the dispersion of a data set as compared to its mean.
 Variance: It refers to a statistical measure of the spread between the numbers in a data set. In general terms, it measures how far each number is from the mean. A large variance indicates that the numbers are far from the mean or average value, a small variance indicates that the numbers are closer to the average, and zero variance indicates that all the values in the set are identical.
 Range: This is defined as the difference between the
largest and smallest value of a dataset.
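A minimal Python sketch of these variability measures (again using the standard statistics module and the same made-up values):

    import statistics

    data = [12, 15, 12, 18, 20, 12, 25, 15, 30]

    print("Variance:", statistics.variance(data))  # sample variance: spread around the mean
    print("Std dev:", statistics.stdev(data))      # square root of the variance
    print("Range:", max(data) - min(data))         # largest value minus smallest value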

Difference between population and sample

 A population is the entire group that you want to draw conclusions about.
 A sample is the specific group that you will collect data from.
 The size of the sample is always less than the total size of the population.

Sampling

 It is the process of selecting a group of observations from the population in order to study the characteristics of the data and draw conclusions about the population.
 Example: Covaxin (a COVID-19 vaccine) was tested on thousands of males and females before being given to all the people of the country.

Types of Sampling:
 Depending on whether the data set for sampling is randomized or not, sampling is classified into two major groups:

 Probability Sampling

 Non-Probability Sampling

Probability Sampling (Random Sampling):


 In this type, data is randomly selected so that every observation of the population gets an equal chance of being selected for sampling (a small sketch follows the list of types below).
 Probability sampling is of 4 types:

 Simple Random Sampling

 Cluster Sampling

 Stratified Sampling

 Systematic Sampling
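As a small illustration, here is a minimal Python sketch of simple random sampling and systematic sampling using the standard random module; the population of customer IDs and the sample size are made up:

    import random

    # A made-up population of 1,000 customer IDs
    population = list(range(1, 1001))

    # Simple random sampling: every member has an equal chance of selection
    simple_sample = random.sample(population, k=50)

    # Systematic sampling: pick every step-th member after a random start
    step = len(population) // 50
    start = random.randrange(step)
    systematic_sample = population[start::step]

    print(len(simple_sample), len(systematic_sample))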

Non-Probability Sampling:
 In this type, data is not randomly selected. It mainly
depends upon how the statistician wants to select the data.

 The results may or may not be biased with respect to the population.
 Unlike probability sampling, each observation of the population does not get an equal chance of being selected for sampling.
 Non-probability sampling is of 4 types:

 Convenience Sampling

 Judgmental/Purposive Sampling

 Snowball/Referral Sampling

 Quota Sampling

Sampling Error:
 Errors which occur during sampling process are known as
Sampling Errors.

OR

 The difference between the observed value of a sample statistic and the actual value of a population parameter.
 Advantages of Sampling:
 Reduced cost and time

 Accuracy of data

 Inferences can be applied to a larger population

 Fewer resources needed

Resampling:
 Resampling is a method that consists of repeatedly drawing samples from the available data.
 It involves the selection of randomized cases, with replacement, from the sample.

 Note: In machine learning, resampling is used to improve the performance of a model.

Types of Resampling:
Two common methods of resampling are listed below; a small sketch of both follows the list:

 K-fold Cross-validation

 Bootstrapping
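A minimal Python sketch of both ideas, using only the standard library; the data values, the number of bootstrap repetitions, and the fold count are made-up assumptions for illustration:

    import random
    import statistics

    data = [12, 15, 12, 18, 20, 12, 25, 15, 30, 22]

    # Bootstrapping: repeatedly draw samples of the same size WITH replacement
    # and look at how the sample mean varies across resamples
    boot_means = []
    for _ in range(1000):
        resample = random.choices(data, k=len(data))  # sampling with replacement
        boot_means.append(statistics.mean(resample))
    print("Bootstrap estimate of the mean:", statistics.mean(boot_means))

    # K-fold cross-validation: split the data into k folds; each fold is held
    # out once for testing while the remaining folds would be used for training
    k = 5
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        print(f"Fold {i + 1}: {len(train)} training points, {len(test_fold)} test points")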

Statistical Inference

 The process of making inferences about the population based on the samples taken from it is called statistical inference or inferential statistics.
 Any measure computed on the basis of a sample value is
called a statistic.
 Example: sample mean, sample standard deviation, etc.; a small sketch follows below.
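A minimal Python sketch of the idea, using a made-up, randomly generated population; the distribution parameters and sample size are illustrative assumptions:

    import random
    import statistics

    # A made-up "population" of 10,000 values and a sample of 100 drawn from it
    random.seed(42)
    population = [random.gauss(50, 10) for _ in range(10_000)]
    sample = random.sample(population, k=100)

    # Statistics computed on the sample are used to infer the population parameters
    print("Sample mean:", statistics.mean(sample))
    print("Population mean:", statistics.mean(population))
    print("Sample standard deviation:", statistics.stdev(sample))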

Prediction Error

 In statistics, prediction error refers to the difference between the values predicted by some model and the actual values, as illustrated in the short sketch below.
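A minimal sketch with made-up actual and predicted values (the numbers and the hypothetical model's outputs are purely illustrative); it computes two common summaries of prediction error:

    # Actual observations vs. the predictions of some hypothetical model
    actual = [10.0, 12.5, 14.0, 18.0, 21.0]
    predicted = [11.0, 12.0, 15.5, 17.0, 20.0]

    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

    print("Prediction errors:", errors)
    print("MSE:", mse, "MAE:", mae)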

Analytic Processes and Tools

Top 10 Best Open-Source Big Data Tools in 2020

1. Hadoop
 Apache Hadoop is the most prominent and widely used tool in the big data industry, with its enormous capability for large-scale data processing.
 It is a 100% open-source framework and runs on commodity hardware in an existing data center.
 Furthermore, it can run on a cloud infrastructure.

 Hadoop consists of four parts (a small word-count sketch in the MapReduce style follows the list):
1. Hadoop Distributed File System: Commonly known as HDFS, it is a distributed file system designed for very high aggregate bandwidth.
2. MapReduce: A programming model for processing
big data.
3. YARN: It is a platform used for managing and
scheduling Hadoop’s resources in Hadoop
infrastructure.
4. Libraries: To help other modules to work with
Hadoop.
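To give a feel for the MapReduce model, here is a minimal word-count sketch in Python written in the style of Hadoop Streaming; the file names mapper.py and reducer.py are illustrative assumptions, and the exact streaming-jar invocation depends on the installation:

    # mapper.py - emits a (word, 1) pair for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums the counts per word (Hadoop sorts the mapper output by key)
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.strip().split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

These two scripts would typically be passed to the Hadoop Streaming jar as the -mapper and -reducer options, with -input and -output pointing at HDFS paths.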

2. Apache Spark

 Apache Spark is the next big name in the industry among the big data tools.
 The key point of this open-source big data tool is that it fills the gaps of Apache Hadoop concerning data processing.
 Interestingly, Spark can handle both batch data and
real-time data.
 It’s also quite easy to run Spark on a single local
system to make development and testing easier.
 Spark Core is the heart of the project, and it
facilitates many things like
1. distributed task transmission
2. scheduling
3. I/O functionality

 Spark is an alternative to Hadoop’s MapReduce and can run jobs 100 times faster than Hadoop’s MapReduce; a minimal PySpark sketch follows below.
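A minimal PySpark sketch of batch processing, assuming the pyspark package is installed; the file name sales.csv and the column names region and amount are made up for illustration:

    from pyspark.sql import SparkSession

    # Start a local Spark session, handy for development and testing
    spark = (SparkSession.builder
             .appName("BatchExample")
             .master("local[*]")
             .getOrCreate())

    # Read a (hypothetical) CSV of sales records and run a simple aggregation
    sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
    totals = sales.groupBy("region").sum("amount")
    totals.show()

    spark.stop()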

3. Apache Storm
 It is a distributed real-time framework for reliably processing unbounded data streams. The framework supports any programming language.
 The unique features of Apache Storm are:
 Massive scalability
 Fault tolerance with a “fail fast, auto restart” approach
 Guaranteed processing of every tuple
 Written in Clojure
 Runs on the JVM
 Supports directed acyclic graph (DAG) topologies
 Supports multiple languages
 Supports protocols like JSON
