Module-1-Introduction To BigData Platform
Conventional systems face limitations in the following areas:
Scalability
Speed
Storage
Data Integration
Security
Conventional systems also lack flexibility in terms of speed, storage, data integration, and security, making it difficult to ensure that only authorized people have access to sensitive information.
Conventional systems are not equipped for big data.
They were designed for a different era, when the
volume of information was much smaller and more
manageable. Now that we're dealing with huge
amounts of data, conventional systems are struggling
to keep up. They are also expensive and time-consuming to maintain, requiring constant upgrades to meet demands from users who want faster access speeds and more features than ever before.
What is BigData?
Big Data is a collection of data that is huge in volume and growing exponentially with time; it is so large and complex that no traditional data management tool can store or process it efficiently.
The Importance of Big Data
There are five V's of Big Data that explain its characteristics:
Volume
Big Data is a vast ‘volume’ of data generated from many
sources daily, such as business processes, machines,
social media platforms, networks, human
interactions, and many more.
Variety
Big Data can be structured, unstructured, or semi-structured, since it is collected from many different sources.
Veracity
Veracity refers to the inconsistencies and uncertainty in data; available data can be messy, and its quality and accuracy are difficult to control.
Value
After volume, velocity, variety, and veracity comes value: the data must be useful for the business.
Velocity
Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks and social media sites, sensors, mobile devices, etc.
NOTE
Big data is commonly classified into three types based on its source:
Machine data.
Social data.
Transactional data.
1. Machine Data
Machine data is generated automatically by machines such as sensors, servers, and applications, without human involvement.
2. Social Data
Social data from social media platforms provides insights regarding customer behaviour and their sentiment regarding products and services.
This is why brands capitalising on social media channels can build a strong connection with their online demographic.
3. Transactional Data
Transactional data is generated by day-to-day business transactions, for example:
Payment orders
Invoices
Storage records
E-receipts
For example, social platforms use this data to surface flashbacks of photos and posts with the most engagement.
Challenges of Big Data
1. Managing massive amounts of data
2. Securing sensitive data
Big data often includes personal user information of customers that could be used for identity theft.
If a business handles sensitive data, it can become a target for hackers. To protect this data from attacks, businesses often hire cybersecurity professionals who keep up to date on security best practices and techniques to secure their systems.
Analysis vs Reporting
Analytics is the method of examining and analyzing summarized data to make business decisions. Reporting is an action that includes all the needed information and data, put together in an organized way.
Statistical Concepts
Descriptive Statistics
1. Central Tendency
It describes the centre of a dataset using measures such as the mean, median, and mode.
2. Variability
Variance: This means the average squared difference from the mean. A large variance indicates that the numbers are far apart from the mean or average value. A small variance indicates that the numbers are closer to the average value. Zero variance indicates that all the values in the given set are identical.
Range: This is defined as the difference between the
largest and smallest value of a dataset.
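As a quick illustration, the sketch below computes both measures in plain Python over a small made-up dataset; variance here is the population variance, i.e., the average squared deviation from the mean.

data = [4, 8, 6, 5, 3, 9, 7]          # hypothetical sample values

mean = sum(data) / len(data)

# Variance: average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Range: largest value minus smallest value
value_range = max(data) - min(data)

print(f"mean={mean:.2f}, variance={variance:.2f}, range={value_range}")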
Sampling
It is the process of selecting a group of observations from the population in order to study the characteristics of the data and draw conclusions about the population.
Example: Covaxin (a COVID-19 vaccine) was tested on thousands of males and females before being given to all the people of the country.
Types of Sampling:
Based on whether the data set for sampling is randomized or not, sampling is classified into two major groups:
Probability Sampling
Non-Probability Sampling
Probability Sampling:
In this type, data is randomly selected, so each observation of the population gets an equal chance of being selected. It includes the following methods (illustrated in the sketch below):
Cluster Sampling
Stratified Sampling
Systematic Sampling
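A minimal sketch, using only Python's standard library and a made-up population of 100 units, of how two of these methods can be carried out; the split into odd/even strata is purely illustrative.

import random

population = list(range(1, 101))   # hypothetical population of 100 units

# Systematic sampling: choose a random start, then take every k-th unit.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split the population into strata (here, odd vs
# even values) and draw randomly from each stratum.
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print("systematic:", systematic)
print("stratified:", stratified)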
Non-Probability Sampling:
In this type, data is not randomly selected; it mainly depends upon how the statistician wants to select the data.
The results may or may not be biased with respect to the population. Unlike probability sampling, each observation of the population does not get an equal chance of being selected for sampling.
Non-probability sampling is of 4 types:
Convenience Sampling
Judgmental/Purposive Sampling
Snowball/Referral Sampling
Quota Sampling
Sampling Error:
Errors which occur during the sampling process are known as sampling errors; they reflect the accuracy of the data, i.e., how well the sample represents the population.
Resampling:
Resampling is a method that consists of repeatedly drawing samples from the observed data. It involves the selection of randomized cases with replacement from the sample.
Note: In machine learning, resampling is used to improve the performance of the model.
Types of Resampling:
Two common methods of Resampling are:
K-fold Cross-validation
Bootstrapping
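A minimal sketch of both methods, using only Python's standard library and a made-up dataset; in practice a model would be trained inside each fold, which is omitted here.

import random

data = [4, 8, 6, 5, 3, 9, 7, 5, 6, 8]   # hypothetical observations

# Bootstrapping: repeatedly draw samples of the same size WITH
# replacement and recompute the statistic of interest (here, the mean).
boot_means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))
print("bootstrap estimate of the mean:", sum(boot_means) / len(boot_means))

# K-fold cross-validation (k=5): each fold is held out once while the
# remaining folds would be used to fit a model.
k = 5
fold_size = len(data) // k
shuffled = random.sample(data, len(data))   # shuffle without replacement
for i in range(k):
    held_out = shuffled[i * fold_size:(i + 1) * fold_size]
    training = shuffled[:i * fold_size] + shuffled[(i + 1) * fold_size:]
    # fit on `training`, evaluate on `held_out` (model omitted)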
Statistical Inference
Statistical inference is the process of drawing conclusions about a population on the basis of sample data.
Prediction Error
Prediction error is the difference between the actual values and the values predicted by a model.
1. Hadoop
Apache Hadoop is the most prominent and widely used tool in the big data industry, with its enormous capability for large-scale data processing.
It is a 100% open-source framework that runs on commodity hardware in an existing data center. Furthermore, it can run on cloud infrastructure.
Hadoop consists of four parts:
1. Hadoop Distributed File System: Commonly known as HDFS, it is a distributed file system that provides very high aggregate bandwidth across the cluster.
2. MapReduce: A programming model for processing big data (see the word-count sketch after this list).
3. YARN: It is a platform used for managing and
scheduling Hadoop’s resources in Hadoop
infrastructure.
4. Libraries: To help other modules to work with
Hadoop.
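To make the MapReduce model concrete, here is the classic word-count example written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read stdin and write stdout. The file names are illustrative, and the pair can be tested locally with: cat input.txt | python mapper.py | sort | python reducer.py

# mapper.py -- emits a (word, 1) pair for every word in the input;
# Hadoop sorts and groups these pairs by key before the reduce phase.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- input arrives sorted by word; sum the counts per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")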
2. Apache Spark
Spark is an alternative to Hadoop's MapReduce. Because it processes data in memory, Spark can run jobs up to 100 times faster than Hadoop's MapReduce.
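A minimal PySpark sketch of the same word count; the input path is hypothetical, and Spark keeps the intermediate data in memory rather than writing it to disk between phases.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/input.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))                            # first 10 (word, count) pairs
spark.stop()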
3. Apache Storm
It is a distributed real-time framework for reliably processing unbounded data streams. The framework supports any programming language.
The unique features of Apache Storm are:
Massive scalability
Fault tolerance with a "fail fast, auto restart" approach
Guaranteed processing of every tuple
Written in Clojure
Runs on the JVM
Supports directed acyclic graph (DAG) topologies
Supports multiple languages
Supports protocols like JSON
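Storm topologies themselves are usually defined in Java or Clojure; purely as a conceptual illustration (this is not Storm's API), the plain-Python sketch below mimics the spout-to-bolt idea: a spout emits an unbounded stream of tuples and a bolt processes them one at a time.

import itertools

def sensor_spout():
    # Stands in for an unbounded source (sensor readings, log events, ...)
    for i in itertools.count():
        yield ("sensor-1", i % 10)

def threshold_bolt(stream, limit=7):
    # Passes along only the tuples whose reading exceeds the limit
    for source, reading in stream:
        if reading > limit:
            yield (source, reading)

# Take a finite slice so the demonstration terminates
for tup in itertools.islice(threshold_bolt(sensor_spout()), 5):
    print(tup)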