Big Data Analysis
Big Data is a collection of data that is huge in volume and keeps growing exponentially
with time. Its size and complexity are so large that no traditional data management tool
can store or process it efficiently. In short, Big Data is still data, but of enormous size.
Big Data analytics is the process used to extract meaningful insights such as hidden
patterns, unknown correlations, market trends, and customer preferences. Big Data
analytics provides various advantages: it can be used for better decision making,
for preventing fraudulent activities, and more.
Def: Big Data refers to extremely large and complex data sets that cannot be
effectively processed or analyzed using traditional data processing methods. It is
characterized by the volume, velocity, and variety of the data, and typically includes
both structured and unstructured data.
Big Data has certain characteristics and hence is defined using the 4Vs, namely Volume,
Velocity, Variety, and Variability (described below). Examples of Big Data:
Stock exchange: The New York Stock Exchange generates about one terabyte of new trade
data per day.
Social media: Statistics show that more than 500 terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly generated
from photo and video uploads, message exchanges, comments, and so on.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
structured data.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges when it comes to
processing it to derive value. A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files, images, videos, etc.
Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form but
is not actually defined by, for example, a table definition in a relational DBMS. An
example of semi-structured data is data represented in an XML file.
Note: Web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, wherein data is
stored in relations (tables).
Characteristics Of Big Data
Volume
Variety
Velocity
Variability
Advantages:
The state of the practice of analytics is constantly evolving as new technologies and
techniques emerge, and organizations seek to leverage data to gain a competitive
advantage.
Applications of Analytics
Analytics has a wide range of applications across various industries and domains.
Some of the most common applications of analytics include:
Business analytics
Healthcare analytics
Fraud detection and prevention
Social media analytics
Predictive maintenance
Big data
Artificial intelligence and machine learning
Cloud computing
Data visualization
Challenges In Analytics:
Data Quality
Data Privacy and security
Big Data analytics describes the process of uncovering trends, patterns, and
correlations in large amounts of raw data to help make data-informed decisions.
There are many different ways Big Data analytics can be used to improve businesses and
organizations. It is commonly divided into four types:
- Descriptive Analytics
- Diagnostic Analytics
- Predictive Analytics
- Prescriptive Analytics
The Data Analytics Lifecycle is designed for Big Data problems and data science
projects. It represents the step-by-step methodology needed to organize the activities
and tasks involved.
Phase 1 : Discovery
Phase 2: Data Preparation
Phase 3 : Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Unit - 2
Linear Regression,
Logistic Regression
Reasons to Choose and Cautions
Additional Regression Models
Advanced Analytical Theory and Methods: Classification
Decision Tree
Naive Bayes
Diagnostics of Classifiers
Additional Classification Methods
Introduction:
Regression is a well-known statistical technique for modeling the predictive relationship
between several independent variables and one dependent variable.
The objective is to find the best-fitting curve for the dependent variable in a
multidimensional space, with each independent variable being a dimension.
The curve could be a straight line, or it could be a nonlinear curve.
The quality of fit of the curve to the data can be measured by the coefficient of
correlation (r), which is the square root of the amount of variance explained by the
curve.
Steps:
1. List all the variables available for making the model.
2. Establish a dependent variable of interest.
3. Examine visual relationships between the variables of interest.
4. Find a way to predict the dependent variable using the other variables.
Example: Suppose a company wishes to plan the manufacturing of Jaguar cars for the
coming years.
The company looks at past sales data, i.e., data on previous years' sales.
Regression analysis means estimating relationships between variables.
Statistical relationships are about which elements of data hang together and which ones
hang separately. It is about categorizing variables that are distinct and unrelated to
other variables, and about describing significant positive and significant negative
relationships.
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
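A minimal sketch of how r can be computed, assuming Python with NumPy (not named in the notes) and a small made-up HSP/GPA-style data set:

```python
# Computing the correlation coefficient r for two made-up variables.
import numpy as np

x = np.array([65.0, 70.0, 75.0, 80.0, 85.0, 90.0])   # e.g. high school percentage
y = np.array([6.1, 6.8, 7.0, 7.9, 8.2, 9.0])          # e.g. college GPA

x_dev = x - x.mean()                                   # deviations from the mean
y_dev = y - y.mean()

r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
print(round(r, 3))                          # close to +1 => strong positive linear relation
print(round(np.corrcoef(x, y)[0, 1], 3))    # NumPy's built-in gives the same value
```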
The name says it all: linear regression can be used only when there is a linear
relationship among the variables. It is a statistical model used to understand the
association between independent variables (X) and a dependent variable (Y).
Linear regression is a simple and widely used algorithm. It is a supervised ML
algorithm for predictive analysis. It models the relationship between the independent
predictors and the dependent outcome variable using a linear equation.
If there is more than one independent variable, it is called multiple linear regression
and is expressed as follows:
Y = β0 + β1x1 + β2x2 + … + βnxn
where x1, …, xn denote the explanatory variables, β1, β2, …, βn are the slopes of the
regression line, and β0 is the Y-intercept of the regression line.
Regression line of Y on X: Gives the most probable Y values from the given
values of X.
Regression line of X on Y: Gives the most probable X values from the given
values of Y.
Example: How can a university student's GPA be predicted from his/her high school
percentage (HSP) of marks?
Consider a sample of ten students for whom both the GPA and the high school
percentage (HSP) are known. Assume linear regression. Then
GPA = b1·HSP + A
A simple linear regression plot for the relationship between college GPA and high
school percentage places HSP on the x-axis and GPA on the y-axis.
Whenever a perfect linear relationship between GPA and high school score exists, all
10 points on the graph would fall on a straight line.
Whenever an imperfect linear relationship exists between these two variables, a
cluster of points that slopes upward may be obtained on the graph.
In other words, students who got more marks in high school should tend to get a
higher GPA in college as well.
A simple linear regression can have two candidate regression lines with different
regression equations. In the scatter plot, either of the two lines might best summarize
the relation between GPA and high school percentage.
The following notation can be used for examining which of the two lines is a better fit:
1. yi denotes the observed response for experimental unit i.
2. xi denotes the predictor value for experimental unit i.
3. ŷi is the predicted response (or fitted value) for experimental unit i.
The error (residual) for experimental unit i is then
ei = yi - ŷi
and the best-fitting line is the one that makes these errors, taken together, as small as possible.
Usually, regression lines are used in the financial sector and for business procedures.
Financial analysts use regression techniques to predict stock prices, commodities, etc.
whereas business analysts use them to forecast sales, inventories, and so on.
The best way to fit a line is by minimizing the sum of squared errors, i.e., the squared
distance between the predicted values and the actual values. The least squares method is
the process of fitting the best curve to a set of data points. The quantity to minimize is
SSE = Σ (yi - ŷi)²
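A minimal sketch of a least squares line fit, assuming Python with NumPy and the same made-up HSP/GPA-style data as above:

```python
# Fitting a best-fit line by least squares and checking the SSE.
import numpy as np

x = np.array([65.0, 70.0, 75.0, 80.0, 85.0, 90.0])
y = np.array([6.1, 6.8, 7.0, 7.9, 8.2, 9.0])

# np.polyfit with deg=1 minimizes the sum of squared errors sum((y_i - y_hat_i)**2)
b1, b0 = np.polyfit(x, y, deg=1)        # slope and intercept
y_hat = b1 * x + b0                     # fitted values
residuals = y - y_hat                   # e_i = y_i - y_hat_i
sse = np.sum(residuals ** 2)            # the quantity being minimized
print(f"slope={b1:.3f}, intercept={b0:.3f}, SSE={sse:.4f}")
```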
Polynomial regression
You may have noticed in the equations above that the power of the independent variable
was one (Y = m*x + c). When the power of the independent variable is more than one, it
is referred to as polynomial regression (e.g., Y = m*x^2 + c).
Since the degree is not 1, the best-fit line is no longer a straight line. Instead, it is
a curve that fits the data points.
Sometimes this can result in overfitting or underfitting due to a higher degree of the
polynomial. Therefore, always plot the relationships to make sure the curve is just
right and not overfitted or underfitted.
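A minimal sketch of polynomial regression, again assuming NumPy; the data below are made-up points that follow a roughly quadratic pattern plus noise:

```python
# Fitting polynomials of different degrees to illustrate under/overfitting.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 0.5 * x**2 + x + 2 + rng.normal(scale=0.5, size=x.size)   # quadratic trend + noise

coeffs = np.polyfit(x, y, deg=2)        # degree 2 -> a curved best fit
model = np.poly1d(coeffs)               # convenient callable polynomial
print(coeffs)                           # [a, b, c] for y = a*x^2 + b*x + c
print(model(1.5))                       # prediction at x = 1.5

# Comparing the SSE of degree-1, degree-2, and degree-6 fits hints at
# underfitting (too low a degree) versus overfitting (too high a degree).
for d in (1, 2, 6):
    sse = np.sum((y - np.poly1d(np.polyfit(x, y, d))(x)) ** 2)
    print(d, round(sse, 2))
```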
https://www.geeksforgeeks.org/non-linear-regression-examples-ml/
Regression models range from simple models to highly complex equations. Two primary uses
for regression are forecasting and optimization.
- Using linear analysis on monthly sales data, a company could forecast sales for future
months.
- A financial company may be interested in minimizing its risk portfolio and hence
want to understand the top five factors or reasons for default by a customer.
- Predicting the characteristics of a child based on the characteristics of their parents.
- Predicting the prices of houses, considering the locality and builder characteristics
in a particular city.
Logistic Regression:
Regression models traditionally work with continuous numeric values for the dependent
and independent variables.
Logistic regression, however, can work with dependent variables that have categorical
values, such as whether a loan is approved or not. Logistic regression measures the
relationship between a categorical dependent variable and one or more independent
variables.
Let's see how logistic regression squeezes the output into the 0-1 range. We already
know that the equation of the best-fit line is y = β0 + β1x. Logistic regression passes
this value through the sigmoid function, p = 1 / (1 + e^-(β0 + β1x)), so the output can
be interpreted as a probability between 0 and 1.
For example, logistic regression might be used to predict whether a patient has a
given disease (e.g., diabetes) based on observed characteristics of the patient (age,
gender, body mass index, results of blood tests, etc.).
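A minimal sketch of this idea, assuming Python with scikit-learn (not named in the notes); the tiny "patient" dataset below is entirely made up:

```python
# Logistic regression for a yes/no outcome (disease vs. no disease).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, body mass index (illustrative features only)
X = np.array([[25, 22.0], [40, 27.5], [52, 31.0], [61, 29.5],
              [35, 24.0], [58, 33.0], [47, 28.0], [30, 23.5]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # 1 = has the disease, 0 = does not

clf = LogisticRegression().fit(X, y)

new_patient = np.array([[50, 30.0]])
print(clf.predict(new_patient))          # predicted class label (0 or 1)
print(clf.predict_proba(new_patient))    # probabilities, squeezed into 0-1 by the sigmoid
```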
Ridge regression
In real-world scenarios, we will never see a case where the variables are perfectly
independent; multicollinearity will always occur in real data. Here, the least squares
method fails to produce good results: although it gives unbiased estimates, their
variances are large, which can push the estimated values far from the true values.
Ridge regression adds a penalty to models with high variance, shrinking the beta
coefficients toward zero, which helps avoid overfitting.
In linear regression, we minimize the cost function (the sum of squared errors).
Remember that the goal of a model is to have low variance and low bias. To achieve
this, ridge regression adds another term to the cost function of linear regression,
built from "lambda" and the "slope" coefficients:
cost = Σ (yi - ŷi)² + λ Σ βj²
Lasso regression
Lasso (least absolute shrinkage and selection operator) regression is very similar to
ridge regression. It can reduce the variability and improve the accuracy of linear
regression models, and in addition it helps us perform feature selection. Instead of
squared coefficients, it uses absolute values in the penalty function.
In the ridge regression explained above, the coefficients only get close to zero: they
shrink toward zero but never reach it exactly. In lasso regression, however,
coefficients with small values can be driven all the way to zero, and the corresponding
features are effectively removed. This means those features are not important for
predicting the best-fit line, which is how lasso performs feature selection.
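A minimal sketch contrasting ridge and lasso, assuming scikit-learn and a made-up dataset in which only the first two features matter:

```python
# Ridge shrinks coefficients toward zero; lasso can set some exactly to zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),     # alpha plays the role of lambda
                    ("lasso", Lasso(alpha=0.5))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
# With this setup, lasso typically drives the three irrelevant coefficients to
# exactly 0.0, which is why it can double as a feature-selection step.
```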
How to select the right regression analysis model
To select the best model, it's important to focus on the dimensionality of the data and
other essential characteristics. The following cautions apply:
1. Linearity assumption
2. Association And/Or correlation do not mean Causation
3. Extrapolation
4. Outliers and influential points
Linearity
Remember, it is always important to plot a scatter diagram first. If the scatter plot
indicates that there is a linear relationship between the variables, then it is reasonable
to use the methods we are discussing.
Association and/or correlation do not mean causation
Even when we do have an apparent linear relationship and find a reasonable value of r,
there can always be confounding or lurking variables at work. Be wary of spurious
correlations and make sure the connection you are making makes sense!
There are also often situations where it may not be clear which variable is causing
which. Does lack of sleep lead to higher stress levels, or do high stress levels lead to
lack of sleep? Which came first, the chicken or the egg? Sometimes these questions may
not be answerable, but at least we are able to show that an association is there.
Extrapolation
Remember, it is always important to plot a scatter diagram first. If the scatter plot
indicates that there is a linear relationship between the variables, then it is reasonable
to use a best-fit line to make predictions for y given x within the domain of x-values in
the sample data, but not necessarily for x-values outside that domain. The process of
predicting inside the range of x-values observed in the data is called interpolation.
The process of predicting outside the range of x-values observed in the data is called
extrapolation.
Outliers and influential points
In some data sets, there are values (observed data points) that may appear to be
outliers in x or y. Outliers are points that seem to stick out from the rest of the group
in a single variable. Besides outliers, a sample may contain one or a few points that are
called influential points. Influential points are observed data points that do not follow
the trend of the rest of the data. These points may have a big effect on the calculation
of the slope of the regression line. To begin to identify an influential point, you can
remove it from the data set and see if the slope of the regression line changes
significantly.
How do we handle these unusual points? Sometimes they should not be included in
the analysis of the data. It is possible that an outlier or influential point is a result of
erroneous data. Other times it may hold valuable information about the population
under study and should remain included in the data. The key is to examine carefully
what causes a data point to be an outlier and/or influential point.
Computers and many calculators can be used to identify outliers from the data.
Computer output for regression analysis will often identify both outliers and
influential points so that you can examine them.
We know how to find outliers in a single variable using fence rules and boxplots.
However, we would like some guideline as to how far away from the best-fit line a point
needs to be in order to be considered unusual. Such points have large "errors", where the
"error" or residual is the vertical distance from the line to the point. As a rough rule
of thumb, we can flag any point that is located further than two standard deviations
above or below the best-fit line as an outlier. The standard deviation used is the
standard deviation of the residuals or errors.
We can do this visually in the scatter plot by drawing an extra pair of lines that are
two standard deviations above and below the best-fit line. Any data points that are
outside this extra pair of lines are flagged as potential outliers. Or we can do this
numerically by calculating each residual and comparing it to twice the standard
deviation. The graphical procedure is shown in the example below, followed by the
numerical calculations in the next example. You would generally need to use only one
of these methods.
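A minimal sketch of the numerical version of this rule, assuming NumPy and a made-up data set with one deliberately unusual point:

```python
# Flag points whose residual is more than two standard deviations from the line.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 30.0, 16.1, 18.0, 19.9])  # 30.0 sticks out

b1, b0 = np.polyfit(x, y, deg=1)          # best-fit line
residuals = y - (b1 * x + b0)             # vertical distances from the line
s = residuals.std()                       # standard deviation of the residuals

outlier_mask = np.abs(residuals) > 2 * s  # more than 2s above or below the line
print(np.where(outlier_mask)[0])          # index of the flagged point(s)
```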
Introduction
Decision Tree
Naive Bayes
k-Nearest Neighbours
Support Vector machine
Neural Networks
Random Forest
Neural Network
First, there is neural network. It is a set of algorithms that attempt to identify the underlying
relationships in a data set through a process that mimics how human brain operates. In data
science, neural networks help to cluster and classify complex relationship. Neural networks could
be used to group unlabelled data according to similarities among the example inputs and classify
data when they have a labelled dataset to train on.
K-Nearest Neighbors
KNN (K-Nearest Neighbors) is one of many algorithms used in data mining and machine
learning. KNN is a classifier algorithm in which the learning is based on the similarity
of a data point (a vector) to others. It stores all available cases and classifies new
cases based on a similarity measure (e.g., distance functions).
Decision Tree
The decision tree algorithm belongs to the supervised learning algorithms. It can be
used to solve regression as well as classification problems. A decision tree builds
classification or regression models in the form of a tree structure. It breaks a dataset
down into smaller and smaller subsets while, at the same time, an associated decision
tree is incrementally developed. The purpose of using the decision tree algorithm is to
predict the class or value of a target variable by learning simple decision rules
concluded from prior data.
Random Forest
Random forests are an ensemble learning method for classification, regression, and other
tasks that operates by constructing multiple decision trees at training time. For a
classification task, the output of the random forest is the class selected by most trees.
For a regression task, the mean or average prediction of the individual trees is
returned. Random forests generally outperform single decision trees but have lower
accuracy than gradient boosted trees. However, the characteristics of the data can affect
their performance.
Naïve Bayes
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of
independence between predictors. In simple terms, the Naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence of any
other feature. It updates its knowledge step by step as new information arrives.
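A minimal sketch comparing several of the classifiers named above, assuming scikit-learn and its built-in iris dataset; the accuracies are illustrative only:

```python
# Train and score a few classifiers on the same small dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "naive bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # accuracy on held-out data
```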
Decision Tree
https://www.mastersindatascience.org/learning/machine-learning-algorithms/decision-tree/#:~:text=A%20decision%20tree%20is%20a,that%20contains%20the%20desired%20categorization.
Naive Bayes
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
SVM
https://www.spiceworks.com/tech/big-data/articles/what-is-support-vector-machine/#:~:text=A%20support%20vector%20machine%20(SVM)%20is%20a%20machine%20learning%20algorithm,classes%2C%20labels%2C%20or%20outputs.
KNN:
https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/#:~:text=The%20K%2DNearest%20Neighbors%20(KNN)%20algorithm%20is%20a%20popular,have%20similar%20labels%20or%20values.
Unit - 4
We can predict:
Daily Stock Price
Weekly interest rates
Sales figures
where the outcome (the dependent variable) depends on time. In such scenarios, we use
time series forecasting.
Date        Close
1/4/2017    139.92
2/4/2017    139.58
3/4/2017    139.59
4/4/2017    141.42
5/4/2017    140.96
6/4/2017    142.27
7/4/2017    143.81
8/4/2017    142.62
9/4/2017    143.47
The stock prices change every day. A time series has four components:
Trend
Seasonality
Cyclicity
Irregularity
Trend:
Trend is the increase or decrease in the series over a period of time. It persists over a
long period of time.
Example: Population growth over the years can be seen as an upward trend.
Seasonality:
Seasonality is a regular pattern of up-and-down fluctuations; it is a short-term
variation occurring due to seasonal factors.
Cyclicity:
Cyclicity is a medium-term variation caused by circumstances that repeat at irregular
intervals.
Irregularity:
Irregularity refers to variations that occur due to unpredictable factors and do not
repeat in particular patterns.
There are various conditions under which you should not use time series methods; in
particular, many of them assume the series is stationary.
How do you differentiate between a stationary and a non-stationary time series?
The mean of the series should not be a function of time; it should be a constant.
The covariance of the i-th term and the (i+m)-th term should not be a function of time.
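A minimal sketch of a stationarity check, assuming the statsmodels library (not named in the notes) and a made-up random-walk series:

```python
# Augmented Dickey-Fuller (ADF) test for stationarity.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=200))   # a random walk: its mean drifts with time

stat, p_value = adfuller(series)[:2]
print(f"ADF statistic={stat:.3f}, p-value={p_value:.3f}")
# A large p-value (> 0.05) means we cannot reject non-stationarity;
# differencing the series (np.diff) is the usual fix before modelling.
```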
Examples of time series data:
Rainfall measurements
Stock prices
Number of sunspots
Annual retail sales
Monthly subscribers
Heartbeats per minute
A typical time series forecasting workflow involves the following steps (a sketch of this workflow follows the list):
Data Import
Data Cleaning
Stationarity Check
Model Training
Prediction
Tuning
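A minimal sketch of this workflow, assuming pandas and statsmodels, a made-up price series, and an illustrative ARIMA(1, 1, 1) model (the order is just a starting point, not a recommendation):

```python
# Import -> stationarity handling (d=1 differences once) -> train -> predict.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
dates = pd.date_range("2017-04-01", periods=120, freq="D")
close = 140 + np.cumsum(rng.normal(scale=0.8, size=120))   # synthetic "close" prices
series = pd.Series(close, index=dates)

train, test = series[:100], series[100:]                   # hold out the last 20 days

model = ARIMA(train, order=(1, 1, 1)).fit()                # d=1 handles the trend
forecast = model.forecast(steps=len(test))                 # predict the held-out days

print(forecast.head())
print("mean absolute error:", np.abs(forecast.values - test.values).mean())
```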
Text Analysis:
Text analysis is the process of using computer systems to read and understand
human-written text for business insights.
Text analysis software can independently classify, sort, and extract
information from text to identify patterns, relationships, sentiments, and other
actionable knowledge.
You can use text analysis to efficiently and accurately process multiple text-based
sources such as emails, documents, social media content, and product reviews, much as a
human would.
Businesses use text analysis to extract actionable insights from various unstructured
data sources. They depend on feedback from sources like emails, social media, and
customer survey responses to aid decision making. However, the immense volume of
text from such sources proves to be overwhelming without text analytics software.
With text analysis, you can get accurate information from the sources more
quickly. The process is fully automated and consistent, and it displays data you can
act on. For example, using text analysis software allows you to immediately detect
negative sentiment in social media posts so you can work to solve the problem.
Sentiment analysis
Sentiment analysis or opinion mining uses text analysis methods to understand the
opinion conveyed in a piece of text.
You can use sentiment analysis of reviews, blogs, forums, and other online
media to determine if your customers are happy with their purchases.
Sentiment analysis helps you spot new trends, track sentiment changes, and
tackle PR issues.
By using sentiment analysis and identifying specific keywords, you can track
changes in customer opinion and identify the root cause of the problem.
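A minimal sketch of sentiment analysis, assuming NLTK's VADER analyzer (one of several possible tools) and two made-up review snippets:

```python
# Scoring the sentiment of short review texts with VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

reviews = [
    "Absolutely love this product, it exceeded my expectations!",
    "Terrible experience, the package arrived late and damaged.",
]
for text in reviews:
    scores = sia.polarity_scores(text)       # neg/neu/pos plus a compound score
    print(scores["compound"], text)          # compound > 0 is positive, < 0 negative
```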
Record management
Text analysis also supports managing, categorizing, and searching large collections of
documents. For example, LexisNexis Legal & Professional uses text extraction to identify
specific records among 200 million documents.
You can use text analysis software to process emails, reviews, chats, and other text-
based correspondence.
With insights about customers’ preferences, buying habits, and overall brand
perception, you can tailor personalized experiences for different customer segments.
Text analysis software works on the principles of deep learning and natural
language processing.
Deep learning
It uses linguistic models and statistics to train the deep learning technology to
process and analyze text data, including handwritten text images.
Text classification
Text extraction
Topic modeling
PII redaction
Text classification
In text classification, the text analysis software learns how to associate certain
keywords with specific topics, users' intentions, or sentiments. It does so by using
linguistic models such as Naive Bayes, Support Vector Machines, and deep learning to
process the structured data, categorize words, and develop a semantic understanding
between them.
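A minimal sketch of text classification, assuming scikit-learn and a tiny made-up labelled dataset; a bag-of-words vectorizer feeds a Naive Bayes classifier:

```python
# Classifying short texts into "complaint" vs. "praise".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "the delivery was late and the box was damaged",
    "great quality, very happy with this purchase",
    "refund please, the item never arrived",
    "fast shipping and excellent customer service",
]
labels = ["complaint", "praise", "complaint", "praise"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["the parcel arrived broken and late"]))   # likely "complaint"
```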
Text extraction
Text extraction scans the text and pulls out key information. It can identify keywords,
product attributes, brand names, names of places, and more in a piece of text. The
extraction software applies methods such as regular expressions and conditional random
fields (CRFs).
For example, you can use text extraction to monitor brand mentions on social media.
Manually tracking every occurrence of your brand on social media is impossible.
Text extraction will alert you to mentions of your brand in real time.
Topic modeling
Topic modeling methods identify and group related keywords that occur in an
unstructured text into a topic or theme. These methods can read multiple text
documents and sort them into themes based on the frequency of various words in the
document. Topic modeling methods give context for further analysis of the documents.
For example, you can use topic modeling methods to read through your scanned
document archive and classify documents into invoices, legal documents, and
customer agreements. Then you can run different analysis methods on invoices to
gain financial insights or on customer agreements to gain customer insights.
PII redaction
PII redaction automatically detects and removes personally identifiable information
(PII) such as names, addresses, or account numbers from a document. PII redaction
helps protect privacy and comply with local laws and regulations.
For example, you can analyze support tickets and knowledge articles to detect and
redact PII before you index the documents in the search solution. After that, search
solutions are free of PII in documents.
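A minimal, deliberately simplistic sketch of rule-based PII redaction using regular expressions; the sample text and patterns are made up, and real systems also use trained entity recognizers:

```python
# Replace simple PII patterns (email, phone, account number) with labels.
import re

TEXT = "Contact Jane at [email protected] or call 555-123-4567 about account 9876543210."

patterns = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
    "ACCOUNT": r"\b\d{10}\b",
}

redacted = TEXT
for label, pattern in patterns.items():
    redacted = re.sub(pattern, f"[{label} REDACTED]", redacted)

print(redacted)
```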
Stage 1—Data gathering
In this stage, you gather text data from internal or external sources.
Internal data
Internal data is text content that is internal to your business and is readily available—
for example, emails, chats, invoices, and employee surveys.
External data
You can find external data in sources such as social media posts, online reviews, news
articles, and online forums. It is harder to acquire external data because it is beyond
your control. You might need to use web scraping tools or integrate with third-party
solutions to extract external data.
Stage 2—Data preparation
Data preparation is an essential part of text analysis. It involves structuring raw text
data in an acceptable format for analysis. The text analysis software automates the
process using the following common natural language processing (NLP) methods.
Tokenization
Tokenization is segregating the raw text into multiple parts that make semantic sense.
For example, the phrase text analytics benefits businesses tokenizes to the
words text, analytics, benefits, and businesses.
Part-of-speech tagging
Part-of-speech tagging assigns grammatical tags to the tokenized text. For example,
applying this step to the previously mentioned tokens results in text: Noun; analytics:
Noun; benefits: Verb; businesses: Noun.
Parsing
Parsing establishes meaningful connections between the tokenized words with
English grammar. It helps the text analysis software visualize the relationship between
words.
Lemmatization
Lemmatization is a linguistic process that simplifies words into their dictionary form,
or lemma. For example, the dictionary form of visualizing is visualize.
Stop-word removal
Stop words are words that offer little or no semantic context to a sentence, such
as and, or, and for. Depending on the use case, the software might remove them from
the structured text.
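A minimal sketch of these preparation steps, assuming NLTK (resource names can vary slightly between NLTK versions):

```python
# Tokenization, part-of-speech tagging, lemmatization, and stop-word removal.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"):
    nltk.download(pkg, quiet=True)        # one-time downloads of NLTK resources

text = "Text analytics benefits businesses by visualizing customer feedback"

tokens = nltk.word_tokenize(text)         # tokenization
tagged = nltk.pos_tag(tokens)             # part-of-speech tagging
print(tagged)

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]  # e.g. visualizing -> visualize
print(lemmas)

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]        # stop-word removal
print(filtered)
```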
Stage 3—Text analysis
Text analysis is the core part of the process, in which the text analysis software
processes the text by using different methods.
Text classification
Classification is the process of assigning tags to the text data that are based on rules or
machine learning-based systems.
Text extraction
Extraction involves identifying the presence of specific keywords in the text and
associating them with tags. The software uses methods such as regular expressions
and conditional random fields (CRFs) to do this.
Stage 4—Visualization
Visualization is about turning the text analysis results into an easily understandable
format. You will find text analytics results in graphs, charts, and tables. The
visualized results help you identify patterns and trends and build action plans. For
example, suppose you’re getting a spike in product returns, but you have trouble
finding the causes. With visualization, you look for words such as defects, wrong size,
or not a good fit in the feedback and tabulate them into a chart. Then you’ll know
which is the major issue that takes top priority.
Text analytics helps you determine if there’s a particular trend or pattern from the
results of analyzing thousands of pieces of feedback. Meanwhile, you can use text
analysis to determine whether a customer’s feedback is positive or negative.
Text analysis vs. text mining
There is no difference between text analysis and text mining. Both terms refer to the
same process of gaining valuable insights from sources such as email, survey
responses, and social media feeds.