Big Data Analysis
Big Data is a collection of data that is huge in volume and keeps growing exponentially
with time. Its size and complexity are so large that no traditional data management tool
can store or process it efficiently. In short, Big Data is still data, but of enormous size.
Big Data analytics is the process used to extract meaningful insights such as hidden
patterns, unknown correlations, market trends, and customer preferences. Big Data
analytics provides various advantages: it can be used for better decision making,
for preventing fraudulent activities, and more.
Def: Big Data refers to extremely large and complex data sets that cannot be
effectively processed or analyzed using traditional data processing methods. It is
characterized by the volume, velocity, and variety of the data, and typically includes
both structured and unstructured data.
Big Data has certain characteristics and hence is defined using the 4Vs, namely Volume,
Velocity, Variety, and Variability (described below). Examples of Big Data:
Stock exchange: The New York Stock Exchange generates about one terabyte of new trade
data per day.
Social media: Statistics show that more than 500 terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly generated
from photo and video uploads, message exchanges, comments, and so on.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
structured data.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In
addition to its huge size, unstructured data poses multiple challenges when it comes to
processing it to derive value. A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files, images, videos, etc.
Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form but
is not actually defined by, for example, a table definition in a relational DBMS. An
example of semi-structured data is data represented in an XML file.
Note: Web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems are built to work with structured data, wherein data is
stored in relations (tables).
Characteristics Of Big Data
Volume
Variety
Velocity
Variability
Advantages:
The state of the practice of analytics is constantly evolving as new technologies and
techniques emerge, and organizations seek to leverage data to gain a competitive
advantage.
Applications of Analytics
Analytics has a wide range of applications across various industries and domains.
Some of the most common applications of analytics include:
Business analytics
Healthcare analytics
Fraud detection and prevention
Social media analytics
Predictive maintenance
Big data
Artificial intelligence and machine learning
Cloud computing
Data visualization
Challenges In Analytics:
Data Quality
Data Privacy and security
Big Data analytics describes the process of uncovering trends, patterns, and
correlations in large amounts of raw data to help make data-informed decisions.
There are many different ways Big Data analytics can be used to improve businesses and
organizations. It is commonly divided into four types:
- Descriptive Analytics
- Diagnostic Analytics
- Predictive Analytics
- Prescriptive Analytics
The Data Analytics Lifecycle is designed for Big Data problems and data science
projects. It represents the step-by-step methodology needed to organize the activities
and tasks involved.
Phase 1 : Discovery
Phase 2: Data Preparation
Phase 3 : Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Unit - 2
Linear Regression,
Logistic Regression
Reasons to Choose and Cautions
Additional Regression Models
Advanced Analytical Theory and Methods: Classification
Decision Tree
Naive Bayes
Diagnostics of Classifiers
Additional Classification Methods
Introduction:
Regression is a well-known statistical technique for modeling the predictive relationship
between several independent variables and one dependent variable.
The objective is to find the best-fitting curve for the dependent variable in a
multidimensional space, with each independent variable being a dimension.
The curve could be a straight line, or it could be a nonlinear curve.
The quality of fit of the curve to the data can be measured by the coefficient of
correlation (r), which is the square root of the amount of variance explained by the
curve.
Steps:
1. List all the variables available for making the model.
2. Establish a dependent variable of interest.
3. Examine visual relationships between the variables of interest.
4. Find a way to predict the dependent variable using the other variables.
Example: Suppose a company wishes to plan the manufacturing of Jaguar cars for the
coming years.
The company looks at past sales data, i.e., data on previous years' sales.
Regression analysis means estimating relationships between variables.
Statistical relationships are about which elements of data hang together and which ones
hang separately. It is about categorizing variables that are distinct and unrelated to
other variables, and about describing significant positive and significant negative
relationships.
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
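A minimal sketch of how r can be computed, assuming Python with NumPy (not named in the notes) and a small made-up HSP/GPA-style data set:

```python
# Computing the correlation coefficient r for two made-up variables.
import numpy as np

x = np.array([65.0, 70.0, 75.0, 80.0, 85.0, 90.0])   # e.g. high school percentage
y = np.array([6.1, 6.8, 7.0, 7.9, 8.2, 9.0])          # e.g. college GPA

x_dev = x - x.mean()                                   # deviations from the mean
y_dev = y - y.mean()

r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
print(round(r, 3))                          # close to +1 => strong positive linear relation
print(round(np.corrcoef(x, y)[0, 1], 3))    # NumPy's built-in gives the same value
```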
The name says it all: linear regression can be used only when there is a linear
relationship among the variables. It is a statistical model used to understand the
association between independent variables (X) and a dependent variable (Y).
Linear regression is a simple and widely used algorithm. It is a supervised ML
algorithm for predictive analysis. It models the relationship between the independent
predictors and the dependent outcome variable using a linear equation.
If there is more than one independent variable, it is called multiple linear regression
and is expressed as follows:
Y = β0 + β1x1 + β2x2 + … + βnxn
where x1, …, xn denote the explanatory variables, β1, β2, …, βn are the slopes of the
regression line, and β0 is the Y-intercept of the regression line.
Regression line of Y on X: Gives the most probable Y values from the given
values of X.
Regression line of X on Y: Gives the most probable X values from the given
values of Y.
Example: How can a university student's GPA be predicted from his/her high school
percentage (HSP) of marks?
Consider a sample of ten students for whom both the GPA and the high school
percentage (HSP) are known. Assume linear regression. Then
GPA = b1·HSP + A
A simple linear regression plot for the relationship between college GPA and high
school percentage places HSP on the x-axis and GPA on the y-axis.
Whenever a perfect linear relationship between GPA and high school score exists, all
10 points on the graph would fall on a straight line.
Whenever an imperfect linear relationship exists between these two variables, a
cluster of points that slopes upward may be obtained on the graph.
In other words, students who got more marks in high school should tend to get a
higher GPA in college as well.
A simple linear regression can have two candidate regression lines with different
regression equations. In the scatter plot, either of the two lines might best summarize
the relation between GPA and high school percentage.
The following notation can be used for examining which of the two lines is a better fit:
1. yi denotes the observed response for experimental unit i.
2. xi denotes the predictor value for experimental unit i.
3. ŷi is the predicted response (or fitted value) for experimental unit i.
The error (residual) for experimental unit i is then
ei = yi - ŷi
and the best-fitting line is the one that makes these errors, taken together, as small as possible.
Usually, regression lines are used in the financial sector and for business procedures.
Financial analysts use regression techniques to predict stock prices, commodities, etc.
whereas business analysts use them to forecast sales, inventories, and so on.
The best way to fit a line is by minimizing the sum of squared errors, i.e., the squared
distance between the predicted values and the actual values. The least squares method is
the process of fitting the best curve to a set of data points. The quantity to minimize is
SSE = Σ (yi - ŷi)²
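A minimal sketch of a least squares line fit, assuming Python with NumPy and the same made-up HSP/GPA-style data as above:

```python
# Fitting a best-fit line by least squares and checking the SSE.
import numpy as np

x = np.array([65.0, 70.0, 75.0, 80.0, 85.0, 90.0])
y = np.array([6.1, 6.8, 7.0, 7.9, 8.2, 9.0])

# np.polyfit with deg=1 minimizes the sum of squared errors sum((y_i - y_hat_i)**2)
b1, b0 = np.polyfit(x, y, deg=1)        # slope and intercept
y_hat = b1 * x + b0                     # fitted values
residuals = y - y_hat                   # e_i = y_i - y_hat_i
sse = np.sum(residuals ** 2)            # the quantity being minimized
print(f"slope={b1:.3f}, intercept={b0:.3f}, SSE={sse:.4f}")
```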
Polynomial regression
You may have noticed in the equations above that the power of the independent variable
was one (Y = m*x + c). When the power of the independent variable is more than one, it
is referred to as polynomial regression (e.g., Y = m*x^2 + c).
Since the degree is not 1, the best-fit line is no longer a straight line. Instead, it is
a curve that fits the data points.
Sometimes this can result in overfitting or underfitting due to a higher degree of the
polynomial. Therefore, always plot the relationships to make sure the curve is just
right and not overfitted or underfitted.
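A minimal sketch of polynomial regression, again assuming NumPy; the data below are made-up points that follow a roughly quadratic pattern plus noise:

```python
# Fitting polynomials of different degrees to illustrate under/overfitting.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 0.5 * x**2 + x + 2 + rng.normal(scale=0.5, size=x.size)   # quadratic trend + noise

coeffs = np.polyfit(x, y, deg=2)        # degree 2 -> a curved best fit
model = np.poly1d(coeffs)               # convenient callable polynomial
print(coeffs)                           # [a, b, c] for y = a*x^2 + b*x + c
print(model(1.5))                       # prediction at x = 1.5

# Comparing the SSE of degree-1, degree-2, and degree-6 fits hints at
# underfitting (too low a degree) versus overfitting (too high a degree).
for d in (1, 2, 6):
    sse = np.sum((y - np.poly1d(np.polyfit(x, y, d))(x)) ** 2)
    print(d, round(sse, 2))
```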
https://www.geeksforgeeks.org/non-linear-regression-examples-ml/
Regression models range from simple models to highly complex equations. Two primary uses
for regression are forecasting and optimization.
- Using linear analysis on monthly sales data, a company could forecast sales for future
months.
- A financial company may be interested in minimizing its risk portfolio and hence
want to understand the top five factors or reasons for default by a customer.
- Predicting the characteristics of a child based on the characteristics of their parents.
- Predicting the prices of houses, considering the locality and builder characteristics
in a particular city.
Logistic Regression:
Regression models traditionally work with continuous numeric values for the dependent
and independent variables.
Logistic regression, however, can work with dependent variables that have categorical
values, such as whether a loan is approved or not. Logistic regression measures the
relationship between a categorical dependent variable and one or more independent
variables.
Let's see how logistic regression squeezes the output into the 0-1 range. We already
know that the equation of the best-fit line is y = β0 + β1x. Logistic regression passes
this value through the sigmoid function, p = 1 / (1 + e^-(β0 + β1x)), so the output can
be interpreted as a probability between 0 and 1.
For example, logistic regression might be used to predict whether a patient has a
given disease (e.g., diabetes) based on observed characteristics of the patient (age,
gender, body mass index, results of blood tests, etc.).
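A minimal sketch of this idea, assuming Python with scikit-learn (not named in the notes); the tiny "patient" dataset below is entirely made up:

```python
# Logistic regression for a yes/no outcome (disease vs. no disease).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, body mass index (illustrative features only)
X = np.array([[25, 22.0], [40, 27.5], [52, 31.0], [61, 29.5],
              [35, 24.0], [58, 33.0], [47, 28.0], [30, 23.5]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # 1 = has the disease, 0 = does not

clf = LogisticRegression().fit(X, y)

new_patient = np.array([[50, 30.0]])
print(clf.predict(new_patient))          # predicted class label (0 or 1)
print(clf.predict_proba(new_patient))    # probabilities, squeezed into 0-1 by the sigmoid
```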
Ridge regression
In real-world scenarios, we will never see a case where the variables are perfectly
independent; multicollinearity will always occur in real data. Here, the least squares
method fails to produce good results: although it gives unbiased estimates, their
variances are large, which can push the estimated values far from the true values.
Ridge regression adds a penalty to models with high variance, shrinking the beta
coefficients toward zero, which helps avoid overfitting.
In linear regression, we minimize the cost function (the sum of squared errors).
Remember that the goal of a model is to have low variance and low bias. To achieve
this, ridge regression adds another term to the cost function of linear regression,
built from "lambda" and the "slope" coefficients:
cost = Σ (yi - ŷi)² + λ Σ βj²
Lasso regression
Lasso (least absolute shrinkage and selection operator) regression is very similar to
ridge regression. It can reduce the variability and improve the accuracy of linear
regression models, and in addition it helps us perform feature selection. Instead of
squared coefficients, it uses absolute values in the penalty function.
In the ridge regression explained above, the coefficients only get close to zero: they
shrink toward zero but never reach it exactly. In lasso regression, however,
coefficients with small values can be driven all the way to zero, and the corresponding
features are effectively removed. This means those features are not important for
predicting the best-fit line, which is how lasso performs feature selection.
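A minimal sketch contrasting ridge and lasso, assuming scikit-learn and a made-up dataset in which only the first two features matter:

```python
# Ridge shrinks coefficients toward zero; lasso can set some exactly to zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),     # alpha plays the role of lambda
                    ("lasso", Lasso(alpha=0.5))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
# With this setup, lasso typically drives the three irrelevant coefficients to
# exactly 0.0, which is why it can double as a feature-selection step.
```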
How to select the right regression analysis model
To select the best model, it's important to focus on the dimensionality of the data and
other essential characteristics. The following cautions apply:
1. Linearity assumption
2. Association And/Or correlation do not mean Causation
3. Extrapolation
4. Outliers and influential points
Linearity
Remember, it is always important to plot a scatter diagram first. If the scatter plot
indicates that there is a linear relationship between the variables, then it is reasonable
to use the methods we are discussing.
Association and/or correlation do not mean causation
Even when we do have an apparent linear relationship and find a reasonable value of r,
there can always be confounding or lurking variables at work. Be wary of spurious
correlations and make sure the connection you are making makes sense!
There are also often situations where it may not be clear which variable is causing
which. Does lack of sleep lead to higher stress levels, or do high stress levels lead to
lack of sleep? Which came first, the chicken or the egg? Sometimes these questions may
not be answerable, but at least we are able to show that an association is there.
Extrapolation
Remember, it is always important to plot a scatter diagram first. If the scatter plot
indicates that there is a linear relationship between the variables, then it is reasonable
to use a best-fit line to make predictions for y given x within the domain of x-values in
the sample data, but not necessarily for x-values outside that domain. The process of
predicting inside the range of x-values observed in the data is called interpolation.
The process of predicting outside the range of x-values observed in the data is called
extrapolation.
Outliers and influential points
In some data sets, there are values (observed data points) that may appear to be
outliers in x or y. Outliers are points that seem to stick out from the rest of the group
in a single variable. Besides outliers, a sample may contain one or a few points that are
called influential points. Influential points are observed data points that do not follow
the trend of the rest of the data. These points may have a big effect on the calculation
of the slope of the regression line. To begin to identify an influential point, you can
remove it from the data set and see if the slope of the regression line changes
significantly.
How do we handle these unusual points? Sometimes they should not be included in
the analysis of the data. It is possible that an outlier or influential point is a result of
erroneous data. Other times it may hold valuable information about the population
under study and should remain included in the data. The key is to examine carefully
what causes a data point to be an outlier and/or influential point.
Computers and many calculators can be used to identify outliers from the data.
Computer output for regression analysis will often identify both outliers and
influential points so that you can examine them.
We know how to find outliers in a single variable using fence rules and boxplots.
However, we would like some guideline as to how far away from the best-fit line a point
needs to be in order to be considered unusual. Such points have large "errors", where the
"error" or residual is the vertical distance from the line to the point. As a rough rule
of thumb, we can flag any point that is located further than two standard deviations
above or below the best-fit line as an outlier. The standard deviation used is the
standard deviation of the residuals or errors.
We can do this visually in the scatter plot by drawing an extra pair of lines that are
two standard deviations above and below the best-fit line. Any data points that are
outside this extra pair of lines are flagged as potential outliers. Or we can do this
numerically by calculating each residual and comparing it to twice the standard
deviation. The graphical procedure is shown in the example below, followed by the
numerical calculations in the next example. You would generally need to use only one
of these methods.
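A minimal sketch of the numerical version of this rule, assuming NumPy and a made-up data set with one deliberately unusual point:

```python
# Flag points whose residual is more than two standard deviations from the line.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 30.0, 16.1, 18.0, 19.9])  # 30.0 sticks out

b1, b0 = np.polyfit(x, y, deg=1)          # best-fit line
residuals = y - (b1 * x + b0)             # vertical distances from the line
s = residuals.std()                       # standard deviation of the residuals

outlier_mask = np.abs(residuals) > 2 * s  # more than 2s above or below the line
print(np.where(outlier_mask)[0])          # index of the flagged point(s)
```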
Introduction
Decision Tree
Naive Bayes
k-Nearest Neighbours
Support Vector machine
Neural Networks
Random Forest
Neural Network
First, there is neural network. It is a set of algorithms that attempt to identify the underlying
relationships in a data set through a process that mimics how human brain operates. In data
science, neural networks help to cluster and classify complex relationship. Neural networks could
be used to group unlabelled data according to similarities among the example inputs and classify
data when they have a labelled dataset to train on.
K-Nearest Neighbors
KNN (K-Nearest Neighbors) is one of many algorithms used in data mining and machine
learning. KNN is a classifier algorithm in which the learning is based on the similarity
of a data point (a vector) to others. It stores all available cases and classifies new
cases based on a similarity measure (e.g., distance functions).
Decision Tree
The decision tree algorithm belongs to the supervised learning algorithms. It can be
used to solve regression as well as classification problems. A decision tree builds
classification or regression models in the form of a tree structure. It breaks a dataset
down into smaller and smaller subsets while, at the same time, an associated decision
tree is incrementally developed. The purpose of using the decision tree algorithm is to
predict the class or value of a target variable by learning simple decision rules
concluded from prior data.
Random Forest
Random forests are an ensemble learning method for classification, regression, and other
tasks that operates by constructing multiple decision trees at training time. For a
classification task, the output of the random forest is the class selected by most trees.
For a regression task, the mean or average prediction of the individual trees is
returned. Random forests generally outperform single decision trees but have lower
accuracy than gradient boosted trees. However, the characteristics of the data can affect
their performance.
Naïve Bayes
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of
independence between predictors. In simple terms, the Naive Bayes classifier assumes
that the presence of a particular feature in a class is unrelated to the presence of any
other feature. It updates its knowledge step by step as new information arrives.
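A minimal sketch comparing several of the classifiers named above, assuming scikit-learn and its built-in iris dataset; the accuracies are illustrative only:

```python
# Train and score a few classifiers on the same small dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "naive bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # accuracy on held-out data
```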
Decision Tree
https://www.mastersindatascience.org/learning/machine-learning-algorithms/decision-tree/#:~:text=A%20decision%20tree%20is%20a,that%20contains%20the%20desired%20categorization.
Naive Bayes
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
SVM
https://www.spiceworks.com/tech/big-data/articles/what-is-support-vector-machine/#:~:text=A%20support%20vector%20machine%20(SVM)%20is%20a%20machine%20learning%20algorithm,classes%2C%20labels%2C%20or%20outputs.
KNN:
https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/#:~:text=The%20K%2DNearest%20Neighbors%20(KNN)%20algorithm%20is%20a%20popular,have%20similar%20labels%20or%20values.
Unit - 4
We can predict:
Daily Stock Price
Weekly interest rates
Sales figures
where the outcome (the dependent variable) depends on time. In such scenarios, we use
time series forecasting.
Date        Close
1/4/2017    139.92
2/4/2017    139.58
3/4/2017    139.59
4/4/2017    141.42
5/4/2017    140.96
6/4/2017    142.27
7/4/2017    143.81
8/4/2017    142.62
9/4/2017    143.47
The stock prices change every day. A time series has four components:
Trend
Seasonality
Cyclicity
Irregularity
Trend:
Trend is the increase or decrease in the series over a period of time. It persists over a
long period of time.
Example: Population growth over the years can be seen as an upward trend.
Seasonality:
Seasonality is a regular pattern of up-and-down fluctuations; it is a short-term
variation occurring due to seasonal factors.
Cyclicity:
Cyclicity is a medium-term variation caused by circumstances that repeat at irregular
intervals.
Irregularity:
Irregularity refers to variations that occur due to unpredictable factors and do not
repeat in particular patterns.
There are various conditions under which you should not use time series methods; in
particular, many of them assume the series is stationary.
How do you differentiate between a stationary and a non-stationary time series?
The mean of the series should not be a function of time; it should be a constant.
The covariance of the i-th term and the (i+m)-th term should not be a function of time.
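A minimal sketch of a stationarity check, assuming the statsmodels library (not named in the notes) and a made-up random-walk series:

```python
# Augmented Dickey-Fuller (ADF) test for stationarity.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
series = np.cumsum(rng.normal(size=200))   # a random walk: its mean drifts with time

stat, p_value = adfuller(series)[:2]
print(f"ADF statistic={stat:.3f}, p-value={p_value:.3f}")
# A large p-value (> 0.05) means we cannot reject non-stationarity;
# differencing the series (np.diff) is the usual fix before modelling.
```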
Examples of time series data:
Rainfall measurements
Stock prices
Number of sunspots
Annual retail sales
Monthly subscribers
Heartbeats per minute
A typical time series forecasting workflow involves the following steps (a sketch of this workflow follows the list):
Data Import
Data Cleaning
Stationarity Check
Model Training
Prediction
Tuning
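A minimal sketch of this workflow, assuming pandas and statsmodels, a made-up price series, and an illustrative ARIMA(1, 1, 1) model (the order is just a starting point, not a recommendation):

```python
# Import -> stationarity handling (d=1 differences once) -> train -> predict.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
dates = pd.date_range("2017-04-01", periods=120, freq="D")
close = 140 + np.cumsum(rng.normal(scale=0.8, size=120))   # synthetic "close" prices
series = pd.Series(close, index=dates)

train, test = series[:100], series[100:]                   # hold out the last 20 days

model = ARIMA(train, order=(1, 1, 1)).fit()                # d=1 handles the trend
forecast = model.forecast(steps=len(test))                 # predict the held-out days

print(forecast.head())
print("mean absolute error:", np.abs(forecast.values - test.values).mean())
```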
Text Analysis:
Text analysis is the process of using computer systems to read and understand
human-written text for business insights.
Text analysis software can independently classify, sort, and extract
information from text to identify patterns, relationships, sentiments, and other
actionable knowledge.
You can use text analysis to efficiently and accurately process multiple text-based
sources such as emails, documents, social media content, and product reviews, much as a
human would.
Businesses use text analysis to extract actionable insights from various unstructured
data sources. They depend on feedback from sources like emails, social media, and
customer survey responses to aid decision making. However, the immense volume of
text from such sources proves to be overwhelming without text analytics software.
With text analysis, you can get accurate information from the sources more
quickly. The process is fully automated and consistent, and it displays data you can
act on. For example, using text analysis software allows you to immediately detect
negative sentiment in social media posts so you can work to solve the problem.
Sentiment analysis
Sentiment analysis or opinion mining uses text analysis methods to understand the
opinion conveyed in a piece of text.
You can use sentiment analysis of reviews, blogs, forums, and other online
media to determine if your customers are happy with their purchases.
Sentiment analysis helps you spot new trends, track sentiment changes, and
tackle PR issues.
By using sentiment analysis and identifying specific keywords, you can track
changes in customer opinion and identify the root cause of the problem.
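A minimal sketch of sentiment analysis, assuming NLTK's VADER analyzer (one of several possible tools) and two made-up review snippets:

```python
# Scoring the sentiment of short review texts with VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

reviews = [
    "Absolutely love this product, it exceeded my expectations!",
    "Terrible experience, the package arrived late and damaged.",
]
for text in reviews:
    scores = sia.polarity_scores(text)       # neg/neu/pos plus a compound score
    print(scores["compound"], text)          # compound > 0 is positive, < 0 negative
```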
Record management
Text analysis also supports managing, categorizing, and searching large collections of
documents. For example, LexisNexis Legal & Professional uses text extraction to identify
specific records among 200 million documents.
You can use text analysis software to process emails, reviews, chats, and other text-
based correspondence.
With insights about customers’ preferences, buying habits, and overall brand
perception, you can tailor personalized experiences for different customer segments.
Text analysis software works on the principles of deep learning and natural
language processing.
Deep learning
It uses linguistic models and statistics to train the deep learning technology to
process and analyze text data, including handwritten text images.
Text classification
Text extraction
Topic modeling
PII redaction
Text classification
In text classification, the text analysis software learns how to associate certain
keywords with specific topics, users' intentions, or sentiments. It does so by using
linguistic models such as Naive Bayes, Support Vector Machines, and deep learning to
process the structured data, categorize words, and develop a semantic understanding
between them.
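A minimal sketch of text classification, assuming scikit-learn and a tiny made-up labelled dataset; a bag-of-words vectorizer feeds a Naive Bayes classifier:

```python
# Classifying short texts into "complaint" vs. "praise".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "the delivery was late and the box was damaged",
    "great quality, very happy with this purchase",
    "refund please, the item never arrived",
    "fast shipping and excellent customer service",
]
labels = ["complaint", "praise", "complaint", "praise"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["the parcel arrived broken and late"]))   # likely "complaint"
```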
Text extraction
Text extraction scans the text and pulls out key information. It can identify keywords,
product attributes, brand names, names of places, and more in a piece of text. The
extraction software applies methods such as regular expressions and conditional random
fields (CRFs).
For example, you can use text extraction to monitor brand mentions on social media.
Manually tracking every occurrence of your brand on social media is impossible.
Text extraction will alert you to mentions of your brand in real time.
Topic modeling
Topic modeling methods identify and group related keywords that occur in an
unstructured text into a topic or theme. These methods can read multiple text
documents and sort them into themes based on the frequency of various words in the
document. Topic modeling methods give context for further analysis of the documents.
For example, you can use topic modeling methods to read through your scanned
document archive and classify documents into invoices, legal documents, and
customer agreements. Then you can run different analysis methods on invoices to
gain financial insights or on customer agreements to gain customer insights.
PII redaction
PII redaction automatically detects and removes personally identifiable information
(PII) such as names, addresses, or account numbers from a document. PII redaction
helps protect privacy and comply with local laws and regulations.
For example, you can analyze support tickets and knowledge articles to detect and
redact PII before you index the documents in the search solution. After that, search
solutions are free of PII in documents.
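A minimal, deliberately simplistic sketch of rule-based PII redaction using regular expressions; the sample text and patterns are made up, and real systems also use trained entity recognizers:

```python
# Replace simple PII patterns (email, phone, account number) with labels.
import re

TEXT = "Contact Jane at [email protected] or call 555-123-4567 about account 9876543210."

patterns = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
    "ACCOUNT": r"\b\d{10}\b",
}

redacted = TEXT
for label, pattern in patterns.items():
    redacted = re.sub(pattern, f"[{label} REDACTED]", redacted)

print(redacted)
```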
Stage 1—Data gathering
In this stage, you gather text data from internal or external sources.
Internal data
Internal data is text content that is internal to your business and is readily available—
for example, emails, chats, invoices, and employee surveys.
External data
You can find external data in sources such as social media posts, online reviews, news
articles, and online forums. It is harder to acquire external data because it is beyond
your control. You might need to use web scraping tools or integrate with third-party
solutions to extract external data.
Stage 2—Data preparation
Data preparation is an essential part of text analysis. It involves structuring raw text
data in an acceptable format for analysis. The text analysis software automates the
process using the following common natural language processing (NLP) methods.
Tokenization
Tokenization is segregating the raw text into multiple parts that make semantic sense.
For example, the phrase text analytics benefits businesses tokenizes to the
words text, analytics, benefits, and businesses.
Part-of-speech tagging
Part-of-speech tagging assigns grammatical tags to the tokenized text. For example,
applying this step to the previously mentioned tokens results in text: Noun; analytics:
Noun; benefits: Verb; businesses: Noun.
Parsing
Parsing establishes meaningful connections between the tokenized words with
English grammar. It helps the text analysis software visualize the relationship between
words.
Lemmatization
Lemmatization is a linguistic process that simplifies words into their dictionary form,
or lemma. For example, the dictionary form of visualizing is visualize.
Stop-word removal
Stop words are words that offer little or no semantic context to a sentence, such
as and, or, and for. Depending on the use case, the software might remove them from
the structured text.
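A minimal sketch of these preparation steps, assuming NLTK (resource names can vary slightly between NLTK versions):

```python
# Tokenization, part-of-speech tagging, lemmatization, and stop-word removal.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"):
    nltk.download(pkg, quiet=True)        # one-time downloads of NLTK resources

text = "Text analytics benefits businesses by visualizing customer feedback"

tokens = nltk.word_tokenize(text)         # tokenization
tagged = nltk.pos_tag(tokens)             # part-of-speech tagging
print(tagged)

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]  # e.g. visualizing -> visualize
print(lemmas)

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]        # stop-word removal
print(filtered)
```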
Stage 3—Text analysis
Text analysis is the core part of the process, in which the text analysis software
processes the text by using different methods.
Text classification
Classification is the process of assigning tags to the text data that are based on rules or
machine learning-based systems.
Text extraction
Extraction involves identifying the presence of specific keywords in the text and
associating them with tags. The software uses methods such as regular expressions
and conditional random fields (CRFs) to do this.
Stage 4—Visualization
Visualization is about turning the text analysis results into an easily understandable
format. You will find text analytics results in graphs, charts, and tables. The
visualized results help you identify patterns and trends and build action plans. For
example, suppose you’re getting a spike in product returns, but you have trouble
finding the causes. With visualization, you look for words such as defects, wrong size,
or not a good fit in the feedback and tabulate them into a chart. Then you’ll know
which is the major issue that takes top priority.
Text analytics helps you determine if there’s a particular trend or pattern from the
results of analyzing thousands of pieces of feedback. Meanwhile, you can use text
analysis to determine whether a customer’s feedback is positive or negative.
Text analysis vs. text mining
There is no difference between text analysis and text mining. Both terms refer to the
same process of gaining valuable insights from sources such as email, survey
responses, and social media feeds.