Unit 3
• Univariate non-graphical
• Multivariate non-graphical
• Univariate graphical
• Multivariate graphical.
Univariate non-graphical EDA
• 1. Univariate Non-graphical:
• This is the simplest form of data analysis because it examines just one
variable at a time.
• The usual goal of univariate non-graphical EDA is to understand the underlying
sample distribution and to make observations about the population.
• Outlier detection is also part of the analysis.
• The characteristics of population distribution include:
• Categorical data: The characteristics of interest for a categorical variable are simply
the range of values and the frequency (or relative frequency) of occurrence for each
value
• Therefore the only useful univariate non-graphical technique for categorical variables
is some form of tabulation of the frequencies, usually along with calculation of the
proportion (or percent) of data that falls in each category.
• Central tendency: The central tendency or location of a distribution concerns its
typical or middle values. The commonly useful measures of central tendency are
the mean, the median, and sometimes the mode, of which the mean is the most
common.
• Spread: Spread indicates how far from the center we are likely to find the data
values. The standard deviation and variance are two useful measures of spread.
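A minimal sketch of these univariate non-graphical summaries, using only the Python standard library. The sample data (`colors`, `values`) are hypothetical and invented purely for illustration.

```python
from collections import Counter
import statistics

# Hypothetical sample data, invented only for illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]   # categorical
values = [2, 4, 4, 4, 5, 5, 7, 9]                          # quantitative

# Categorical: tabulate frequency and relative frequency of each value.
counts = Counter(colors)
rel_freq = {level: n / len(colors) for level, n in counts.items()}

# Quantitative: central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Quantitative: spread (sample variance and standard deviation, n - 1 denominator).
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

print(counts)
print(rel_freq)
print(mean, median, mode, variance, std_dev)
```

The sample (n - 1) versions of variance and standard deviation are used here because the stated goal is to make observations about the population from a sample; `statistics.pvariance` and `statistics.pstdev` give the population versions.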
Multivariate non-graphical EDA
• Multivariate non-graphical EDA techniques are generally used to show the
relationship between two or more variables in the form of either
cross-tabulation or statistics.
• For categorical data, an extension of tabulation called cross-tabulation is
extremely useful. For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one variable
and row headings that match the levels of the other variable.
• Cross-tabulation
• For categorical data (and quantitative data with only a few different values) an
extension of tabulation called cross-tabulation is very useful. For two
variables, cross-tabulation is performed by making a two-way table with
column headings that match the levels of one variable and row headings that
match the levels of the other variable, then filling in the counts of all subjects
that share a pair of levels.
• The two variables might be both explanatory, both outcome, or one of each.
• We can easily see that the total number of young females is 2, and we can
calculate, e.g., the corresponding cell percentage is 2/11 × 100 = 18.2%, the
row percentage is 2/5×100 = 40.0%, and the column percentage is 2/7×100 =
28.6%.
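The percentage calculations above can be sketched in Python. The records below are hypothetical: they were invented so that the margins agree with the quoted figures (11 subjects, 5 of them young, 7 of them female, 2 young females). The split of the remaining subjects into "middle" and "old" groups is an assumption, and age group is assumed to be the row variable.

```python
from collections import Counter

# Hypothetical records (age group, sex), invented so the margins match the
# figures quoted above: 11 subjects, 5 young, 7 female, 2 young females.
records = (
    [("young", "female")] * 2 + [("young", "male")] * 3 +
    [("middle", "female")] * 2 + [("middle", "male")] * 1 +
    [("old", "female")] * 3
)

cell = Counter(records)                          # counts per (row, column) pair
row_total = Counter(age for age, _ in records)   # row margins (age group)
col_total = Counter(sex for _, sex in records)   # column margins (sex)
n = len(records)

def percentages(age, sex):
    """Return (cell %, row %, column %) for one cell of the two-way table."""
    count = cell[(age, sex)]
    return (100 * count / n,
            100 * count / row_total[age],
            100 * count / col_total[sex])

cell_pct, row_pct, col_pct = percentages("young", "female")
print(f"{cell_pct:.1f}% {row_pct:.1f}% {col_pct:.1f}%")   # 18.2% 40.0% 28.6%
```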
Univariate graphical EDA
• Univariate graphical: Non-graphical methods are quantitative and objective, but they
cannot give the complete picture of the data; graphical methods, which involve a
degree of subjective analysis, are therefore also required. Common types of
univariate graphics are:
• Histogram: The most basic graph is the histogram, which is a barplot in
which each bar represents the frequency (count) or proportion (count/total count) of
cases for a range of values. Histograms are one of the best ways to quickly learn a
lot about your data, including central tendency, spread, modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the boxplot.
Boxplots are excellent at presenting information about central tendency and show robust
measures of location and spread, as well as providing information about symmetry and
outliers, although they can be misleading about aspects such as multimodality. One of
the best uses of boxplots is in the form of side-by-side boxplots.
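A rough sketch of what a histogram and a boxplot compute, printed as text with the Python standard library. The data are hypothetical, with 40 planted as an outlier. The 1.5 × IQR rule for flagging outliers is the common boxplot convention and is an assumption here, since the slides do not specify one.

```python
import statistics

# Hypothetical measurements, invented for illustration; 40 is a planted outlier.
data = [12, 15, 11, 14, 13, 16, 15, 14, 13, 40]

def text_histogram(values, bin_width):
    """Count the values falling in each bin [start, start + bin_width)."""
    bins = {}
    for v in values:
        start = (v // bin_width) * bin_width
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items()))

def boxplot_outliers(values):
    """Flag points more than 1.5 IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

width = 5
for start, count in text_histogram(data, width).items():
    # One '#' per case in the bin; empty bins are simply not listed.
    print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
print("outliers:", boxplot_outliers(data))
```

Note that different software computes quartiles slightly differently, so the exact whisker limits can vary between tools.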
Multivariate graphical EDA
• Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. For two categorical variables, the only
one commonly used is a grouped barplot, with each group representing one level of one of
the variables and each bar within a group representing the levels of the other variable.
• Other common sorts of multivariate graphics are:
• Scatterplot: For two quantitative variables, the basic graphical EDA technique is
the scatterplot, which has one variable on the x-axis, one on the y-axis, and a
point for each case in your dataset.
• Run chart: It’s a line graph of data plotted over time.
• Heat map: It’s a graphical representation of data where values are depicted by color.
• Multivariate chart: It’s a graphical representation of the relationships between
factors and response.
• Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
Exploratory Data Analysis (EDA)
HYPOTHESIS TESTING VERSUS EXPLORATORY DATA ANALYSIS
Hypothesis tests assess relationships between variables.
E.g. Cell-phone executives are interested in whether a recent
increase in the fee structure has led to a decrease in market
share.
In this case, the analyst would test the hypothesis that market
share has decreased, and would therefore use hypothesis
testing procedures.
Many statistical hypothesis testing procedures are available.
Especially when confronted with unknown, large databases,
analysts often prefer to use Exploratory Data Analysis (EDA), or
graphical data analysis.
A higher proportion of customers without the Voice Mail Plan are churners, as compared
to customers who do have the Voice Mail Plan.
Indicates that the churn rate for customers logging nine service calls is 100%;
Shows that there are only two customers with this number of calls
Records above this diagonal line (customers with high day minutes and high evening
minutes) have a higher proportion of churners than records below the line.
Consider the records inside the rectangular partition, which indicates a high-churn area.
These records represent a combination of a high number of customer service calls and
a low number of day minutes used.
This group of customers could not have been identified with univariate exploration.
Graphical EDA can uncover subsets of records that call for further
investigation
About 65% (115 of 177) of the selected records are churners
• Those with high customer service calls and low day minutes have a
65% probability of churning
The high evening minutes group has nearly double the churn proportion of the low
evening minutes group.
Customers with both high day minutes and high evening minutes churn at a greater rate.
It would be nice to quantify this claim.
The idea is to
1. estimate the equation of the straight line;
2. use the equation to separate the records (so the method is portable to other data sets).
Estimate the equation of the line
ŷ = 400 − 0.6x
Data points above the line will have HighDayEveMins_Flag=1, while data
points below the line will have HighDayEveMins_Flag=0.
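A minimal sketch of deriving the flag from the estimated line. It assumes that x is day minutes and y is evening minutes, which the slides imply but do not state outright. The function name and the sample records are hypothetical.

```python
def high_day_eve_mins_flag(day_mins, eve_mins):
    """Return 1 if the record lies above the line y = 400 - 0.6x, else 0.

    Assumption: x is day minutes and y is evening minutes.
    Records exactly on the line are treated as "below" (flag 0).
    """
    return 1 if eve_mins > 400 - 0.6 * day_mins else 0

# Hypothetical records (day minutes, evening minutes), invented for illustration.
records = [(300.0, 250.0), (150.0, 200.0), (250.0, 260.0)]
flags = [high_day_eve_mins_flag(d, e) for d, e in records]
print(flags)   # [1, 0, 1]
```

Because the rule is just an inequality on two fields, the same function can be applied unchanged to another data set, which is the portability the slides mention.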
As day charge is perfectly correlated with day minutes, eliminate one of the two.
Also eliminate evening charge, night charge, and international charge.
Had we proceeded to the modelling phase without first uncovering these correlations, our
models may have returned incoherent results.
This reduces the number of predictors from 20 to 16.
The dimensionality of the solution space is reduced, so an optimal solution can be found more efficiently.
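One way to find such redundant predictors is to compute pairwise correlations and drop one member of each near-perfectly correlated pair. The sketch below uses only the standard library. The column values, the 0.17 per-minute rate used to build `day_charge`, and the 0.999 threshold are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.999):
    """Keep the first column of every pair whose |r| exceeds the threshold."""
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(b)
    return [name for name in columns if name not in dropped]

# Hypothetical data: day_charge is an exact multiple of day_mins.
day_mins = [100.0, 200.0, 150.0, 300.0, 250.0]
columns = {
    "day_mins": day_mins,
    "day_charge": [0.17 * m for m in day_mins],   # hypothetical flat rate
    "eve_mins": [220.0, 180.0, 260.0, 150.0, 240.0],
}
print(drop_redundant(columns))   # ['day_mins', 'eve_mins']
```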