Unit 3
Exploratory Data Analysis

• Exploratory data analysis, or “EDA”, is a critical first step in analyzing
the data from an experiment. Here are the main reasons we use EDA:
• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables,
and
• assessing the direction and rough size of relationships
between explanatory and outcome variables.
Typical data format and the types of EDA
• The data from an experiment are generally collected into a rectangular
array (e.g., spreadsheet or database), most commonly with one row per
experimental subject and one column for each subject identifier, outcome
variable, and explanatory variable.
• Each column contains the numeric values for a particular quantitative
variable or the levels for a categorical variable.
• People are not very good at looking at a column of numbers or a whole
spreadsheet and then determining important characteristics of the data.
They find looking at numbers to be tedious, boring, and/or overwhelming.
• Exploratory data analysis techniques have been devised as an aid in this
situation. Most of these techniques work in part by hiding certain aspects of
the data while making other aspects more clear.
• Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. Second, each method is
either univariate or multivariate.
• Non-graphical methods generally involve calculation of summary statistics,
while graphical methods summarize the data in a diagrammatic or
pictorial way.
• Univariate methods look at one variable (data column) at a time, while
multivariate methods look at two or more variables at a time to explore
relationships.
The four types of EDA
• The four types of EDA are
• Univariate non-graphical
• Multivariate non-graphical
• Univariate graphical
• Multivariate graphical.
Univariate non-graphical EDA
• This is the simplest form of data analysis, because it uses just one
variable at a time to examine the data.
• The usual goal of univariate non-graphical EDA is to understand the underlying
sample distribution and to make tentative observations about the population.
• Outlier detection is also part of this analysis.
• The characteristics of the population distribution of interest include:
• Categorical data: The characteristics of interest for a categorical variable are simply
the range of values and the frequency (or relative frequency) of occurrence of each
value. The only useful univariate non-graphical technique for categorical variables
is therefore some form of tabulation of the frequencies, usually along with calculation
of the relative frequency of each value.
• Central tendency: The central tendency, or location, of a distribution concerns its
typical or middle values. The commonly used measures of central tendency are
the mean, the median, and sometimes the mode, of which the mean is the most
common.
• Spread: Spread indicates how far from the center we are likely to find the data
values. The standard deviation and the variance are two useful measures of spread.
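As a minimal sketch of these univariate non-graphical summaries, the following uses only the Python standard library. The variable names and the data values are invented for illustration and do not come from any data set discussed here.

```python
from collections import Counter
import statistics

# Hypothetical sample data, invented for illustration only.
plan = ["yes", "no", "no", "no", "yes", "no", "no", "no"]            # categorical
minutes = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0]  # quantitative

# Categorical variable: tabulate frequencies and relative frequencies.
counts = Counter(plan)
rel_freq = {level: n / len(plan) for level, n in counts.items()}
print(counts)
print(rel_freq)

# Quantitative variable: central tendency and spread.
mean = statistics.mean(minutes)
median = statistics.median(minutes)
variance = statistics.variance(minutes)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(minutes)       # sample standard deviation
print(f"mean={mean:.2f} median={median:.2f} variance={variance:.2f} sd={std_dev:.2f}")
```

The sample (n − 1) versions of the variance and standard deviation are used here on the assumption that the data are a sample from a larger population.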
Multivariate non-graphical EDA
• Multivariate non-graphical EDA techniques are generally used to show the
relationship between two or more variables, in the form of either
cross-tabulation or statistics.
• Cross-tabulation
• For categorical data (and quantitative data with only a few different values) an
extension of tabulation called cross-tabulation is very useful. For two
variables, cross-tabulation is performed by making a two-way table with
column headings that match the levels of one variable and row headings that
match the levels of the other variable, then filling in the counts of all subjects
that share a pair of levels.
• The two variables might be both explanatory, both outcome, or one of each.
• In an example two-way table of age group by sex, we can easily see that the
number of young females is 2, and we can calculate, e.g., that the corresponding
cell percentage is 2/11 × 100 = 18.2%, the row percentage is 2/5 × 100 = 40.0%,
and the column percentage is 2/7 × 100 = 28.6%.
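The cell, row, and column percentages above can be reproduced with a small cross-tabulation sketch. The individual records below are hypothetical: they were chosen only so that the totals agree with the quoted figures (11 subjects, 5 young, 7 female, 2 young females), and the remaining cell counts are assumptions.

```python
from collections import Counter

# Hypothetical subjects (age group, sex). Only the totals 11, 5, 7 and the
# count of 2 young females are taken from the text; the rest is invented.
subjects = (
    [("young", "female")] * 2 + [("young", "male")] * 3
    + [("middle", "female")] * 2 + [("middle", "male")] * 1
    + [("old", "female")] * 3
)

cells = Counter(subjects)                          # count per (row, column) pair
row_totals = Counter(age for age, _ in subjects)   # totals per age group
col_totals = Counter(sex for _, sex in subjects)   # totals per sex
total = len(subjects)

# Print the two-way table of counts, one row per age group.
for age in ("young", "middle", "old"):
    print(age, [cells[(age, sex)] for sex in ("female", "male")])

n = cells[("young", "female")]
cell_pct = 100 * n / total
row_pct = 100 * n / row_totals["young"]
col_pct = 100 * n / col_totals["female"]
print(f"cell {cell_pct:.1f}%  row {row_pct:.1f}%  column {col_pct:.1f}%")
```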
Univariate graphical EDA
• Univariate graphical: Non-graphical methods are quantitative and objective, but they do
not give a complete picture of the data. Graphical methods, which involve a degree of
subjective analysis, are therefore also required. Common types of univariate graphics are:
• Histogram: The most basic graph is the histogram, which is a barplot in which each bar
represents the frequency (count) or proportion (count/total count) of cases for a range
of values. Histograms are one of the best ways to quickly learn a lot about your data,
including central tendency, spread, modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
• Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots
are excellent at presenting information about central tendency and show robust measures
of location and spread, as well as providing information about symmetry and outliers,
although they can be misleading about aspects such as multimodality. One of the best
uses of boxplots is in the form of side-by-side boxplots.
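To make the histogram and boxplot ideas concrete, here is a rough text-only sketch: it counts cases per equal-width bin (what a histogram draws) and computes the quartiles that a boxplot is built from. The data values and the bin width are assumptions chosen for illustration.

```python
import statistics

# Hypothetical measurements, invented for illustration.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Histogram: count the cases falling in each equal-width bin.
bin_width = 2
bins = {}
for v in values:
    start = (v // bin_width) * bin_width        # left edge of the bin containing v
    bins[start] = bins.get(start, 0) + 1
for start in sorted(bins):
    print(f"[{start}, {start + bin_width}): {'#' * bins[start]}")

# Boxplot ingredients: median and quartiles (robust location and spread).
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                                   # interquartile range
print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr}")
```

Even in this crude form, the isolated bin on the right hints at a possible outlier, which is the kind of feature these plots are meant to expose.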
Multivariate graphical EDA
• Multivariate graphical: Multivariate graphical EDA uses graphics to display
relationships between two or more sets of data. For categorical data, the only one
commonly used is a grouped barplot, with each group representing one level of one of
the variables and each bar within a group representing the levels of the other variable.
• Other common types of multivariate graphics are:
• Scatterplot: For two quantitative variables, the basic graphical EDA technique is
the scatterplot, which has one variable on the x-axis, one on the y-axis, and a
point for each case in your dataset.
• Run chart: It’s a line graph of data plotted over time.
• Heat map: It’s a graphical representation of data where values are depicted by color.
• Multivariate chart: It’s a graphical representation of the relationships between
factors and response.
• Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
Exploratory Data Analysis (EDA)
HYPOTHESIS TESTING VERSUS EXPLORATORY DATA ANALYSIS
Hypothesis testing examines relationships between variables.
• E.g., cell-phone executives are interested in whether a recent
increase in the fee structure has led to a decrease in market
share.
In this case, the analyst would test the hypothesis that market
share has decreased, and would therefore use hypothesis
testing procedures.
• Many statistical hypothesis testing procedures are available.
• Especially when confronted with unknown, large databases,
analysts often prefer to use Exploratory Data Analysis (EDA), or
graphical data analysis.
• Exploratory Data Analysis (EDA) is that
part of statistical practice concerned with
reviewing, communicating and using data
where there is a low level of knowledge
about its cause system.
• Many EDA techniques have been adopted
into data mining and are being taught to
young students as a way to introduce
them to statistical thinking.
- www.wikipedia.org
Objectives of EDA
EDA allows the analyst to:
• delve into the data set;
• examine interrelationships among
attributes;
• identify interesting subsets of the
observations;
• develop an initial idea of possible
associations amongst the predictors, as well
as between the predictors and the target
variable.
HYPOTHESIS TESTING VERSUS EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) is an approach/philosophy for
data analysis that employs a variety of techniques (mostly
graphical and statistical) to
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop accurate models.

GETTING TO KNOW THE DATA SET
Graphs, plots, and tables often uncover important
relationships.
• These relationships could indicate important areas for
further investigation.
• We will use exploratory methods to delve into the churn
data set from the UCI Repository of Machine Learning
Databases at the University of California, Irvine.
• Churn, also called attrition, is a term used to indicate a
customer leaving the service of a company.

Churn data set
The data set contains 20 predictors.
• State: Categorical, for the 50 states and the District of Columbia.
• Account length: Integer-valued, how long account has been
active.
• Area code: Categorical
• Phone number: Essentially a surrogate for customer ID.
• International plan: categorical, yes or no.
• Voice mail plan: categorical, yes or no.
• Number of voice mail messages: Integer-valued.
• Total day minutes: Continuous, minutes customer used service
during the day.
• Total day calls: Integer-valued.
• Total day charge: Continuous, perhaps based on above two
variables.

• Total eve minutes: Continuous, minutes customer used service during
the evening.
• Total eve calls: Integer-valued.
• Total eve charge: Continuous, based on above two variables.
• Total night minutes: Continuous, minutes customer used service during
the night.
• Total night calls: Integer-valued.
• Total night charge: Continuous, perhaps based on above two variables.
• Total international minutes: Continuous, minutes customer used
service to make international calls.
• Total international calls: Integer-valued.
• Total international charge: Continuous, based on above two variables.
• Number of calls to customer service: Integer-valued.
• Churn: Target. Indicator of whether customer has left company (true
or false).

Field values of the first 10 records in the churn data set

Summarization and visualization of the churn data set

Getting a feel for the churn data
• The variable Phone uses only seven digits.
• There are two flag variables.
• Most of our variables are continuous.
• The response variable Churn is a flag variable
having two values, True and False.

EXPLORING CATEGORICAL VARIABLES
• The bar graph shows the counts and percentages of
customers who churned (true) and who did not churn
(false).
• Only a minority (14.49%) of our customers have left
service.
• Our task is to identify patterns in the data that will help to
reduce the proportion of churners.

The primary reasons for performing EDA are
• to investigate the variables,
• to examine the distributions of the categorical
variables,
• to look at the histograms of the numeric
variables, and
• to explore the relationships among sets of
variables.
The overall objective is to develop a model of the type
of customer likely to churn.
Investigation of categorical variable International Plan
Comparison bar chart of churn proportions, by
international plan participation
A greater proportion of International Plan holders appear to be
churning, but it is difficult to be sure.

Comparison bar chart of churn proportions, by international plan
participation, with equal bar length.
Clearly, those who have selected the International Plan have a
greater chance of leaving the company’s service.

• The graphics above tell us that International Plan holders tend to churn more
frequently, but they do not quantify the relationship.
• To quantify it, use a contingency table, as both variables are categorical.

The graphical counterpart of the contingency table is the
clustered bar chart.
Clearly, the proportion of churners is greater among those
belonging to the International Plan.
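A contingency table of two categorical variables can be sketched as follows. The records are hypothetical and deliberately tiny; they are not the counts from the churn data set, and the names `churn_rate_by_plan` and `plan_share_by_churn` are illustrative only. The sketch shows the two ways of taking percentages: the proportion of churners within each plan group, and the proportion of plan holders within each churn group.

```python
from collections import Counter

# Hypothetical (international_plan, churned) records, invented for illustration;
# these are NOT the counts from the churn data set.
records = (
    [("no", False)] * 8 + [("no", True)] * 1
    + [("yes", False)] * 2 + [("yes", True)] * 1
)

table = Counter(records)                                   # contingency table of counts
plan_totals = Counter(plan for plan, _ in records)         # totals per plan value
churn_totals = Counter(churned for _, churned in records)  # totals per churn value

# Proportion of churners within each International Plan group.
churn_rate_by_plan = {p: table[(p, True)] / plan_totals[p] for p in ("no", "yes")}

# Proportion of International Plan holders within each churn group.
plan_share_by_churn = {c: table[("yes", c)] / churn_totals[c] for c in (False, True)}

for plan, rate in churn_rate_by_plan.items():
    print(f"International Plan = {plan}: {100 * rate:.1f}% churn")
for churned, share in plan_share_by_churn.items():
    print(f"Churn = {churned}: {100 * share:.1f}% hold the International Plan")
```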

Another useful graphic for comparing two categorical variables is the
comparative pie chart.

In contrast with the previous table, this contingency table gives row percentages.
The proportion of International Plan holders is greater among churners.

Comparative pie chart associated with the above table

To summarize, this EDA on the International Plan
has indicated that
1. perhaps we should investigate what it is about our
international plan that is inducing our customers
to leave;
2. we should expect that, whatever data
mining/machine learning algorithms we use to
predict churn, the model will probably include
whether or not the customer selected the
International Plan.

Let us now turn to the Voice Mail Plan.
A greater proportion of customers without the Voice Mail Plan are churners, as compared
to customers who do have the Voice Mail Plan.

To summarize, this EDA on the Voice Mail Plan has
indicated that
1. perhaps we should enhance our Voice Mail Plan still
further, or make it easier for customers to join it, as an
instrument for increasing customer loyalty;
2. whatever data mining/machine learning algorithms we
use to predict churn, the model will probably include
whether or not the customer selected the Voice Mail Plan,
although our confidence in this expectation is perhaps not
quite as high as for the International Plan.

• We may also explore the two-way interactions among categorical
variables with respect to churn.

Statistics for multilayer clustered bar chart

A directed web graph of the relationships between International Plan holders,
Voice Mail Plan holders, and churners
• Web graphs are graphical representations of the relationships between
categorical variables.
A greater proportion of International Plan holders choose to churn.

EXPLORING NUMERIC VARIABLES
• Next, we turn to an exploration of the numeric predictive variables.
• Unfortunately, the usual type of histogram does not help us
determine whether the predictor variables are associated with the
target variable.

• To explore whether a predictor is useful for predicting the target variable, use an
overlay histogram, which is a histogram where the rectangles are colored according
to the values of the target variable.

In the normalized histogram, “stretching out” the rectangles that have low counts
enables better definition and contrast.
Customers who called customer service three times or fewer have a lower churn rate.

Customers who called four or more times have a higher churn rate.
This EDA on the customer service calls has indicated that
1. we should carefully track the number of customer service calls made
by each customer. By the third call, specialized incentives
should be offered to retain customer loyalty, because, by
the fourth call, the probability of churn increases greatly;
2. whatever algorithms we use to predict churn, the model
will probably include the number of customer service
calls made by the customer.

Important note: Data analysts should always provide a non-normalized histogram along
with the normalized histogram, because the normalized histogram does not provide
any information on the frequency distribution of the variable.
For example, the normalized histogram indicates that the churn rate for customers
logging nine service calls is 100%, while the non-normalized histogram shows that
there are only two customers with this number of calls.
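The difference between the two views can be sketched numerically. The records below are hypothetical, constructed so that the group with nine calls has a 100% churn rate based on only two customers, as in the note above; everything else is invented.

```python
from collections import defaultdict

# Hypothetical (customer_service_calls, churned) records, invented for illustration.
records = [
    (0, False), (0, False), (0, False), (0, True),
    (1, False), (1, False), (1, False), (1, False), (1, True),
    (4, False), (4, True), (4, True),
    (9, True), (9, True),
]

# Non-normalized view: raw counts of churners and non-churners per value.
counts = defaultdict(lambda: {"churn": 0, "stay": 0})
for calls, churned in records:
    counts[calls]["churn" if churned else "stay"] += 1

# Normalized view: churn proportion per value, which hides the group size.
for calls in sorted(counts):
    c = counts[calls]
    n = c["churn"] + c["stay"]
    rate = c["churn"] / n
    print(f"calls={calls}: n={n:2d}  churn rate={100 * rate:5.1f}%")

n_nine = counts[9]["churn"] + counts[9]["stay"]
rate_nine = counts[9]["churn"] / n_nine
```

Printing `n` next to the rate plays the role of the non-normalized histogram: it shows how much data stands behind each proportion.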

Let us now turn to the Day Minutes

The normalized histogram of Day Minutes shows that high
day-users tend to churn at a higher rate. Therefore,
1. we should carefully track the number of day minutes used
by each customer. As the number of day minutes passes
200, we should consider special incentives;
2. we should investigate why heavy day-users are tempted to
leave;
3. we should expect that our eventual model will include day
minutes as a predictor of churn.

There is a slight tendency for customers with higher evening minutes
to churn.

The graph indicates that there is no obvious association
between churn and night minutes.

The lack of obvious association at the EDA stage between a predictor
and a target variable is not sufficient reason to omit that predictor
from the model.
For example, the histograms of the predictor International Calls with churn overlay
do not show strong graphical evidence of the predictive importance of International
Calls.
• However, a t-test for the difference in the mean number of international calls
between churners and non-churners is statistically significant.
• This variable is indeed useful for predicting churn:
churners tend to place a lower mean number of international calls.
• Had we omitted International Calls on the basis of the graphs alone, we would
have committed a mistake.

• A hypothesis test such as this t-test lies beyond the scope of EDA.

EXPLORING MULTIVARIATE RELATIONSHIPS
• Scatter plots can be used to examine possible
multivariate associations.
• In the scatter plot of day minutes against evening minutes, the records above the
diagonal line (customers with both high day minutes and high evening minutes) contain
a higher proportion of churners than the records below the line.

SELECTING INTERESTING SUBSETS OF THE DATA FOR FURTHER INVESTIGATION
• Consider the records inside the rectangular partition in the plot, which indicates a
high-churn area.
• These records represent the combination of a high number of customer service calls and
a low number of day minutes used.
• This group of customers could not have been identified with univariate exploration.

Graphical EDA can uncover subsets of records that call for further
investigation.
• About 65% (115 of 177) of the selected records are churners.
• That is, those with high customer service calls and low day minutes have a
65% probability of churning.

• Compare this to the records with high customer
service calls and high day minutes.
• About 26% of customers with high customer service
calls and high day minutes are churners.

To summarize, the strategy we implemented here is as follows:
1. Generate multivariate graphical EDA, such as scatter
plots with a flag overlay.
2. Use these plots to uncover subsets of interesting records.
3. Quantify the differences by analyzing the subsets of
records.
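Step 3 of this strategy, quantifying the difference between subsets, might look like the sketch below. The customer records, the field names, and the cut-off values `HIGH_CALLS` and `LOW_MINUTES` are all assumptions made for illustration; in practice the cut-offs would be read off the plot. The last line simply checks the 115 of 177 figure quoted earlier.

```python
# Hypothetical customer records, invented for illustration.
customers = [
    {"service_calls": 5, "day_minutes": 120.0, "churn": True},
    {"service_calls": 4, "day_minutes": 150.0, "churn": True},
    {"service_calls": 6, "day_minutes": 140.0, "churn": False},
    {"service_calls": 4, "day_minutes": 260.0, "churn": False},
    {"service_calls": 5, "day_minutes": 230.0, "churn": True},
    {"service_calls": 4, "day_minutes": 240.0, "churn": False},
    {"service_calls": 1, "day_minutes": 180.0, "churn": False},
    {"service_calls": 0, "day_minutes": 210.0, "churn": False},
]

# Assumed cut-offs for "high" calls and "low" minutes.
HIGH_CALLS = 4
LOW_MINUTES = 200.0

def churn_proportion(rows):
    """Return the fraction of churners among the given records."""
    return sum(r["churn"] for r in rows) / len(rows)

high_calls = [c for c in customers if c["service_calls"] >= HIGH_CALLS]
low_minutes = [c for c in high_calls if c["day_minutes"] < LOW_MINUTES]
high_minutes = [c for c in high_calls if c["day_minutes"] >= LOW_MINUTES]

print(f"high calls, low minutes:  {churn_proportion(low_minutes):.2f}")
print(f"high calls, high minutes: {churn_proportion(high_minutes):.2f}")

# The proportion quoted in the text: 115 churners among 177 selected records.
quoted = 115 / 177
print(f"quoted proportion: {100 * quoted:.0f}%")
```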

USING EDA TO UNCOVER ANOMALOUS FIELDS
• EDA can uncover strange or anomalous records or fields that the
earlier data cleaning phase may have missed.
• The area code field contains numerals, but it can also be treated as a categorical
variable, since area codes classify customers according to geographic
location.
• The field contains only three different values for all the records: 408, 415,
and 510.
• This would not be anomalous if the customers all lived in
California.
• However, the three area codes seem to be distributed more or less evenly across all
the states and the District of Columbia.
• A chi-square test has a p-value of 0.608, supporting the suspicion that the
area codes are distributed randomly across all the states.
• Domain experts might be able to explain this type of behavior, but it is also
possible that the field just contains bad data.
• Further communication with someone familiar with the data history, or a
domain expert, is called for.
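For reference, the Pearson chi-square statistic behind such a test of independence can be computed by hand as in the sketch below. The counts are hypothetical and cover only three states, so they will not reproduce the p-value quoted above; obtaining a p-value additionally requires the chi-square distribution, which is left to a statistics package.

```python
# Hypothetical counts of customers by state (rows) and area code (columns),
# invented for illustration; only three states are shown.
observed = {
    "CA": {"408": 10, "415": 20, "510": 10},
    "NY": {"408": 12, "415": 18, "510": 10},
    "TX": {"408": 8, "415": 22, "510": 10},
}

codes = ["408", "415", "510"]
row_totals = {s: sum(r.values()) for s, r in observed.items()}
col_totals = {c: sum(observed[s][c] for s in observed) for c in codes}
grand_total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total under independence.
chi_square = 0.0
for s in observed:
    for c in codes:
        expected = row_totals[s] * col_totals[c] / grand_total
        chi_square += (observed[s][c] - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(codes) - 1)
print(f"chi-square = {chi_square:.3f} on {dof} degrees of freedom")
```

A statistic that is small relative to its degrees of freedom corresponds to a large p-value, that is, to no evidence that area code depends on state.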

BINNING BASED ON PREDICTIVE VALUE
• Bin the customer service calls variable into two classes,
low (fewer than four) and high (four or more).
• This binning of customer service calls creates a flag variable
with two values, high and low.

• Next, we try to determine the relationship between evening minutes
and churn.
• Can we use binning to help tease out a signal from this
noise?

• Binning is an art, requiring judgment.
• Where can we insert boundaries between the bins that will
maximize the difference in churn proportions?
• Did the binning manage to tease out a signal?
• We can answer this by constructing a contingency table of
EveningMinutes_Bin with Churn.

• The high evening minutes group has nearly double the churn proportion of the low
evening minutes group.
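The binning of customer service calls described above can be sketched as follows. The boundary (fewer than four versus four or more) comes from the text; the function name `bin_service_calls` and the records are hypothetical.

```python
def bin_service_calls(calls):
    """Return 'low' for fewer than four calls and 'high' for four or more."""
    return "high" if calls >= 4 else "low"

# Hypothetical (customer_service_calls, churned) records, invented for illustration.
records = [(0, False), (1, False), (2, False), (3, True), (3, False),
           (4, True), (5, True), (6, False)]

# Churn counts within each bin, i.e. a small contingency table of bin by churn.
totals = {"low": 0, "high": 0}
churners = {"low": 0, "high": 0}
for calls, churned in records:
    label = bin_service_calls(calls)
    totals[label] += 1
    churners[label] += churned        # True counts as 1, False as 0

for label in ("low", "high"):
    print(f"{label}: {churners[label]}/{totals[label]} churned")
```

The same pattern applies to evening minutes once bin boundaries have been chosen, although where to place those boundaries remains a matter of judgment.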

DERIVING NEW VARIABLES: FLAG VARIABLES
• Deriving new variables is a data preparation activity.
• EDA may be used to assess the usefulness of the new derived variables in
predicting the target variable.
• For example, we may derive a flag variable from Voice Mail Messages.

• Derive a new VoiceMailMessages_Flag variable:
If Voice Mail Messages > 0 then
VoiceMailMessages_Flag = 1; otherwise VoiceMailMessages_Flag = 0.
• The results are exactly the same as for the Voice Mail Plan.

• VoiceMailMessages_Flag has values identical to those of Voice Mail Plan.
• This derived variable is therefore not useful for further analysis.
• Customers with both high day minutes and high evening minutes churn at a greater rate.
• It would be nice to quantify this claim.
• The idea is to
1. estimate the equation of the straight line;
2. use the equation to separate the records (a method that is portable to other data sets).

• Estimate the equation of the line:
ŷ = 400 − 0.6x
• Create a flag variable HighDayEveMins_Flag as follows:
If Day Minutes > 400 − 0.6 × Evening Minutes then
HighDayEveMins_Flag = 1; otherwise HighDayEveMins_Flag = 0.
• Data points above the line will have HighDayEveMins_Flag = 1, while the data
points below the line will have HighDayEveMins_Flag = 0.
• The records with HighDayEveMins_Flag = 1 show the highest churn proportion (70.4%).

• However, this 70.4% churn rate is restricted to a subset of fewer than 200 records.
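Both flag variables are simple threshold rules and can be written directly from the definitions above. In this sketch the function names are illustrative, and the example inputs are invented; y is taken to be Day Minutes and x to be Evening Minutes, as in the flag definition.

```python
def voice_mail_messages_flag(messages):
    """1 if the customer has any voice mail messages, otherwise 0."""
    return 1 if messages > 0 else 0

def high_day_eve_mins_flag(day_minutes, evening_minutes):
    """1 if the point lies above the line y = 400 - 0.6x, otherwise 0.

    Here y is Day Minutes and x is Evening Minutes.
    """
    return 1 if day_minutes > 400 - 0.6 * evening_minutes else 0

# Hypothetical customers, invented for illustration.
print(voice_mail_messages_flag(25), voice_mail_messages_flag(0))
print(high_day_eve_mins_flag(300.0, 250.0))   # 300 is above 400 - 150 = 250
print(high_day_eve_mins_flag(150.0, 200.0))   # 150 is below 400 - 120 = 280
```

Because the rule is an explicit equation rather than a region drawn on a plot, it can be applied unchanged to another data set.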

DERIVING NEW VARIABLES: NUMERICAL VARIABLES
• Suppose we want a new numerical variable that combines Customer Service Calls and
International Calls, and whose values will be the mean of the two fields.
• International Calls has a larger mean and standard deviation than Customer
Service Calls.
• International Calls would thereby be more heavily weighted.
• We therefore first need to standardize both fields, and then take the mean of the
standardized values to obtain CSCInternational_Z.
• EDA on CSCInternational_Z indicates that it will be useful for predicting churn.
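A minimal sketch of this derivation, assuming z-score standardization with the sample standard deviation, is given below. The data values and the helper `standardize` are hypothetical.

```python
import statistics

# Hypothetical values for four customers, invented for illustration.
service_calls = [1, 0, 4, 3]
international_calls = [3, 5, 2, 6]

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

csc_z = standardize(service_calls)
intl_z = standardize(international_calls)

# New variable: mean of the two standardized fields for each customer.
csc_international_z = [(a + b) / 2 for a, b in zip(csc_z, intl_z)]
print([round(z, 3) for z in csc_international_z])
```

After standardization both fields have mean 0 and standard deviation 1, so neither dominates the average.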

USING EDA TO INVESTIGATE CORRELATED PREDICTOR VARIABLES
• Two variables x and y are linearly correlated if an increase in
x is associated with either an increase in y or a decrease in y.
• The correlation coefficient r quantifies the strength and
direction of the linear relationship between x and y.
• The threshold for significance of the correlation coefficient r
depends not only on the sample size but also on the data mining
algorithm.
• Avoid feeding correlated variables to one’s data mining and
statistical models.
• Using correlated variables can cause the model to become
unstable and deliver unreliable results.

• The fact that two variables are correlated does not mean that we should omit
one of them.
• Strategy for handling correlated predictor variables at the EDA
stage:
1. Identify any variables that are perfectly correlated (i.e., r = 1.0
or r = −1.0). Do not retain both variables in the model, but rather
omit one.
2. Identify groups of variables that are correlated with each other.
Then, later, during the modelling phase, apply dimension-reduction
methods, such as Principal Components Analysis
(PCA), to these variables.
This strategy applies to uncovering correlation among the
predictors alone.
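The correlation coefficient r used in step 1 can be computed from its definition as in the following sketch. The function name `pearson_r` and the data are hypothetical; the charge values are constructed as an exact multiple of the minutes so that the pair is perfectly correlated.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration.
minutes = [100.0, 150.0, 200.0, 250.0]
charge = [17.0, 25.5, 34.0, 42.5]        # constructed as 0.17 * minutes
calls = [90, 120, 80, 110]

print(round(pearson_r(minutes, charge), 4))   # perfectly linear pair
print(round(pearson_r(minutes, calls), 4))    # much weaker relationship
```

A pair with r equal to 1.0 or −1.0 falls under step 1 of the strategy, so one of the two variables would be omitted.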
Correlated variables can be investigated using a matrix plot.

The correlation coefficient values and the p-values for each pairwise set of variables
• There does not seem to be any relationship between day minutes and day calls,
or between day calls and day charge. This is odd: one would have expected that, as the
number of calls increased, the number of minutes would tend to increase.
• There is a linear relationship between day minutes and day charge.

• Using Minitab’s regression tool, we may express this function as the estimated regression
equation: “Day charge equals 0.000613 plus 0.17 times day minutes.”
• As day charge is perfectly correlated with day minutes, eliminate one of the two.
• Also eliminate evening charge, night charge, and international charge.
• Had we proceeded to the modelling phase without first uncovering these correlations, our
models may have returned incoherent results.
• This reduces the number of predictors from 20 to 16.
• The dimensionality of the solution space is reduced, so that an optimal solution can be
found more efficiently.

• The data analyst should then turn to step 2 of the strategy, and identify any other
correlated predictors, for later handling with principal components analysis.
• The correlation of each numerical predictor with every other numerical predictor
should be checked, if feasible.
• Correlations with small p-values should be identified.
• The table shows a subset of this procedure.
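The regression of day charge on day minutes can be illustrated with ordinary least squares for a single predictor. This sketch assumes that the quoted equation expresses day charge as 0.000613 plus 0.17 times day minutes; the data points are hypothetical and generated from that rounded equation, so the fit simply recovers it.

```python
# Hypothetical (day_minutes, day_charge) pairs, invented for illustration and
# generated from the assumed equation: charge = 0.000613 + 0.17 * minutes.
day_minutes = [110.0, 161.6, 243.4, 299.4]
day_charge = [0.000613 + 0.17 * m for m in day_minutes]

# Ordinary least squares for a single predictor.
n = len(day_minutes)
mean_x = sum(day_minutes) / n
mean_y = sum(day_charge) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(day_minutes, day_charge))
    / sum((x - mean_x) ** 2 for x in day_minutes)
)
intercept = mean_y - slope * mean_x
print(f"day charge = {intercept:.6f} + {slope:.2f} * day minutes")
```

When one field is an exact linear function of another like this, it carries no additional information, which is why the charge fields are dropped.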

SUMMARY OF OUR EDA
• The four charge fields are linear functions of the minute fields, and
should be omitted.
• The area code field and/or the state field are anomalous, and should
be omitted until further clarification is obtained.
Insights with respect to churn are as follows:
• Customers with the International Plan tend to churn more frequently.
• Customers with the Voice Mail Plan tend to churn less frequently.
• Customers with four or more Customer Service Calls churn more than
four times as often as the other customers.
• Customers with both high Day Minutes and high Evening Minutes
churn at a rate about six times greater than the other customers.
• Customers with low Day Minutes and high Customer Service Calls
churn at a higher rate than the other customers.
• Customers with lower numbers of International Calls churn at a
higher rate than do customers with more international calls.
• For the remaining predictors, EDA uncovers no obvious association
with churn.

