Descriptive Statistics


What do the numbers tell?

Why Study Statistics
• Technological developments, the Internet and social networks, and
electronic devices generate large amounts of data.
• Storage capacity has grown large enough to retain this data.
• Advances in computing power make it possible to process and analyze
large amounts of data effectively.
• Business Intelligence tools provide better data visualization.
• Discovering patterns and trends in this data can help organizations
gain a competitive advantage in the marketplace.
Types of Statistics
• Descriptive statistics is concerned with data summarization through
graphs, charts, and tables.

• Inferential statistics is the set of methods used to draw conclusions
about a population parameter from a sample. It involves point
estimation, interval estimation, and hypothesis testing.
Some Key Terms
• Population is the collection of all possible observations of a
specified characteristic of interest.
• Sample is a subset of the population.
• Parameter is the population characteristic of interest. For example,
suppose you are interested in the average income of a particular class
of people. The average income of this entire class of people is called
a parameter.
• Statistic is a quantity computed from a sample and used to make
inferences about the population parameter. The average income of the
population can be estimated by the average income of the sample. This
sample average is called a statistic.
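The distinction between a parameter and a statistic can be sketched in a few lines of Python. The income figures below are invented purely for illustration and are not from the slides; the seed is fixed only so that the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: incomes of every person in the class of interest.
# These values are invented for illustration and do not come from the slides.
population = [30_000 + 500 * i for i in range(1_000)]

# Parameter: a characteristic computed from the whole population.
parameter = mean(population)

# Sample: a random subset of the population (fixed seed for repeatability).
rng = random.Random(42)
sample = rng.sample(population, 50)

# Statistic: computed from the sample and used to estimate the parameter.
statistic = mean(sample)

print(f"population mean (parameter): {parameter}")
print(f"sample mean (statistic):     {statistic}")
```

In practice the full population is rarely available, which is why the statistic is used as an estimate of the parameter.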
Types of Data

Data
• Categorical (Qualitative). E.g. gender, location of store, preference
• Numerical (Quantitative)
  - Discrete. E.g. family size, number of rooms in a hotel, number of
    credit cards issued
  - Continuous. E.g. waiting time, length of a part produced
Measurement Scales
• Nominal: e.g. Internet service provider

• Ordinal: e.g. bond rating, employee designation

• Interval: e.g. IQ score, temperature in °C or °F

• Ratio: e.g. cost of an item


Measures of Central Tendency
• You need summary measures of central tendency to draw meaningful
conclusions in the functional area of operation.
• The most widely used measures of central tendency are the arithmetic
mean, the median, and the mode.
Arithmetic Mean
• The arithmetic mean (called the mean) is defined as the sum of all
observations in a data set divided by the total number of
observations.
• In symbolic form, for a data set containing the n observations
$x_1, x_2, \ldots, x_n$, the mean is given by
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Arithmetic Mean - Example
• The inner diameters of a particular grade of tire, based on 5
sample measurements, are as follows (figures in millimetres):

565, 570, 572, 568, 585

Applying the formula, we get
mean = (565 + 570 + 572 + 568 + 585)/5 = 2860/5 = 572

• Caution: The arithmetic mean is affected by extreme values and by
fluctuations in sampling. It is not the best average to use when the
data set contains extreme values (very high or very low values).
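The tire calculation, and the caution about extreme values, can be checked with a minimal Python sketch. The extra reading of 700 mm is a hypothetical value added only to show the effect; it is not part of the original data.

```python
from statistics import mean

# Tire inner diameters in millimetres, taken from the example above.
diameters = [565, 570, 572, 568, 585]
tire_mean = mean(diameters)  # (565 + 570 + 572 + 568 + 585) / 5

# Hypothetical extreme reading (not in the original data) to show how
# a single very high value pulls the mean upward.
with_extreme = diameters + [700]
extreme_mean = mean(with_extreme)

print(tire_mean)
print(extreme_mean)
```

One extreme reading moves the mean well above every one of the five original measurements except the largest, which is why the mean is a poor summary when such values are present.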
Median

• The median is the middlemost observation when you arrange the data in
ascending order of magnitude. The median is such that 50% of the
observations are above it and 50% of the observations are below it.
• The median is a very useful measure for ranked data in the context of
preferences and ratings. It is not affected by extreme values (greater
resistance to outliers).
• Median = ((n+1)/2)th value of the ranked data, where n is the number
of observations in the sample. When n is even, (n+1)/2 is not a whole
number, and the median is the average of the two middle values.
Median - Example

• Marks obtained by 7 students in a computer science exam are given
below. Compute the median.
45 40 60 80 90 65 55
• Arranging the data in ascending order:
40 45 55 60 65 80 90
• Median = ((n+1)/2)th value in this set = ((7+1)/2)th observation
= 4th observation = 60
• Hence median = 60 for this problem.
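The same steps can be sketched in Python: rank the marks, pick the ((n+1)/2)th value, and compare with the standard library's `median`. The positional shortcut below assumes an odd number of observations, as in this example.

```python
from statistics import median

# Marks of the 7 students from the example above.
marks = [45, 40, 60, 80, 90, 65, 55]

ranked = sorted(marks)           # arrange in ascending order
n = len(ranked)
position = (n + 1) // 2          # (7 + 1) / 2 = 4th observation (n is odd here)
manual_median = ranked[position - 1]   # lists are 0-indexed

print(ranked)
print(manual_median, median(marks))
```

For an even number of observations, `statistics.median` averages the two middle values, which the simple positional lookup above does not handle.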
Mode
• The mode is the value that occurs most often; it has the maximum
frequency of occurrence. The mode is also resistant to outliers.
• The mode is a very useful measure when, for example, you want to
decide which shirt collar size is the most popular so that you can
keep it in inventory during the festival season.
• Caution: In some real-life problems there is more than one mode
(bimodal or multimodal data). In these cases the mode cannot be
uniquely determined.
Mode - Example

• The lives, in hours, of 10 flashlight batteries are as follows. Find
the mode.

• 340 340 350 350 340 340 320 340 330 330

• 340 occurs five times. Hence, mode = 340.
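A short Python sketch of the frequency count behind this answer. The second data set is hypothetical and is included only to illustrate the bimodal caution; `statistics.multimode` returns every value tied for the highest frequency.

```python
from collections import Counter
from statistics import multimode

# Battery lives in hours, from the example above.
battery_life = [340, 340, 350, 350, 340, 340, 320, 340, 330, 330]

counts = Counter(battery_life)       # frequency of each distinct value
modes = multimode(battery_life)      # all values with the maximum frequency

# Hypothetical bimodal data (not from the slides): two values tie.
bimodal_modes = multimode([1, 1, 2, 2, 3])

print(counts)
print(modes)
print(bimodal_modes)
```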


Comparison of Mean, Median and Mode
• Mean: Affected by extreme values. Can be treated algebraically; that
is, means of several groups can be combined.
• Median: Not affected by extreme values. Cannot be treated
algebraically; that is, medians of several groups cannot be combined.
• Mode: Not affected by extreme values. Cannot be treated
algebraically; that is, modes of several groups cannot be combined.
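What "treated algebraically" means can be shown with a small sketch. The two groups below are hypothetical. The size-weighted average of the group means reproduces the mean of the pooled data, while the same recipe applied to the medians does not reproduce the pooled median.

```python
from statistics import mean, median

# Two hypothetical groups, invented for illustration.
group_a = [1, 2, 30]
group_b = [4, 5, 6, 7, 8]
pooled = group_a + group_b

n_a, n_b = len(group_a), len(group_b)

# Means combine: weight each group mean by its group size.
combined_mean = (n_a * mean(group_a) + n_b * mean(group_b)) / (n_a + n_b)
pooled_mean = mean(pooled)

# The same weighting applied to medians does not give the pooled median.
weighted_medians = (n_a * median(group_a) + n_b * median(group_b)) / (n_a + n_b)
pooled_median = median(pooled)

print(combined_mean, pooled_mean)
print(weighted_medians, pooled_median)
```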
Measures of Dispersion
• In simple terms, measures of dispersion indicate how large the spread
of the distribution is around the central tendency.
• They answer unambiguously the question: "What is the magnitude of
departure from the average value for different groups having identical
averages?"
Range
• Range is the simplest of all the measures of dispersion. It is
calculated as the difference between the maximum and minimum values in
the data set.

Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$
Range - Example

The following data represent the percentage return on investment per
annum for 10 mutual funds. Calculate the range.
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$ = 18 - 9 = 9
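A minimal Python check of the range calculation. The list uses the ten mutual fund returns that this deck gives in its standard deviation example; the range depends only on the largest and smallest values.

```python
# Percentage returns for the mutual funds (the ten values listed in the
# deck's standard deviation example).
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

x_maximum = max(returns)
x_minimum = min(returns)
data_range = x_maximum - x_minimum   # Range = X_Maximum - X_Minimum

print(x_maximum, x_minimum, data_range)
```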
Inter-Quartile Range (IQR)

• IQR is the range computed on the middle 50% of the observations,
after eliminating the highest 25% and the lowest 25% of the
observations in a data set arranged in ascending order. IQR is less
affected by outliers.

• IQR = Q3 - Q1, where Q1 is the first quartile and Q3 is the third
quartile.
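Interquartile Range - Example

• The following data represent the percentage return on investment per
annum for 9 mutual funds. Calculate the interquartile range.
• Data set: 12, 14, 11, 18, 11.5, 12, 14, 11, 9
• Arranging in ascending order, the data set becomes
9, 11, 11, 11.5, 12, 12, 14, 14, 18
• Using the common convention that places Q1 at position (n+1)/4 = 2.5
and Q3 at position 3(n+1)/4 = 7.5, Q1 is the average of the 2nd and
3rd values, (11 + 11)/2 = 11, and Q3 is the average of the 7th and 8th
values, (14 + 14)/2 = 14.
• IQR = Q3 - Q1 = 14 - 11 = 3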
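The quartiles can be sketched with Python's `statistics.quantiles`. Its default `"exclusive"` method places the quartiles at positions based on (n+1), which agrees with the answer above; other tools and conventions can give slightly different quartiles for the same data, so treat the choice of method as an assumption.

```python
from statistics import quantiles

# Percentage returns for the 9 mutual funds from the example above.
returns = [12, 14, 11, 18, 11.5, 12, 14, 11, 9]

# n=4 splits the data into quartiles and returns the three cut points.
# The default method="exclusive" uses positions based on (n + 1).
q1, q2, q3 = quantiles(returns, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```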
Standard deviation
• Standard deviation forms the cornerstone of inferential statistics.

• To define standard deviation, you first need to define another term
called variance. In simple terms, standard deviation is the square
root of variance.
Example of Standard Deviation
• The following data represent the percentage return on investment per
annum for 10 mutual funds. Calculate the sample standard deviation.

• 12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Solution for the example (cont.)
From the Microsoft Excel spreadsheet in the previous slide, it is easy
to see:

Sample mean $\bar{x}$ = 12.28 (seen in column A, row 14)

Sample variance $s^2$ = 6.33 (seen in column D, row 14)

Sample standard deviation $s = \sqrt{s^2}$ = 2.52 (seen in column D,
row 15)
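The spreadsheet itself is not reproduced in the text, so the figures can be rechecked with a short Python sketch. It assumes the usual sample definitions, with the sum of squared deviations divided by n - 1, which is consistent with the 6.33 and 2.52 reported above.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Percentage returns for the 10 mutual funds from the example above.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

n = len(returns)
x_bar = mean(returns)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in returns]
sample_variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sample_sd = sqrt(sample_variance)

print(round(x_bar, 2), round(sample_variance, 2), round(sample_sd, 2))

# Cross-check against the standard library's sample variance and stdev.
print(round(variance(returns), 2), round(stdev(returns), 2))
```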
A histogram (also known as a frequency histogram) is a snapshot of the frequency distribution.

A histogram is a graphical representation of the frequency distribution in which the X-axis represents the
classes and the Y-axis represents the frequencies, drawn as bars.

A histogram depicts the pattern of the distribution emerging from the characteristic being measured.
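As a rough sketch of how classes and frequencies fit together, the code below builds a frequency distribution for the ten mutual fund returns used elsewhere in this deck and prints it as a text histogram. The class limits and the class width of 2 are assumptions chosen for illustration; the slides do not specify them.

```python
from collections import Counter

# Percentage returns for the 10 mutual funds used elsewhere in the deck.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

LOWER = 8    # assumed lower limit of the first class
UPPER = 20   # assumed upper limit of the last class
WIDTH = 2    # assumed class width

def class_start(x):
    """Return the lower limit of the class [start, start + WIDTH) containing x."""
    return LOWER + WIDTH * int((x - LOWER) // WIDTH)

# Frequency of observations in each class.
frequencies = Counter(class_start(x) for x in returns)

# One row per class: the class on the left, a bar of '#' for the frequency.
for start in range(LOWER, UPPER, WIDTH):
    count = frequencies[start]
    print(f"{start:>2} to under {start + WIDTH:<2} | {'#' * count} ({count})")
```

Here the rows play the role of the X-axis classes and the bar lengths play the role of the Y-axis frequencies.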
The Empirical Rule
• The empirical rule approximates the variation of data in a
bell-shaped distribution.
• Approximately 68% of the data in a bell-shaped distribution lies
within one standard deviation of the mean, or $\mu \pm 1\sigma$.
• Approximately 95% of the data in a bell-shaped distribution lies
within two standard deviations of the mean, or $\mu \pm 2\sigma$.
• Approximately 99.73% of the data in a bell-shaped distribution lies
within three standard deviations of the mean, or $\mu \pm 3\sigma$.
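These percentages can be reproduced with a small sketch, assuming the bell-shaped distribution is exactly normal (real data only approximates this). The helper name `share_within` is introduced here for illustration.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is modelled as a standard normal.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```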


The five number summary
• The five numbers that help describe the center, spread and
shape of the data are:
  - $X_{\text{smallest}}$
  - First Quartile (Q1)
  - Median (Q2)
  - Third Quartile (Q3)
  - $X_{\text{largest}}$
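A minimal sketch of computing the five numbers in Python, applied to the nine mutual fund returns from the interquartile range example. The function name `five_number_summary` is hypothetical, and the quartiles rely on the default `"exclusive"` method of `statistics.quantiles`; other quartile conventions may give slightly different values.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4)   # default method="exclusive"
    return (min(data), q1, q2, q3, max(data))

# Percentage returns for the 9 mutual funds from the IQR example.
returns = [12, 14, 11, 18, 11.5, 12, 14, 11, 9]

summary = five_number_summary(returns)
print(summary)
```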
CASE STUDY - HEALTH INSURANCE

• Most companies now recognize the power of data in making crucial
business decisions. For an insurance company, it becomes even more important to
study various attributes of its customers. Leveraging this customer
information to make business decisions can provide a competitive edge over
other players in the market.

• We are provided with customer data from an insurance company, such as age,
gender, BMI, and the medical charges billed by the insurance company. We need to
explore this data to see if we can derive meaningful insights from it.
Five number summary and The Boxplot
• The Boxplot: A graphical display of the data based on the
five-number summary.
Shape of Boxplots
• If the data is symmetric around the median, then the box and
central line are centered between the endpoints.

• A Boxplot can be shown in either a vertical or horizontal
orientation.
Distribution shape and The Boxplots
Boxplot Example
• Below is a Boxplot for the following data:

• The data are right-skewed, as the plot depicts.


Boxplot example showing an outlier
• The Boxplot below of the same data shows the
outlier value of 27 plotted separately.
• A value is considered an outlier if it is more than 1.5 times the IQR
below Q1 or above Q3.
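The 1.5 times IQR rule can be sketched as follows. The data behind the slide's boxplot is not reproduced in the text, so the data set here is hypothetical, chosen only so that it contains the value 27; the quartiles use the default `"exclusive"` method of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical data containing one unusually large value (27).
# This is illustrative only and is not the data from the slide.
data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 9, 27]

q1, _, q3 = quantiles(data, n=4)   # default method="exclusive"
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR below Q1 or above Q3 is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

Boxplot tools typically draw the whiskers out to the most extreme values inside these fences and plot anything beyond them as separate points.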
