Stat Chapter 5-9


Chapter Five

Measures of Central Tendency


5.1. Introduction

In this section we introduce some statistics that are used for describing the center of a set
of data values. To begin, suppose that we have a data set consisting of the n numerical
values x1, x2, . . . , xn. We will compute several different measures of center for this data set.

5.2. Mathematical Measures

A. Mean
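There are many types of mean, such as the arithmetic mean, geometric mean, and harmonic mean.

Arithmetic mean: It is the sum of all observations divided by the total number of observations. Let x1, x2, x3,
x4, …, xn be a set of data; then the arithmetic mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Geometric Mean: It is denoted by G.M. If x1, x2, …, xn are the given (positive) data, then the geometric
mean is given by

$$G.M. = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$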
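
The two means defined above can be checked numerically. The following is a minimal sketch; the data values are hypothetical and are not taken from the text.

```python
import math
import statistics

# Hypothetical positive data values (not from the text).
data = [2, 4, 8]

# Arithmetic mean: sum of the observations divided by their count.
arithmetic_mean = sum(data) / len(data)

# Geometric mean: n-th root of the product of the n observations.
geometric_mean = math.prod(data) ** (1 / len(data))

print("arithmetic mean:", arithmetic_mean)
print("geometric mean:", geometric_mean)
# The standard library offers the same quantities directly.
print(statistics.mean(data), statistics.geometric_mean(data))
```

For positive data the geometric mean never exceeds the arithmetic mean, which the sample values illustrate.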

5.3 Properties of Mean, Mode, and Median

Mean
1. It is the arithmetic average of the measurements in a data set.
2. There is only one mean for a data set.
3. Its value is influenced by extreme measurements.
4. Means of subsets can be combined to determine the mean of the complete data set.
5. It is applicable to quantitative data only.

Median
1. It is the central value; 50% of the measurements lie above it and 50% fall below it.
2. There is only one median for a data set.
3. It is not influenced by extreme measurements.
4. Medians of subsets cannot be combined to determine the median of the complete data set.
5. For grouped data, its value is rather stable even when the data are organized into different
categories.
6. It is applicable to quantitative data only.

Mode
1. It is the most frequent or probable measurement in the data set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subsets cannot be combined to determine the mode of the complete data set.
5. For grouped data its value can change depending on the categories used.
6. It is applicable for both qualitative and quantitative data.

5.4. Positional Measures

In addition to measures of central tendency and measures of variation, there are measures of position
or location. These measures include standard scores, percentiles, deciles, and quartiles. They are
used to locate the relative position of a data value in the data set. For example, if a value is located at
the 80th percentile, it means that 80% of the values fall below it in the distribution and 20% of the
values fall above it. The median is the value that corresponds to the 50th percentile, since one-half of
the values fall below it and one half of the values fall above it. This section discusses these measures
of position.

A) Standard scores

There is an old saying, “You can’t compare apples and oranges.” But with the use of statistics, it can
be done to some extent. Suppose that a student scored 90 on a music test and 45 on an English exam.
Direct comparison of raw scores is impossible, since the exams might not be equivalent in terms of
number of questions, value of each question, and so on. However, a comparison of a relative standard
similar to both can be made. This comparison uses the mean and standard deviation and is called a
standard score or z score.

A standard score or z score tells how many standard deviations a data value is above or below the
mean for a specific distribution of values. If a standard score is zero, then the data value is the same
as the mean. A z score or standard score for a value is obtained by subtracting the mean from the
value and dividing the result by the standard deviation. The symbol for a standard score is z.
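
The z score calculation can be sketched as follows. The raw scores 90 and 45 come from the music and English example above; the class means and standard deviations are hypothetical values chosen only for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Raw scores from the text; means and standard deviations are assumed.
music_z = z_score(90, mean=80, std_dev=10)
english_z = z_score(45, mean=35, std_dev=5)

print("music z score:", music_z)
print("English z score:", english_z)
```

Under these assumed values the English score has the larger z score, so it is the relatively better result even though the raw score is lower.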

B) Percentiles

Percentiles are position measures used in different fields to indicate the position of an individual
in a group. Percentiles divide the data set into 100 equal groups.
Percentiles are not the same as percentages. That is, if a student gets 72 correct answers out of a
possible 100, she obtains a percentage score of 72. There is no indication of her position with
respect to the rest of the class. She could have scored the highest, the lowest, or somewhere in
between. On the other hand, if a raw score of 72 corresponds to the 64th percentile, then she did
better than 64% of the students in her class.
Percentiles are symbolized by P1, P2, P3, . . . , P99 and divide the distribution into 100 groups.

Example 1: A teacher gives a 20-point test to 10 students. The scores are
18, 15, 12, 6, 8, 2, 3, 5, 20, 10. Find the percentile rank of a score of 12.
Solution:
Arrange the data in order from lowest to highest:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20. Then substitute into the formula.
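
A minimal sketch of the percentile rank calculation for Example 1. It assumes the commonly used formula, percentile rank = (number of values below X + 0.5) / (total number of values) × 100; other conventions exist and give slightly different answers.

```python
def percentile_rank(scores, x):
    """Percentile rank of x, assuming the (values below + 0.5) / n * 100 convention."""
    below = sum(1 for s in scores if s < x)
    return (below + 0.5) / len(scores) * 100

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
rank = percentile_rank(scores, 12)
print("percentile rank of 12:", rank)
```

Six of the ten scores fall below 12, so under this convention a score of 12 sits at about the 65th percentile.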

Example 2: Using the scores in Example 1, find the value corresponding to the 25th
percentile.
Solution:
Step 1 Arrange the data in order from lowest to highest.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20

Example 3: Using the data set in Example 1, find the value that corresponds to the 60th percentile.

Solution:
Step 1 Arrange the data in order from smallest to largest.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Step 2 Substitute in the formula.
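
Examples 2 and 3 can be checked with a short script. This sketch assumes a common textbook procedure: compute c = n·p/100; if c is not a whole number, round it up and take the value in that position; if c is a whole number, average the values in positions c and c + 1. The function name is hypothetical.

```python
def percentile_value(data, p):
    """Value at the p-th percentile (p an integer from 1 to 99), assumed convention."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(data)
    # position is the whole part of c = n * p / 100; remainder tells us if c is whole.
    position, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        # c is whole: average the c-th and (c + 1)-th values (1-based positions).
        return (ordered[position - 1] + ordered[position]) / 2
    # c is not whole: round up, i.e. take the (position + 1)-th value (1-based).
    return ordered[position]

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
p25 = percentile_value(scores, 25)
p60 = percentile_value(scores, 60)
print("25th percentile:", p25)
print("60th percentile:", p60)
```

For the 25th percentile c = 2.5, which rounds up to the 3rd value, 5. For the 60th percentile c = 6, so the answer is the average of the 6th and 7th values, (10 + 12) / 2 = 11.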

Finding Data Values Corresponding to Q1, Q2, and Q3
Step 1: Arrange the data in order from lowest to highest.
Step 2: Find the median of the data values. This is the value for Q2.
Step 3: Find the median of the data values that fall below Q2. This is the value for Q1.
Step 4: Find the median of the data values that fall above Q2. This is the value for Q3.

Example: Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.

Step 1 Arrange the data in order.


5, 6, 12, 13, 15, 18, 22, 50
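
The four steps above can be sketched in code as follows. The function name is hypothetical; when the number of values is odd, the middle value is excluded from both halves, matching "values that fall below Q2" and "values that fall above Q2".

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves steps."""
    ordered = sorted(data)                       # Step 1
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # values below Q2
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above Q2
    q2 = statistics.median(ordered)              # Step 2
    q1 = statistics.median(lower)                # Step 3
    q3 = statistics.median(upper)                # Step 4
    return q1, q2, q3

q1, q2, q3 = quartiles([15, 13, 6, 5, 12, 50, 22, 18])
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

For this data set Q2 = (13 + 15) / 2 = 14, Q1 = (6 + 12) / 2 = 9, and Q3 = (18 + 22) / 2 = 20.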

5.5. Symmetrical and Skewed Distributions


Frequency distributions can assume many shapes. The three most important shapes are
positively skewed, symmetric, and negatively skewed. In a positively skewed or right-skewed
distribution, the majority of the data values fall to the left of the mean and cluster at the lower
end of the distribution; the “tail” is to the right. Also, the mean is to the right of the median, and
the mode is to the left of the median.

For example, if an instructor gave an examination and most of the students did poorly, their
scores would tend to cluster on the left side of the distribution. A few high scores would
constitute the tail of the distribution, which would be on the right side. Another example of a
positively skewed distribution is the incomes of the population of the United States. Most of the
incomes cluster about the low end of the distribution; those with high incomes are in the minority
and are in the tail at the right of the distribution.

In a symmetric distribution, the data values are evenly distributed on both sides of the mean.
In addition, when the distribution is unimodal, the mean, median, and mode are the same and are
at the center of the distribution. Examples of symmetric distributions are IQ scores and heights
of adult males.

When the majority of the data values fall to the right of the mean and cluster at the upper end of
the distribution, with the tail to the left, the distribution is said to be negatively skewed or
left-skewed. Also, the mean is to the left of the median, and the mode is to the right of the median.
As an example, a negatively skewed distribution results if the majority of students score very
high on an instructor’s examination. These scores will tend to cluster to the right of the
distribution.
When a distribution is extremely skewed, the value of the mean will be pulled toward the tail,
but the majority of the data values will be greater than the mean or less than the mean
(depending on which way the data are skewed); hence, the median rather than the mean is a
more appropriate measure of central tendency. An extremely skewed distribution can also affect
other statistics.

Chapter Six
Measures of Variation

Just as measures of central tendency locate the “center” of a relative frequency distribution,
measures of variation measure its “spread”.
The most commonly used measures of data variation are the range, the variance, standard
deviation and coefficient of variation.

A. Range

Definition: The range of a quantitative data set is the difference between the largest and
smallest values in the set.
Range = Maximum - Minimum,
where, maximum = largest value, minimum = smallest value.

Example: Find the range of the following data set


52, 56, 84, 99, 21, 10, 11, 67, 68, 21, 99 and 100

Solution: Range = Maximum – Minimum; Range = 100-10 = 90

B. Variance and Standard Deviation

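Definition: The population variance of a population of observations x1, x2, …, xN is defined by the
formula

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where σ² = population variance, xi = the i-th item or observation, μ = population mean, and N = total
number of observations in the population. The population standard deviation σ is the positive square
root of the population variance.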

C. Coefficient of Variation

Definition: The coefficient of variation of a data set is the ratio of its standard deviation to its
mean, expressed as a percentage:
Coefficient of variation (CV) = (standard deviation / mean) × 100%

This definition applies to both populations and samples. The coefficient of variation is expressed as a
percent.

Exercise: Given the following data sets


A) 12, 6, 7, 3, 15
B) 9, 3, 8, 8
Find the following for each data set
a) Range of data set
b) Mean of data set
c) Variance of data set
d) Standard deviation for each
e) Coefficient of variation.
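
One way to check answers to this exercise is sketched below. It assumes each data set is treated as a complete population, so the variance divides by N; if the data are treated as samples, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be used instead. The function name is hypothetical.

```python
import statistics

def summarize(data):
    """Range, mean, population variance, population SD and CV (%) of a data set."""
    mean = statistics.mean(data)
    pop_sd = statistics.pstdev(data)
    return {
        "range": max(data) - min(data),
        "mean": mean,
        "population_variance": statistics.pvariance(data),
        "population_sd": pop_sd,
        "cv_percent": pop_sd / mean * 100,
    }

summary_a = summarize([12, 6, 7, 3, 15])
summary_b = summarize([9, 3, 8, 8])
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```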

Chapter Seven
Correlation and Simple Linear Regression
7.1. Correlation

Correlation Coefficient, R

• R is a measure of strength of the linear association between two variables, x and y.


• Most statistical packages and some hand calculators can calculate R
• For the data in our Example R=0.94
• R has some unique characteristics
• R takes values between -1 and +1
• R=0 represents no linear relationship between the two variables
• R>0 implies a direct linear relationship
• R<0 implies an inverse linear relationship
• The closer R comes to either +1 or -1, the stronger is the linear relationship

Coefficient of Determination, R2
• R2 is another important measure of linear association between x and y (0≤ R2 ≤ 1)
• R2 measures the proportion of the total variation in y which is explained by x
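
The properties of R and R2 listed above can be illustrated with a short computation. This is a minimal sketch, assuming R is the usual Pearson product-moment correlation coefficient; the paired data are hypothetical and are not the example data referred to in the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2
print("R =", round(r, 4), "R^2 =", round(r_squared, 4))
```

Here R is positive, indicating a direct linear relationship, and R2 = 0.6 means that about 60% of the variation in y is explained by x for this made-up data.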

Difference between Correlation and Regression

Correlation Coefficient, R, measures the strength of bivariate association


The regression line is a prediction equation that estimates the values of y for any given x

Limitations of the correlation coefficient

• Though R measures how closely the two variables approximate a straight line, it does not
validly measure the strength of a nonlinear relationship

• When the sample size, n, is small we also have to be careful with the reliability of the
correlation

• Outliers could have a marked effect on R


• A strong linear correlation does not by itself establish a causal relationship

7.2. Regression
Regression analysis is a statistical technique that is very useful for exploring the relationships
between two or more variables.

7.2.1 The Simple Linear Regression Model


The case of simple linear regression considers a single regressor or predictor x and a
dependent or response variable Y.

Suppose that the true relationship between Y and x is a straight line and that the observation Y at
each level of x is a random variable.

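The statistical model for simple linear regression is given by

$$Y = \beta_0 + \beta_1 x + \epsilon$$

where β0 is the intercept, β1 is the slope, and ε is a random error term with mean zero.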

7.2.1.1 Least Squares and the Fitted Model
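
A minimal sketch of fitting a line by least squares, assuming the usual estimates: slope b1 = Sxy / Sxx and intercept b0 = ȳ - b1·x̄, where Sxy = Σ(xi - x̄)(yi - ȳ) and Sxx = Σ(xi - x̄)². The data and the function name are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (intercept b0, slope b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
# Predicted value of y at x = 6 from the fitted line.
print("prediction at x = 6:", b0 + b1 * 6)
```

The fitted line always passes through the point (x̄, ȳ), which is a quick check on hand calculations.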

Example: Fit the simple linear regression model for the following data

Solution:

Example: Given the following summary

Solution:

Chapter Eight
Introduction to Elementary Probability

1.1. Counting methods

Definition: If n is a natural number, the symbol n! (read "n factorial") denotes the product of all
the natural numbers from n down to 1. If n = 1, this formula is understood to give 1! = 1.

The definition of n! can be used to show that n * (n - 1)! = n! for all natural numbers n ≥ 1. So, for
example, 7! = 7 * 6!.
This fact can be extended to see that n! = n * (n - 1) * (n - 2)! for all natural numbers n ≥ 2,
or n! = n * (n - 1) * (n - 2) * (n - 3)! for all natural numbers n ≥ 3, and so on.

For example, 10! = 10 * 9 * 8 * 7! = 10 * 9 * 8 * 7 * 6 * 5!.
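
The recursive property n! = n * (n - 1)! translates directly into code. A minimal sketch (the function is written out for illustration; `math.factorial` in the standard library does the same job):

```python
import math

def factorial(n):
    """n! for a natural number n, using n! = n * (n - 1)! with 1! = 1."""
    if n < 1:
        raise ValueError("n must be a natural number (n >= 1)")
    return 1 if n == 1 else n * factorial(n - 1)

print("7! =", factorial(7))
print("7 * 6! =", 7 * factorial(6))
print("10! =", factorial(10), "=", 10 * 9 * 8 * factorial(7))
print("matches math.factorial:", factorial(10) == math.factorial(10))
```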

Permutation

Suppose 4 pictures are to be arranged from left to right on one wall of an art gallery. How many
arrangements are possible? Using the multiplication principle, there are 4 ways of selecting the
first picture. After the first picture is selected, there are 3 ways of selecting the second picture.
After the first 2 pictures are selected, there are 2 ways of selecting the third picture. And after the
first 3 pictures are selected, there is only 1 way to select the fourth. Thus, the number of
arrangements possible for the 4 pictures is 4 * 3 * 2 * 1 = 4! = 24.

In general, we refer to a particular arrangement, or ordering, of n objects without repetition as a
permutation of the n objects. How many permutations of n objects are there? From the reasoning
above, there are n ways in which the first object can be chosen, there are n - 1 ways in which the
second object can be chosen, and so on. Applying the multiplication principle, we have Theorem
1:

Theorem 1: Permutations of n Objects

The number of permutations of n objects, denoted by Pn,n, is given by


Pn,n= n *(n -1) *. . . * 1 = n!
Now suppose the director of the art gallery decides to use only 2 of the 4 available pictures on the
wall, arranged from left to right. How many arrangements of 2 pictures can be formed from the 4?
There are 4 ways the first picture can be selected.
After selecting the first picture, there are 3 ways the second picture can be selected. Thus, the number
of arrangements of 2 pictures from 4 pictures, denoted by P4,2, is given by P4,2 = 4 * 3 = 12.

Example 1: From a committee of 8 people, in how many ways can we choose a chair and a vice
chair, assuming one person cannot hold more than one position?
Solution: We are actually asking for the number of permutations of 8 objects taken 2 at a
time, that is, P8,2 = 8 * 7 = 56.
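
The permutation counts in this section can be checked in code. This sketch assumes the standard formula Pn,r = n! / (n - r)! for the number of permutations of n objects taken r at a time; the picture labels are hypothetical.

```python
import math
from itertools import permutations

def count_permutations(n, r):
    """Number of ordered arrangements of r objects chosen from n, n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)

print("P4,4 =", count_permutations(4, 4))   # all 4 pictures on the wall
print("P4,2 =", count_permutations(4, 2))   # only 2 of the 4 pictures
print("P8,2 =", count_permutations(8, 2))   # chair and vice chair from 8 people

# Listing the arrangements explicitly for the 2-of-4 picture case.
pictures = ["A", "B", "C", "D"]
arrangements = list(permutations(pictures, 2))
print(len(arrangements), "arrangements, e.g.", arrangements[:3])
```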

Combination

A combination of a set of n objects taken r at a time is an r-element subset of the n objects. The
number of combinations of n objects taken r at a time, 0 ≤ r ≤ n, denoted by Cn,r, can be obtained
by solving for Cn,r in the relationship Pn,r = r! * Cn,r, which gives Cn,r = Pn,r / r! = n! / (r! (n - r)!).
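
A minimal sketch relating combinations to permutations, assuming the standard relationship Pn,r = r! * Cn,r (each r-element subset can be ordered in r! ways). The picture labels are hypothetical.

```python
import math
from itertools import combinations

def count_combinations(n, r):
    """Number of r-element subsets of n objects, n! / (r! * (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

n, r = 4, 2
c = count_combinations(n, r)
p = math.factorial(n) // math.factorial(n - r)
print("C4,2 =", c, "and P4,2 =", p, "= 2! * C4,2 =", math.factorial(r) * c)

# Order does not matter: ("A", "B") is listed once, not also as ("B", "A").
subsets = list(combinations(["A", "B", "C", "D"], 2))
print(subsets)
```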

1.2. Definition of Probability

Probability is:
• A quantitative measure of uncertainty
• A measure of the strength of belief in the occurrence of an uncertain event
• A measure of the degree of chance or likelihood of occurrence of an uncertain event
• Measured by a number between 0 and 1 (or between 0% and 100%)

For any two events A and B:
• P(A) = P(A ∩ B) + P(A ∩ B′)
• P(B) = P(A ∩ B) + P(A′ ∩ B)

Example: If P(A) = 0.3, P(B) = 0.2 and P(A ∩ B) = 0.1, then determine
a) P(A′)  b) P(A ∪ B)  c) P(A′ ∩ B)  d) P(A ∩ B′)  e) P((A ∪ B)′)  f) P(A′ ∪ B)

Solution
a) P(A′) = 1 - P(A) = 0.7
b) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.3 + 0.2 - 0.1 = 0.4
c) P(A′ ∩ B) + P(A ∩ B) = P(B). Therefore, P(A′ ∩ B) = 0.2 - 0.1 = 0.1
d) P(A) = P(A ∩ B) + P(A ∩ B′). Therefore, P(A ∩ B′) = 0.3 - 0.1 = 0.2
e) P((A ∪ B)′) = 1 - P(A ∪ B) = 1 - 0.4 = 0.6
f) P(A′ ∪ B) = P(A′) + P(B) - P(A′ ∩ B) = 0.7 + 0.2 - 0.1 = 0.8
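
The six answers above can be reproduced with exact fractions, which avoids floating-point rounding. A minimal sketch; the variable names are chosen only for illustration.

```python
from fractions import Fraction

# Given probabilities from the example.
p_a = Fraction(3, 10)
p_b = Fraction(2, 10)
p_a_and_b = Fraction(1, 10)

p_not_a = 1 - p_a                                 # a) complement rule
p_a_or_b = p_a + p_b - p_a_and_b                  # b) addition rule
p_not_a_and_b = p_b - p_a_and_b                   # c) P(B) = P(A and B) + P(A' and B)
p_a_and_not_b = p_a - p_a_and_b                   # d) P(A) = P(A and B) + P(A and B')
p_neither = 1 - p_a_or_b                          # e) complement of the union
p_not_a_or_b = p_not_a + p_b - p_not_a_and_b      # f) addition rule applied to A' and B

for label, value in [("a", p_not_a), ("b", p_a_or_b), ("c", p_not_a_and_b),
                     ("d", p_a_and_not_b), ("e", p_neither), ("f", p_not_a_or_b)]:
    print(label, float(value))
```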

1.3 Conditional Probability

• Conditional Probability - Probability of A given B

P(A | B) = P(A ∩ B) / P(B), where P(B) > 0

This definition can be understood in a special case in which all outcomes of a random
experiment are equally likely.

Example 2.5: A day’s production of 850 manufactured parts contains 50 parts that do not meet
customer requirements. Two parts are selected randomly without replacement from the batch.
What is the probability that the second part is defective given that the first part is defective?
Let A denote the event that the first part selected is defective, and let B denote the event that the
second part selected is defective. The probability needed can be expressed as P(B | A). If the first
part is defective, then prior to selecting the second part the batch contains 849
parts, of which 49 are defective. Therefore, P(B | A) = 49/849 ≈ 0.058.

Continuing the previous example, if three parts are selected at random, what is
the probability that the first two are defective and the third is not defective? Multiplying the
successive conditional probabilities gives P(ddn) = (50/850) * (49/849) * (800/848) ≈ 0.0032.
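
The value P(ddn) = 0.0032 follows from multiplying conditional probabilities for sampling without replacement, as this short sketch shows. The variable names are chosen for illustration.

```python
from fractions import Fraction

total_parts = 850
defective = 50
good = total_parts - defective  # 800 parts meet requirements

# P(second defective | first defective): one defective part already removed.
p_second_given_first = Fraction(defective - 1, total_parts - 1)

# P(first defective) * P(second defective | first) * P(third good | first two defective)
p_ddn = (Fraction(defective, total_parts)
         * Fraction(defective - 1, total_parts - 1)
         * Fraction(good, total_parts - 2))

print("P(B | A) =", round(float(p_second_given_first), 4))
print("P(ddn)   =", round(float(p_ddn), 4))
```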

1.3 Multiplication and Total Probability rule

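The multiplication rule: P(A ∩ B) = P(B | A) P(A) = P(A | B) P(B)

The law of total probability (for two events): P(B) = P(B ∩ A) + P(B ∩ A′) = P(B | A) P(A) + P(B | A′) P(A′)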

1.4. Addition rule
Joint events are generated by applying basic set operations to individual events. Unions of
events, such as A ∪ B; intersections of events, such as A ∩ B; and complements of events, such as
A′, are commonly of interest. The probability of a joint event can often be determined from the
probabilities of the individual events that comprise it. Basic set operations are also sometimes
helpful in determining the probability of a joint event. In this section the focus is on unions of
events.

The probability of A or B is interpreted as P(A ∪ B), and the following general addition rule
applies:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

EXAMPLE:

The wafers are classified as either in the “center” or at the “edge” of the sputtering tool that was
used in manufacturing, and by the degree of contamination. Table 2-2 shows the proportion of
wafers in each category.

a) What is the probability that a wafer was either at the edge or that it contains four or more
particles?

b) What is the probability that a wafer contains less than two particles or that it is both at the
edge and contains more than four particles?

Solution
a) Let E1 denote the event that a wafer contains four or more particles, and
let E2 denote the event that a wafer is at the edge.
The requested probability is P(E1 ∪ E2). Now P(E1) = 0.15 and P(E2) = 0.28. Also, from the table, P(E1 ∩ E2) = 0.04.
Therefore, using the general addition rule, we find that P(E1 ∪ E2) = 0.15 + 0.28 - 0.04 = 0.39.

b) Part (b) is left as an exercise.

Chapter Nine
Continuous Probability Distributions

A continuous random variable is one that can assume an uncountably infinite number of
values.
We cannot list the possible values because there are infinitely many of them. For the same
reason, the probability of each individual value is 0.
A probability density function f(x) can be used to describe the probability distribution of a
continuous random variable X. If an interval is likely to contain a value for X, its probability is
large and it corresponds to large values for f(x). The probability that X is between a and b is
determined as the integral of f(x) from a to b.
Definition: For a continuous random variable X, a probability density function is a function
such that

(1) $f(x) \ge 0$
(2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
(3) $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$ for any a and b

If X is a continuous random variable, then for any x1 and x2

$$P(x_1 \le X \le x_2) = P(x_1 < X \le x_2) = P(x_1 \le X < x_2) = P(x_1 < X < x_2)$$

Example: Suppose that f(x) = x/8 for 3 < x < 5. Answer the following questions.

a) Verify that f(x) is a density function.
b) Find P(X < 4)
c) Find P(4 < X < 5)
d) Find P(X < 3.5 or X > 4.5)
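
The four parts of this example can be worked with the antiderivative of f(x) = x/8, which is x²/16. The sketch below assumes f(x) = 0 outside the interval 3 < x < 5; the function names are hypothetical.

```python
LOWER, UPPER = 3, 5  # f(x) = x / 8 on this interval, 0 elsewhere (assumed)

def antiderivative(x):
    """Antiderivative of f(x) = x / 8."""
    return x ** 2 / 16

def prob(a, b):
    """P(a < X < b): integral of f over the part of (a, b) inside (3, 5)."""
    lo, hi = max(a, LOWER), min(b, UPPER)
    return antiderivative(hi) - antiderivative(lo) if hi > lo else 0.0

total = prob(LOWER, UPPER)                    # a) must equal 1 for a valid density
p_less_4 = prob(LOWER, 4)                     # b) P(X < 4)
p_4_to_5 = prob(4, UPPER)                     # c) P(4 < X < 5)
p_tails = prob(LOWER, 3.5) + prob(4.5, UPPER) # d) disjoint events, so probabilities add

print("total area:", total)
print("P(X < 4):", p_less_4)
print("P(4 < X < 5):", p_4_to_5)
print("P(X < 3.5 or X > 4.5):", p_tails)
```

Since f(x) = x/8 is positive on the interval and the total area is (25 - 9)/16 = 1, f is a valid density. The remaining answers are 7/16, 9/16, and 1/2.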

Example: Let the continuous random variable X denote the diameter of a hole drilled in a sheet
metal component. The target diameter is 12.5 millimeters. Most random disturbances to the
process result in larger diameters. Historical data show that the distribution of X can be modeled
by a probability density function.

If a part with a diameter larger than 12.60 millimeters is scrapped, what proportion of parts is
scrapped?

Exercise: Suppose that X is a continuous random variable whose probability density function is given
by

a) What is the value of C?

b) Find P(X > 1)
c) Find P(0 < X < 1)
d) Find P(X < 0)

9.1 Cumulative distribution functions

Definition: The cumulative distribution function of a continuous random variable X is

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du, \quad -\infty < x < \infty$$

Example: The time until a computer software process is complete (in milliseconds) is approximated
by the cumulative distribution function. Determine the probability density function of X. What
proportion of processes is complete within 200 milliseconds?

Solution: Using the result that the probability density function is the derivative of the F(x), we
obtain

The probability that a process completes within 200 milliseconds is

9.2 Mean and Variance of a continuous random variable


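The mean and variance of a continuous random variable are defined similarly to those of a discrete
random variable. Integration replaces summation in the definitions.

Definition: Suppose X is a continuous random variable with probability density function f(x).
The mean or expected value of X, denoted as μ or E(X), is

$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

The variance of X, denoted as σ² or V(X), is

$$\sigma^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$$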

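Example: Let the continuous random variable X denote the current measured in a thin copper
wire in milliamperes. Assume that the range of X is [0, 20 mA], and assume that the probability
density function of X is f(x) = 0.05 for 0 ≤ x ≤ 20. Find the mean and variance of X.

Solution: The mean of X is

$$E(X) = \int_{0}^{20} x\,(0.05)\,dx = \left. \frac{0.05\,x^2}{2} \right|_{0}^{20} = 10$$

The variance of X is

$$V(X) = \int_{0}^{20} (x - 10)^2 (0.05)\,dx = \left. \frac{0.05\,(x - 10)^3}{3} \right|_{0}^{20} = 33.33$$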
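
The mean and variance in this example can also be approximated numerically, which shows how integration takes the place of summation. This is a minimal sketch using a midpoint Riemann sum; the helper `integrate` is hypothetical, and the density is taken as f(x) = 0.05 on [0, 20] as stated in the example.

```python
def integrate(g, a, b, steps=100_000):
    """Approximate the integral of g over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width

def f(x):
    """Density of the current X: constant 0.05 on [0, 20]."""
    return 0.05

mean = integrate(lambda x: x * f(x), 0, 20)
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0, 20)

print("mean     =", round(mean, 2), "mA")
print("variance =", round(variance, 2), "mA^2")
```

The results agree with the exact values of 10 mA for the mean and 100/3 ≈ 33.33 mA² for the variance.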

9.3 Types of continuous Distribution


9.3.1 Exponential Distribution

It is important to use consistent units in the calculation of probabilities, means, and variances
involving exponential random variables.

Example: The time between calls to a software developing business is exponentially distributed
with a mean time between calls of 15 minutes.
a) What is the probability that there are no calls within a 30-minute interval?
b) What is the probability that at least one call arrives within a 10-minute interval?
c) What is the probability that the first call arrives between 5 and 10 minutes after opening?
d) Determine the length of an interval of time such that the probability of at least one call in the
interval is 0.90.

Solution: Let X denote the time until the first call, in minutes. Then X is exponential with
rate λ = 1/E(X) = 1/15 calls per minute.
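
The four parts can be computed as follows. This sketch assumes the standard exponential cumulative distribution function F(x) = 1 - e^(-λx) for x ≥ 0, with λ = 1/15 per minute from the stated mean of 15 minutes; all times are kept in minutes so that the units stay consistent.

```python
import math

rate = 1 / 15  # calls per minute, from a mean time between calls of 15 minutes

def cdf(x):
    """P(X <= x) for an exponential random variable with the rate above."""
    return 1 - math.exp(-rate * x)

p_no_call_30 = 1 - cdf(30)            # a) P(X > 30)
p_call_within_10 = cdf(10)            # b) P(X <= 10)
p_between_5_10 = cdf(10) - cdf(5)     # c) P(5 < X < 10)
interval_90 = -math.log(1 - 0.90) / rate  # d) solve 1 - exp(-rate * x) = 0.90 for x

print("a)", round(p_no_call_30, 4))
print("b)", round(p_call_within_10, 4))
print("c)", round(p_between_5_10, 4))
print("d)", round(interval_90, 2), "minutes")
```

The answers are approximately 0.135, 0.487, 0.203, and 34.5 minutes.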

9.3.2 Normal Distribution
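The normal distribution is the most important of all probability distributions. The probability
density function of a normal random variable with mean μ and variance σ² is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$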

Example: Assume that the current measurements in a strip of wire follow a normal distribution
with a mean of 10 milliamperes and a variance of 4 (milliamperes)². What is the probability that
a measurement exceeds 13 milliamperes?

Solution: Let X denote the current in milliamperes. The requested probability can be represented
as p (x > 13). This probability is shown as the shaded area under the normal probability density
function in the next figure

Creating a new random variable by the transformation Z = (X - μ)/σ is referred to as standardizing. The
random variable Z represents the distance of X from its mean in terms of standard deviations.
Standardizing is the key step in calculating a probability for an arbitrary normal random variable.
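
Standardizing can be applied to the wire current example (mean 10 milliamperes, variance 4, so σ = 2). A minimal sketch using Python's `statistics.NormalDist` in place of a printed standard normal table:

```python
import math
from statistics import NormalDist

mu = 10                # mean current in milliamperes
sigma = math.sqrt(4)   # standard deviation is the square root of the variance

# Standardize: Z = (X - mu) / sigma measures distance from the mean in SD units.
z = (13 - mu) / sigma

# P(X > 13) = P(Z > z) = 1 - Phi(z), where Phi is the standard normal CDF.
p_exceeds = 1 - NormalDist().cdf(z)

# The same probability computed directly, without standardizing by hand.
p_direct = 1 - NormalDist(mu, sigma).cdf(13)

print("z =", z)
print("P(X > 13) =", round(p_exceeds, 4))
```

Here z = 1.5, and the probability that a measurement exceeds 13 milliamperes is about 0.067.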

Example:- The following calculations are shown pictorially below. In practice, a probability is
often rounded to one or two significant digits.
