Session 4 Summary


Descriptive Statistics

Descriptive statistics involves summarizing and describing data using numerical measures, tables, and graphs. It focuses on organizing and presenting data in a meaningful way to gain insights and understand the characteristics of the data set. Descriptive statistics provide a snapshot of the data and are typically used to describe and analyze a sample or a population.

Descriptive Statistics ~ Example

Let's say a biostatistician is studying the heights of a group of individuals. They collect data on the heights of 100 people in a population. To summarize and describe this data, the biostatistician calculates various descriptive statistics such as the mean height, median height, range, and standard deviation. They might also create a histogram or a box plot to visualize the distribution of heights in the population.

Regression includes
ANOVA
Confidence intervals for both the slope and the Y intercept

Y intercept = 12.5
Slope = 3.12

Y hat = 12.5 + 3.12*X

Y hat = 13.6 + 4.25*X

95% CI for the real Y intercept: 9.87 to 14.8
95% CI for the real slope: 1.11 to 5.26

Difference: PA - PB or PB - PA
PB - PA = 3.50% = 0.0350 (decimal format)

2 proportions confidence interval formula
(p1 - p2) ± Z * sqrt(p1*q1/n1 + p2*q2/n2)

P1 = PB = 34.3%    q1 = (1 - p1) = 65.7%
P2 = PA = 30.8%

Z critical value for 90%, 95% and 99%
Confidence level    Z value
90%                 1.645
95%                 1.96
99%                 2.576

For 95% CI ~ Z = 1.96
95% CI upper limit    0.159565
95% CI lower limit    -0.089565
95% CI = [-0.0896, 0.1596]
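
The interval above can be reproduced with a short sketch. The proportions are the rounded values given in the notes; the group sizes n1 = 70 and n2 = 260 are an assumption, taken from the row totals of the observed-frequency table in the chi-square worksheet. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(p1, n1, p2, n2, confidence=0.95):
    """Large-sample (Wald) confidence interval for p1 - p2."""
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    # Standard error of the difference of two independent proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se


# Assumed sample sizes: 70 (group B) and 260 (group A)
lower, upper = two_proportion_ci(0.343, 70, 0.308, 260)
print(f"95% CI = [{lower:.4f}, {upper:.4f}]")
```

Because the interval contains zero, it is consistent with no difference between the two proportions at the 5% level.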

CI difference of population proportions (big samples) online calculator
https://www.statology.org/confidence-interval-difference-in-proportions-calculator/
P-value = P difference = 0.5740331

does it include zero value?


0.0833333
0.05
None

Observed Frequencies
                Column variable
Row variable    C1           C2           Total
R1              80           180          260
R2              24           46           70
Total           104          226          330

Calculations: fo-fe
Row variable    C1           C2
R1              -1.939394    1.9393939
R2              1.9393939    -1.939394

Expected Frequencies
                Column variable
Row variable    C1           C2           Total
R1              81.939394    178.06061    260
R2              22.060606    47.939394    70
Total           104          226          330

Calculations: (fo-fe)^2/fe
Row variable    C1           C2
R1              0.0459028    0.0211234
R2              0.1704962    0.0784584

Data
Level of Significance 0.05
Number of Rows 2
Number of Columns 2
Degrees of Freedom 1

Results
Critical Value 3.8414588
Chi-Square Test Statistic 0.3159808
p-Value 0.5740331
Do not reject the null hypothesis

Expected frequency assumption is met.
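
As a cross-check, the worksheet results can be recomputed from the observed counts. This is a minimal sketch using only the Python standard library; the two shortcuts for the critical value and the p-value are valid only for 1 degree of freedom, as in this 2x2 table, and the variable names are illustrative.

```python
from math import erfc, sqrt
from statistics import NormalDist

observed = [[80, 180], [24, 46]]  # rows R1, R2; columns C1, C2

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (fo - fe)^2 / fe over all cells
chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

# With 1 degree of freedom, a chi-square variable is a squared standard
# normal, so both quantities follow from the normal distribution.
alpha = 0.05
critical_value = NormalDist().inv_cdf(1 - alpha / 2) ** 2
p_value = erfc(sqrt(chi_square / 2))

reject_h0 = chi_square > critical_value
print(chi_square, critical_value, p_value, reject_h0)
```

The statistic is far below the critical value, which matches the worksheet's decision not to reject the null hypothesis.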
Inferential Stats

Inferential statistics involves drawing conclusions and making inferences about a population based on the analysis of a sample. It uses probability theory and statistical techniques to generalize findings from a sample to a larger population. Inferential statistics helps researchers make predictions, test hypotheses, and make informed decisions about the population using the information obtained from a representative sample.

Inferential Stats ~ Example

Suppose a pharmaceutical company develops a new drug to treat a specific medical condition. They conduct a randomized controlled trial (RCT) where they administer the drug to a randomly selected group of patients and compare it with a control group receiving a placebo. After collecting data on the outcomes and analyzing the results, inferential statistics can be used to determine if the observed differences in outcomes between the two groups are statistically significant. This allows researchers to make inferences about the effectiveness of the drug in the larger population and draw conclusions regarding its efficacy.

Normal Probability Distribution

The normal probability distribution, also known as the Gaussian distribution or bell curve, is one of the most important and widely used probability distributions in statistics. It is characterized by its symmetric, bell-shaped curve.

The probability density function (PDF) is a mathematical function that describes the shape of the normal distribution. The PDF of the normal distribution is given by the following formula:

f(x) = (1 / (σ √(2π))) * e^(-(x - μ)² / (2σ²))

In this formula:
x is a value of the random variable.
μ (mu) is the mean or the center of the distribution.
σ (sigma) is the standard deviation, which determines the spread or variability of the distribution.
π (pi) is a mathematical constant, approximately equal to 3.14159.
e is the base of the natural logarithm, approximately equal to 2.71828.

The normal distribution is symmetric around the mean (μ), meaning that the probability is evenly distributed on both sides of the mean. The standard deviation (σ) determines the width of the distribution. A smaller standard deviation indicates a narrower and taller curve, while a larger standard deviation results in a wider and flatter curve.

Normal Probability Distribution Properties

* The total area under the curve is equal to 1, representing the probability of all possible outcomes.

* The mean, median, and mode of the normal distribution are all equal and located at the center of the distribution.

* The curve is symmetric, with half of the area lying to the left of the mean and the other half to the right.

* The standard deviation controls the spread of the distribution, with about 68% of the data falling within one standard deviation of the mean (in the interval [μ - σ, μ + σ]), approximately 95% within two standard deviations (in the interval [μ - 2σ, μ + 2σ]), and around 99.7% within three standard deviations (in the interval [μ - 3σ, μ + 3σ]).

The normal distribution has numerous applications in various fields, such as statistics, economics, engineering, and natural sciences. It is commonly used for modeling continuous random variables, estimating probabilities, and conducting statistical inference.

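
The 68%, 95% and 99.7% figures can be checked numerically. A minimal sketch with the standard library's `statistics.NormalDist`, using the standard normal (μ = 0, σ = 1) as an illustrative choice; the percentages are the same for any μ and σ.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```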
Normal Probability Distribution Applications in Biostats

Modeling Biological Phenomena: Many biological measurements and characteristics, such as height, weight, blood pressure, enzyme activity, and gene expression levels, tend to follow a normal distribution. By assuming a normal distribution, researchers can model and analyze these variables in biostatistical studies.

Hypothesis Testing: In biostatistics, hypothesis testing is often used to determine whether a specific treatment or intervention has a significant effect on a biological outcome. The normal distribution is frequently used to model the sampling distribution of test statistics, such as t-tests and z-tests, which are employed to test hypotheses about means, proportions, or other parameters.

Confidence Intervals: The estimation of population parameters, such as means or proportions, is a common task in biostatistics. Confidence intervals provide a range of plausible values for a population parameter, along with an associated level of confidence. When sample sizes are large or the sampling distribution is approximately normal, confidence intervals are often based on the normal distribution.

Power and Sample Size Calculations: Power analysis is used in biostatistics to determine the required sample size for a study to achieve a desired level of statistical power. Statistical power refers to the ability of a study to detect a true effect when it exists. The normal distribution is often used to estimate the variability of the outcome and calculate the necessary sample size for different effect sizes and levels of significance.

Regression Analysis: In regression analysis, the normal distribution plays a crucial role. Linear regression models, for example, often assume that the errors or residuals follow a normal distribution with constant variance. This assumption allows for the calculation of confidence intervals and hypothesis tests for the regression coefficients.

Point Estimate

A point estimate is a single value that is used to estimate an unknown population parameter based on sample data. It provides an estimate of the parameter without accounting for the uncertainty or variability associated with the estimate. A point estimate is a best guess or approximation of the true population value based on the available sample information.

A point estimate is a single value used to estimate a population parameter.

Point Estimate Example

Suppose you want to estimate the average height of adults in a certain population. You collect a sample of 100 individuals and calculate the mean height of the sample. This sample mean serves as a point estimate of the population mean height.

Interval Estimate

An interval estimate, also known as a confidence interval, is a range of values that is used to estimate an unknown population parameter based on sample data. It takes into account the variability and uncertainty associated with the estimate by providing a range of plausible values rather than a single point. The interval estimate comes with a level of confidence that the true population parameter falls within the given range.

An interval estimate provides a range of values that estimate the parameter, accounting for uncertainty and variability.

Interval Estimate Example

An interval estimate could be a confidence interval for the average height of the population. For instance, a 95% confidence interval might be (165 cm, 175 cm), indicating that we are 95% confident that the true average height of the population lies between 165 cm and 175 cm based on the sample data and the chosen level of confidence.

Calculate the sample average

Xi: -5, 0, -10, -15, 5, -5, -5, 0, 5, -10
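
A minimal sketch of the requested calculation with the ten Xi values listed above. The sample standard deviation is included only as an extra descriptive measure; the notes ask for the average alone.

```python
from statistics import mean, stdev

xi = [-5, 0, -10, -15, 5, -5, -5, 0, 5, -10]

x_bar = mean(xi)   # sum of the observations divided by the sample size
s = stdev(xi)      # sample standard deviation (n - 1 in the denominator)

print(f"n = {len(xi)}, sample mean = {x_bar}, sample sd = {s:.3f}")
```

The values sum to -40, so the sample mean is -40 / 10 = -4.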
Estimation of μ

Estimation of μ refers to the process of estimating or determining the unknown population mean (μ) based on sample data. In statistics, μ represents the true average or mean of a particular variable in the entire population.

When we have a sample from a population, we can use statistical methods to estimate the population mean. Estimating μ involves using the sample mean (x̄) as a point estimate of the population mean. The sample mean is calculated by summing the values of the observations in the sample and dividing the sum by the sample size.

The process of estimation involves using the sample mean as an approximation or best guess of the unknown population mean. The quality of the estimate depends on the representativeness and size of the sample, as well as the variability within the population.

Estimation of μ ~ what is the purpose?

In practical terms, estimating μ allows us to make inferences about the characteristics of a population based on the information available in the sample. For example, if we want to estimate the average income of all adults in a country, we can take a random sample of individuals, record their incomes, and use the sample mean as an estimate of the population mean income.

What should we consider?

It's important to note that the estimation is subject to sampling variability, meaning that different samples from the same population will yield slightly different estimates. That is why confidence intervals are constructed: they quantify the uncertainty associated with the estimate and provide a range of plausible values for the population mean.

Central Limit Theorem

The Central Limit Theorem (CLT) is a fundamental result in statistics that states that when independent random variables are added together, their sum tends to follow a normal distribution, regardless of the shape of the original variables' distributions, under certain conditions.

More specifically, the central limit theorem states that the sampling distribution of the sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables with finite variance approaches a normal distribution as the sample size increases, regardless of the shape of the original population distribution.

In simpler terms

In simpler terms, the central limit theorem tells us that if we take many samples of a certain, sufficiently large size from any population (regardless of its distribution shape), calculate the means of those samples, and plot a histogram of those means, the resulting distribution will be approximately normal.

Central Limit Theorem ~ why is it widely used?

The central limit theorem is widely used in statistical inference because it allows us to make assumptions about the sampling distribution and perform hypothesis tests, construct confidence intervals, and estimate parameters, even when the original population distribution is unknown or non-normal.

A common rule of thumb

It is important to note that the central limit theorem holds for sufficiently large sample sizes. The exact conditions for its application depend on the characteristics of the underlying population distribution, but a commonly used rule of thumb is that a sample size of at least 30 is often sufficient for the central limit theorem to be applicable.

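
The "many samples, many means" description can be simulated directly. The sketch below is illustrative only: the exponential population (mean 1, standard deviation 1), the seed, and the number of repetitions are assumptions chosen for the demonstration, and the sample size of 30 follows the rule of thumb.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
sample_size = 30         # rule-of-thumb sample size
num_samples = 5000       # number of repeated samples (assumed)

# Draw repeated samples from a strongly skewed population and keep each mean
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

# The CLT predicts the means cluster around the population mean (1.0)
# with standard deviation sigma / sqrt(n) = 1 / sqrt(30)
print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"sd of sample means:   {stdev(sample_means):.3f}")
print(f"predicted sd:         {1 / sqrt(sample_size):.3f}")
```

Plotting a histogram of `sample_means` would show the approximately bell-shaped curve even though the individual observations are heavily right-skewed.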
Online calculator: https://onlinestatbook.com/2/calculators/normal_dist.html

Using table from - infinity to Z


0.5
0.975
0.025
0.95
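
The same kind of lookup can be done in code instead of a table or the online calculator. The z-values 0 and 1.96 used below are an assumption: they are not written next to the four numbers, but they reproduce them after rounding.

```python
from statistics import NormalDist

z_dist = NormalDist()  # standard normal: mean 0, standard deviation 1

p_below_zero = z_dist.cdf(0)                        # area from -infinity to 0
p_below_196 = z_dist.cdf(1.96)                      # area from -infinity to 1.96
p_above_196 = 1 - p_below_196                       # upper tail beyond 1.96
p_between = z_dist.cdf(1.96) - z_dist.cdf(-1.96)    # central area

for label, p in [
    ("P(Z <= 0)", p_below_zero),
    ("P(Z <= 1.96)", p_below_196),
    ("P(Z > 1.96)", p_above_196),
    ("P(-1.96 <= Z <= 1.96)", p_between),
]:
    print(f"{label} = {p:.3f}")
```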
95% C.I. for μ: [0.04, 3.96]

How do we interpret this C.I.?

It means that, based on the sample data and statistical calculations, we are 95% confident that the true population mean (μ) lies within the interval [0.04, 3.96].

What can we learn from a C.I.?

Confidence Level: The confidence level associated with the interval is 95%. This means that if we were to repeat the sampling process multiple times and construct 95% confidence intervals, approximately 95% of those intervals would capture the true population mean.

Range of Plausible Values: The interval [0.04, 3.96] provides a range of plausible values for the population mean μ. It suggests that, with 95% confidence, the true population mean falls somewhere between 0.04 and 3.96.

Precision of the Estimate: The width of the confidence interval reflects the precision of the estimate. In this case, the interval has a width of 3.92 (3.96 - 0.04), indicating that the estimate is relatively imprecise. A narrower interval would indicate a more precise estimate.

No Guarantee about a Specific Value: It's important to note that the confidence interval does not make a definitive statement about the true population mean. It provides a range of plausible values within which the true population mean is likely to lie, but it does not single out a specific value.

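
The interval [0.04, 3.96] can be taken apart to see what it implies. This is a sketch under the assumption that it is a z-based interval of the form x̄ ± Z * SE; the notes do not state how it was built.

```python
from statistics import NormalDist

lower, upper = 0.04, 3.96

center = (lower + upper) / 2   # point estimate at the middle of the interval
margin = (upper - lower) / 2   # margin of error (half-width)
width = upper - lower          # full width, used as a measure of precision

# Assumption: margin = Z * SE with the two-sided 95% critical value
z = NormalDist().inv_cdf(0.975)
implied_se = margin / z

print(f"center = {center:.2f}, margin = {margin:.2f}, width = {width:.2f}")
print(f"implied standard error = {implied_se:.3f}")
```

Under that assumption the interval corresponds to a sample mean of about 2 with a standard error of about 1.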
Relationship between CI and Sample Size

Inverse Relationship: Generally, there is an inverse relationship between the width of a confidence interval and the sample size. As the sample size increases, the width of the confidence interval tends to decrease. In other words, larger sample sizes tend to result in narrower confidence intervals.

Increased Precision: A larger sample size provides more information about the population, leading to increased precision in estimating the parameter of interest. With a larger sample, the estimate of the population parameter becomes more reliable, reducing the variability and resulting in a narrower confidence interval.

More Certainty: With a larger sample size, the interval is narrower at the same confidence level. This means that the range of plausible values for the population parameter becomes more focused, providing more certainty about where the true parameter lies.

Trade-off with Cost and Resources: While increasing the sample size generally leads to narrower confidence intervals and increased precision, it often requires more resources, time, and effort. Collecting a larger sample may involve increased costs and logistical challenges. Therefore, the decision regarding sample size should consider the trade-off between precision and the available resources.

It's important to note that the relationship between the confidence interval and sample size is not linear. The effect of increasing the sample size on the width of the confidence interval diminishes as the sample size becomes larger. For very large sample sizes, additional increases in the sample size have a minimal impact on further narrowing the confidence interval.

Length of interval    Standardized Precision
1.24                  1
0.392                 3.1612903
0.124                 10
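
The three interval lengths in the table are consistent with a 95% z-interval for a mean with σ = 1 and sample sizes of 10, 100 and 1000. Those values of σ and n are an assumption, since the notes do not state them; the sketch shows how the width shrinks with the square root of n rather than linearly.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(n, sigma=1.0, confidence=0.95):
    """Full width of a z-based confidence interval for a mean: 2 * Z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sigma / sqrt(n)


# Assumed sample sizes; each tenfold increase narrows the interval
# by a factor of sqrt(10), about 3.16
for n in (10, 100, 1000):
    print(f"n = {n:>4}: width = {ci_width(n):.3f}")
```

A hundredfold increase in sample size is needed to make the interval ten times narrower, which is the diminishing return described above.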
Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves formulating two competing hypotheses, known as the null hypothesis (H₀) and the alternative hypothesis (H₁ or Ha), and then assessing whether the data provide enough evidence to reject the null hypothesis in favor of the alternative.

Null Hypothesis

The null hypothesis (H₀) is a statement of no effect, no difference, or no relationship between variables in the population. It represents the assumption that any observed differences or associations in the sample are due to random chance or sampling variability. It is assumed to be true until the data provide sufficient evidence against it.

Alternative or alternate hypothesis

The alternative hypothesis (H₁ or Ha) contradicts the null hypothesis and represents the claim or theory we are trying to support. It asserts the presence of an effect, difference, or relationship in the population.

Hypothesis Testing Procedure
The hypothesis testing process involves the following steps:

1 State the hypotheses: Clearly specify the null hypothesis (H₀) and the alternative
hypothesis (H₁).
2 Set the significance level (α): Determine the level of significance, often denoted
as α, which represents the probability of rejecting the null hypothesis when it is
true. Commonly used values for α include 0.05 (5%) and 0.01 (1%).
3 Collect and analyze the data: Gather a sample of data and perform the appropriate
statistical analysis to obtain test statistics or p-values.
4 Determine the critical region: Based on the chosen significance level (α), determine
the critical region or rejection region of the test statistic. This is the range of values
that would lead to the rejection of the null hypothesis.
5 Calculate the test statistic: Compute the test statistic based on the data and the
chosen statistical test (e.g., t-test, chi-square test, etc.).
6 Make a decision: Compare the test statistic to the critical value(s) or use the
p-value to make a decision. If the test statistic falls in the critical region or the
p-value is less than α, reject the null hypothesis. Otherwise, fail to reject the null
hypothesis.
7 Draw conclusions: Based on the decision made, interpret the results and draw
conclusions about the population based on the evidence provided by the data.
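
Steps 4 to 6 can be sketched for one specific case, a two-sided z-test decided with the critical-region approach. The function name and the example z-values are hypothetical; other tests (t-test, chi-square test) would use their own distributions.

```python
from statistics import NormalDist


def z_test_decision(z_stat, alpha=0.05):
    """Two-sided z-test decision using the critical-region approach."""
    # Step 4: the critical region is |Z| > critical value
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 6: compare the test statistic with the critical value
    if abs(z_stat) > critical:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical test statistics, for illustration only
print(z_test_decision(2.5))               # beyond 1.96 at alpha = 0.05
print(z_test_decision(1.0))               # inside the non-rejection region
print(z_test_decision(2.5, alpha=0.01))   # critical value rises to about 2.576
```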
What is the P-value?

"The p-value is the probability of observing the data, or more extreme data, assuming that the null hypothesis is true."

In simpler terms, the p-value quantifies the strength of evidence against the null hypothesis. It tells us how likely it is to obtain the observed data or more extreme data if the null hypothesis is true.

A smaller p-value indicates stronger evidence against the null hypothesis, suggesting that the observed data is unlikely to occur by chance alone. That's why we reject the null hypothesis when the p-value is less than the level of significance.

level of significance = 0.05
p-value = 0.82

We reject H₀ when p-value < level of significance
We do not reject H₀ when p-value ≥ level of significance

Here 0.82 > 0.05, so we do not reject H₀.

What is the level of significance?

The level of significance, denoted as α (alpha), is a pre-determined threshold used in hypothesis testing to determine the criteria for rejecting the null hypothesis. It represents the maximum probability of making a Type I error, which is the incorrect rejection of the null hypothesis when it is actually true.

The most commonly used levels of significance are 0.05 (5%) and 0.01 (1%). Choosing a higher level of significance increases the likelihood of rejecting the null hypothesis, while choosing a lower level of significance requires stronger evidence to reject the null hypothesis.

1) H₀: μ = 0
   Ha: μ ≠ 0 (increase or decrease)

2) Z = 2

one sided p-value
P(Z > 2) = 1 - P(Z <= 2) = 0.0228

two sided p-value
2*P(Z > 2) = 2*(1 - P(Z <= 2)) = 0.0455 (doubling the rounded value 0.0228 gives 0.0456)
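
The two p-values for Z = 2 can be computed without a table. A minimal sketch with the standard library; doubling the unrounded one-sided value gives about 0.0455, slightly below the 0.0456 obtained by doubling the rounded 0.0228.

```python
from statistics import NormalDist

z = 2.0
std_normal = NormalDist()

one_sided = 1 - std_normal.cdf(z)   # P(Z > 2)
two_sided = 2 * one_sided           # P(|Z| > 2), used for Ha: mu != 0

print(f"one-sided p-value = {one_sided:.4f}")
print(f"two-sided p-value = {two_sided:.4f}")
```

At α = 0.05 both p-values fall below the level of significance, so H₀: μ = 0 would be rejected.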
H₀: μ1 = μ2    H₀: μ1 - μ2 = 0    There will be no difference
Ha: μ1 ≠ μ2    Ha: μ1 - μ2 ≠ 0    There will be differences

Confidence Interval Excluding Zero ~ Example

Let's consider a study that investigates the effect of a new drug on blood pressure. A researcher collects data from a sample of individuals and calculates a 95% confidence interval for the mean difference in blood pressure before and after taking the drug. The confidence interval is (-5.2, -1.8).

Since the confidence interval does not include zero, it suggests that there is a statistically significant difference in blood pressure before and after taking the drug.

On the other hand, the interval indicates that, with 95% confidence, the true mean difference lies between -5.2 and -1.8 units.

Confidence Interval Including Zero ~ Example

Suppose another study examines the effect of a dietary supplement on cholesterol levels. The researcher collects data from a sample and calculates a 90% confidence interval for the mean difference in cholesterol levels before and after taking the supplement. The confidence interval is (-0.8, 1.2). In this case, the confidence interval includes zero, indicating that there is not enough evidence to conclude a statistically significant difference in cholesterol levels before and after taking the supplement.

On the other hand, the interval suggests that, with 90% confidence, the true mean difference lies between -0.8 and 1.2, so it could be zero or very close to zero.

μ = 25
Sample 1: x̄ = 25.3
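
The rule used in the two interval examples above (blood pressure and cholesterol) reduces to a single check on the interval limits. A minimal sketch; the function name is hypothetical.

```python
def excludes_zero(lower, upper):
    """Return True when a confidence interval for a difference does not contain zero."""
    return not (lower <= 0 <= upper)


# Intervals quoted in the two examples
blood_pressure_ci = (-5.2, -1.8)   # 95% CI, drug study
cholesterol_ci = (-0.8, 1.2)       # 90% CI, supplement study

for name, (lo, hi) in [("blood pressure", blood_pressure_ci),
                       ("cholesterol", cholesterol_ci)]:
    verdict = "significant difference" if excludes_zero(lo, hi) else "no significant difference"
    print(f"{name}: ({lo}, {hi}) -> {verdict}")
```

The check corresponds to a two-sided test at α = 1 - confidence level, that is 0.05 for the first interval and 0.10 for the second.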
Why do some researchers suggest confidence intervals should be preferred over statistical tests?

Some researchers suggest that confidence intervals should be preferred over statistical tests due to several reasons:

1) Simplicity and Interpretability.
2) Complete Information.
3) Avoiding Arbitrary Cutoffs.
4) Overemphasis on Statistical Significance.
5) Replicability and Reproducibility.

1 Simplicity and Interpretability
Confidence intervals provide a more intuitive and straightforward interpretation compared to p-values and statistical tests. They directly estimate the range of plausible values for the population parameter of interest. This makes it easier for researchers and decision-makers to understand and communicate the uncertainty associated with the estimate.

2 Complete Information
Confidence intervals provide more complete information about the parameter estimate by including both the point estimate and the associated uncertainty. They convey not only the magnitude of the effect but also the precision of the estimate. This additional information can be valuable in making informed decisions and drawing meaningful conclusions.

3 Avoiding Arbitrary Cutoffs
Confidence intervals avoid the reliance on arbitrary significance levels (e.g., α = 0.05) often used in hypothesis testing. By presenting a range of plausible values, confidence intervals allow for a more nuanced assessment of the evidence. Researchers can evaluate the magnitude and precision of the estimate without the need for dichotomous decisions based on arbitrary thresholds.

4 Overemphasis on Statistical Significance
Statistical tests based on p-values have been criticized for leading to a binary interpretation of results (significant or not significant) and overemphasizing statistical significance. Confidence intervals, on the other hand, provide a more continuous and informative representation of the uncertainty, enabling a more balanced consideration of the findings.

5 Replicability and Reproducibility
Confidence intervals encourage replication and reproducibility of results. They provide a clearer indication of the range of values that future studies should aim to replicate, facilitating better comparison and synthesis of research findings.

In summary
It's important to note that confidence intervals and statistical tests serve different purposes and can provide complementary information. However, the preference for confidence intervals stems from their ability to offer a more comprehensive and interpretable summary of the data and the associated uncertainty.
