BioStats and Epidemiology BNB Notes
Biostatistics and
Epidemiology
2023
Based on videos by Boards and Beyond and content from the internet
Contents
The Gaussian Distribution
Central Tendency
Measures of Dispersion
Standard Deviation - Population
Standard Deviation - Sample
Variance
Z score
The True Mean of the Population
Standard error of the mean
Confidence interval
Hypothesis testing
The P value
Power
Type 1 error (α error)
Type 2 error (β error)
Tests of Significance
The P value
Choosing a test or Data Types
T-test
Analysis of Variance (ANOVA)
Chi-square test
Comparing 2 groups
Significance of an odds ratio
Correlation
Pearson coefficient
Coefficient of Determination
Study Designs
Cross-sectional study
Case Series
Cohort studies
Case-Control Study
Risk Quantification
Bias
1. Selection Bias
2. Measurement Bias
3. Confounding Bias
Minimizing Bias
4. Crossover studies
Effect Modification
Clinical Trials
Evidence-Based Medicine
Central Tendency
1. Mean: Sum/N
2. Median: Central value
a. Odd elements: central element
b. Even elements: mean of the central 2 elements
3. Mode:
a. Value with the highest frequency
b. The highest point on the graph
● In case the distribution is symmetrical, Mean, Median and Mode are all the same
● Above diagram shows a symmetrical Distribution
● Asymmetrical distributions → skewed
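The three measures above can be checked on a small dataset. This is a minimal sketch using Python's standard `statistics` module; the values are hypothetical and chosen so that the list has an even number of elements.

```python
import statistics

# Hypothetical data with an even number of elements.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # Sum/N = 30/6
median = statistics.median(values)  # mean of the central 2 elements (3 and 5)
mode = statistics.mode(values)      # value with the highest frequency
```

Here the mean (5), median (4) and mode (3) all differ, which is what a skewed distribution looks like.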
Measures of Dispersion
1. Standard deviation
2. Z score
3. Variance
Variance
● Variance in statistics is the average squared distance between the data
points and the mean.
● Because it uses squared units rather than the natural data units, the
interpretation is less intuitive.
● Higher values indicate greater variability, but there is no intuitive interpretation for
specific values.
Z score
● Defined for a single data point
● It is the number of SDs away from the mean that data point is
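A short sketch of how variance, standard deviation and the Z score relate, using the population formulas from Python's `statistics` module. The data and the helper name `z_score` are made up for illustration.

```python
import statistics

# Hypothetical measurements.
data = [4, 8, 6, 5, 7]

mean = statistics.mean(data)           # 6
variance = statistics.pvariance(data)  # average squared distance from the mean
sd = statistics.pstdev(data)           # square root of the variance


def z_score(x, mean, sd):
    """Number of SDs that the data point x lies away from the mean."""
    return (x - mean) / sd


z_of_8 = z_score(8, mean, sd)  # 8 is 2 units above the mean, i.e. 2/sd SDs
```

The variance is in squared units, while the SD and the Z score are back on the scale of the data, which is why they are easier to interpret.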
Confidence interval
● The range in which the true population mean is expected to lie, estimated from the sample
95% confidence
If the study were repeated many times, about 95% of the intervals built this way would contain the true mean
Not the same question as:
95% of new data points will fall into which range? (that is Mean +/- 2SD)
Example
Range in which 95% of measurements in a dataset fall = Mean +/- 2SD
Range in which true population mean likely falls = Confidence interval of the mean
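The difference between the two ranges in the example can be sketched numerically. The sample is hypothetical, the standard error of the mean is taken as SD/√n, and the multiplier 1.96 assumes a normal approximation (for small samples a t-distribution multiplier would be slightly larger).

```python
import math
import statistics

# Hypothetical sample of 16 measurements (not from the notes).
sample = [98, 102, 100, 101, 99, 100, 103, 97,
          100, 100, 101, 99, 102, 98, 100, 100]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample SD (divides by n - 1)
sem = sd / math.sqrt(n)         # standard error of the mean

# Range expected to hold about 95% of individual measurements.
data_range_95 = (mean - 2 * sd, mean + 2 * sd)

# Approximate 95% confidence interval for the true population mean.
# 1.96 is the normal-distribution multiplier (an assumption for small n).
ci_95 = (mean - 1.96 * sem, mean + 1.96 * sem)
```

Because the SEM shrinks as n grows, the confidence interval of the mean is always narrower than the Mean +/- 2SD range for the same data.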
Hypothesis testing
Do study results represent the population reality?
Or
The study found this, is it true universally or did we get this result by chance?
The null hypothesis can be easily rejected if there is a vast difference between the 2 sets of data
2. Scatter of data
3. Number of subjects
Power = 1- β
The P value
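The probability of getting a result at least as extreme as the one observed, if the null hypothesis is true
It is not the probability that the null hypothesis itself is correct
A p value below 5% (p < 0.05) lets us reject the null hypothesis; this cutoff (α) is the accepted probability of a false positive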
Power
Chance of finding a difference when one exists
We always want a high power
Power is a value decided while designing the study
Determined by:
1. Number of subjects
2. Difference in means
3. Scatter of data
Tests of Significance
Is the difference between these two groups significant?
OR
Are the differences we perceive real or a fluke?
OR
What is the probability that these differences are not real?
The P value
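The probability of getting a result at least as extreme as the one observed, if the null hypothesis is true
It is not the probability that the null hypothesis itself is correct
A p value below 5% (p < 0.05) lets us reject the null hypothesis; this cutoff (α) is the accepted probability of a false positive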
T test
Compares 2 mean values
Outputs a t value
The t value is like a Z score for the sample mean: its distance from the assumed population mean,
measured in standard errors and judged against a t distribution with n-1 degrees of freedom
We check the probability of getting a t value this extreme
If it is less than 5%, it would be very unlikely for us to get this value by chance
So our original assumption of the population mean is probably wrong
Increasing the number of patients may increase power and decrease the p-value
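As a rough illustration of the t value described above, the sketch below computes a one-sample t statistic with the standard library only. The function name `one_sample_t` and the numbers are hypothetical; turning t into a p value needs a t-distribution table or a statistics library, which is not shown here.

```python
import math
import statistics


def one_sample_t(sample, assumed_mean):
    """Return (t value, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample SD (divides by n - 1)
    sem = sd / math.sqrt(n)         # standard error of the mean
    t_value = (sample_mean - assumed_mean) / sem
    return t_value, n - 1


# Hypothetical sample compared against an assumed population mean of 2.
t_value, dof = one_sample_t([2, 4, 6], assumed_mean=2)
```

With more subjects the standard error shrinks, so the same difference in means gives a larger t value, which is why a larger study tends to have more power.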
3. The same variable for two different groups (comparing the direction of difference. Eg: is
A less than B by this magnitude or more? One tailed T test)
a. Null hypothesis is that they are the same
b. Formula as above
4. Paired T test (the same population, different values of the same period before and after)
a.
Comparing 2 groups
If the confidence intervals of the two groups overlap, the groups are treated as not significantly different
Eg: if the MIzyme level in the normal group is 10 ± 3 and in the MI group 14 ± 1, the ranges (7 to 13 and
13 to 15) overlap at one value, ie 13, and thus a significant difference cannot be claimed
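The overlap rule can be written as a small check. This is a sketch only; `intervals_overlap` is a hypothetical helper, and each group is described by its mean and the half-width of its interval.

```python
def intervals_overlap(mean_a, half_width_a, mean_b, half_width_b):
    """True if the ranges mean ± half_width of the two groups share any value."""
    low_a, high_a = mean_a - half_width_a, mean_a + half_width_a
    low_b, high_b = mean_b - half_width_b, mean_b + half_width_b
    return low_a <= high_b and low_b <= high_a


# Example from the notes: 10 ± 3 (7 to 13) versus 14 ± 1 (13 to 15).
notes_example = intervals_overlap(10, 3, 14, 1)

# Hypothetical tighter normal group: 10 ± 2 (8 to 12) versus 14 ± 1 (13 to 15).
tighter_example = intervals_overlap(10, 2, 14, 1)
```

The first pair overlaps at 13, so no significant difference is claimed; the second pair does not overlap.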
Correlation
Pearson coefficient
● A number between -1 and +1
● The sign shows whether the relationship is direct (positive) or inverse (negative)
● Greater the magnitude, less the spread of points, stronger the relation
● 0 means no linear relation
Coefficient of Determination
● The r² value is reported rather than r
● Always non-negative (between 0 and 1), since it is a square
● Indicates what percentage of the variation in y is explained by changes in x
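A minimal sketch of how r and r² are obtained from paired data, written out by hand so the steps are visible. The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)


xs = [1, 2, 3, 4]
r_direct = pearson_r(xs, [2, 4, 6, 8])    # y rises with x
r_inverse = pearson_r(xs, [8, 6, 4, 2])   # y falls as x rises
r_squared = r_inverse ** 2                # coefficient of determination
```

A perfectly inverse relation gives r = -1 but r² = 1, which shows why r² carries strength but not direction.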
Study Designs
Cross-sectional study
1. Patients studied are part of a specific group
a. Medical students
b. Citizens of Mumbai
c. Tall people
2. Frequencies identified
a. Of a risk factor
b. Of a disease
Eg: How many citizens of Mumbai smoke?
3. It’s a snapshot in time
a. Patients are not followed
4. Main outcome is prevalence
a. 50% of Mumbaikars smoke
b. 5% of Mumbaikars have lung cancer
5. May have more than one group
a. One group of men, one of women
6. Things that can’t be measured
a. How much smoking increases the risk of lung cancer
b. Odds of getting LC in smokers vs nonsmokers
Check:
1. Is there a time frame?
2. What data has been collected?
3. Were there selection criteria besides the population under study? → there should be none
for a cross-sectional study
Case Series
1. Purely descriptive study, common for a new disease of unknown cause
2. Multiple cases of a condition are analyzed
a. Patient demographics, symptoms and other factors
3. Done to look for etiology or course
Cohort studies
1. Compares (group with exposure) to (group without exposure)
a. Identify two groups by risk factor
b. Follow-up
2. Does Exposure change likelihood of disease?
a. Main outcome is a risk ratio
Case-Control Study
1. Compares (group with disease) to (group without disease)
a. Looks for exposure to risk factors
b. The opposite of a cohort study in that you start with disease instead of risk
c. Better for diseases with low incidence rates
2. Both groups should have both exposed and unexposed individuals
a. One cannot be selective in only having a group of exposed sufferers and
unexposed non-sufferers
b. Both the cases and controls should have both exposed and unexposed
3. Matching
a. Minimize the differences between cases and control
i. Ideally, they should be identical except in the presence or absence of
disease
b. Reduces confounding factors (other factors which may affect the disease)
4. Output is the odds ratio
a. Odds of disease in exposed to odds of disease in unexposed
5. Different from a Randomized drug trial
a. Patients identified by disease
Check:
1. Diseased and undiseased
Case-Control → Odds ratio (of developing disease)
Cohort → Relative risk (of developing disease)
Risk Quantification
Why is it important?
● Understanding of disease comes from estimating risk
○ Smoking increases carcinoma risk
○ Exercise reduces MI risk
● These things are understood by quantifying risk
1. Data Collection
a. Variables
i. Exposure to risk factor?
ii. Suffered disease?
b. Obtained from
i. Case-Control
ii. Cohort studies
2. Tabulation in a 2x2 table
Relative risk
● Not valid in Case-Control studies
● Changes depending on the number of cases taken
Odds ratio
● Does not change with the number of cases
● A:C will be constant
● B:D will be constant
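The point about case numbers can be checked with a few lines of arithmetic. The sketch assumes the usual 2x2 layout (A = exposed with disease, B = exposed without disease, C = unexposed with disease, D = unexposed without disease); the table itself is not reproduced in these notes, so that labelling and all counts are assumptions.

```python
def odds_ratio(a, b, c, d):
    """Odds of disease in exposed divided by odds of disease in unexposed."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of disease in exposed divided by risk of disease in unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: A, B, C, D.
or_original = odds_ratio(20, 80, 10, 90)
rr_original = relative_risk(20, 80, 10, 90)

# A case-control study that recruits 10x as many cases: A and C both scale,
# so A:C stays constant, while the controls (B and D) are unchanged.
or_more_cases = odds_ratio(200, 80, 100, 90)
rr_more_cases = relative_risk(200, 80, 100, 90)
```

The odds ratio is the same in both tables, while the relative risk shifts with the number of cases the investigator chose to recruit.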
Incidence
1. The number of NEW cases developing per unit time
2. The incidence rate is the number of new cases per unit time divided by the number of
people at risk (those who do not yet have the disease)
Prevalence
1. The number of active cases at a given point in time
● The diagnostic cut off point for the test determines its sensitivity and specificity
● Sensitivity
○ How accurately the test determines positivity among all positive patients
○ $\text{Sensitivity} = \dfrac{\text{True Positives}}{\text{Total Positives}} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
○ The more to the left you take the cutoff, the fewer the false negatives, better the
sensitivity
○ This is at the expense of specificity
● Specificity
○ How accurately can the test differentiate a negative from a positive patient
○ OR How many negatives were truly negative (differentiated from a positive pt)
○ $\text{Specificity} = \dfrac{\text{True Negatives}}{\text{Total Negatives}} = \dfrac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$
○ The more to the right you take the cutoff point, the fewer false positives
● Right curve has a higher specificity and sensitivity due to less overlap
● Higher the sensitivity → Better at ruling out the disease (fewer false negatives, cost of
false positives)
● Higher the specificity → Better at ruling in the disease (fewer false positives,
confirmatory)
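The trade-off between sensitivity and specificity as the cutoff moves can be sketched with made-up test values. The sketch assumes that diseased patients tend to have higher values and that a value at or above the cutoff counts as a positive test; the data and the name `sens_spec` are hypothetical.

```python
# Hypothetical test values; the two groups overlap between 4 and 6.
healthy = [1, 2, 3, 4, 5, 6]
diseased = [4, 5, 6, 7, 8, 9]


def sens_spec(cutoff):
    """Return (sensitivity, specificity) when value >= cutoff is a positive test."""
    true_pos = sum(v >= cutoff for v in diseased)
    false_neg = len(diseased) - true_pos
    true_neg = sum(v < cutoff for v in healthy)
    false_pos = len(healthy) - true_neg
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


low_cutoff = sens_spec(4)    # cutoff moved left: no false negatives
high_cutoff = sens_spec(7)   # cutoff moved right: no false positives
```

The low cutoff gives sensitivity 1.0 with specificity 0.5, and the high cutoff gives the reverse, matching the rule-out and rule-in roles described above.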
Accuracy Precision
ROC Curves
“Receiver Operating Characteristics”
Dependent on the test itself
The more the bend, the more sensitivity you can gain with minimal effect on specificity
The greater the AUC, the better the test
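One way to see what the AUC measures is to build the curve from every possible cutoff and add up the area with the trapezoidal rule. This is an illustrative sketch under the assumption that higher values indicate disease; the function names and data are hypothetical.

```python
def roc_points(healthy, diseased):
    """Return sorted (false positive rate, true positive rate) pairs, one per cutoff."""
    values = sorted(set(healthy) | set(diseased))
    cutoffs = values + [values[-1] + 1]  # the last cutoff calls everyone negative
    points = []
    for cutoff in cutoffs:
        tpr = sum(v >= cutoff for v in diseased) / len(diseased)  # sensitivity
        fpr = sum(v >= cutoff for v in healthy) / len(healthy)    # 1 - specificity
        points.append((fpr, tpr))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# No overlap between groups: the curve bends fully into the top-left corner.
auc_good_test = auc(roc_points([1, 2, 3], [4, 5, 6]))

# Identical groups: the curve is the diagonal and the test is uninformative.
auc_useless_test = auc(roc_points([1, 2, 3], [1, 2, 3]))
```

The first test reaches an AUC of 1.0 and the second only 0.5, the two ends of the scale for a test that is at least as good as chance.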
Likelihood ratios
Not covered in these notes
Bias
1. Selection Bias
Bias in selection of group or retention of patients
a. Sampling Bias
i. Patients selected for the study are not representative
ii. Study findings may not apply to the population
b. Attrition Bias
i. Seen in prospective studies
ii. Patients lost to follow up unequally
iii. Eg: Smokers may die which may result in reduced morbidity seen in
smoking population
c. Berkson’s Bias
i. Hospitalized patients are chosen to be either case or control
ii. Usually have more severe symptoms
iii. Also have better access to care
iv. Alter the results of the study
d. Participation Bias/Nonresponsive Bias
i. Optional surveys often see this
ii. Due to convenience sampling
iii. Those who choose not to respond are not included
iv. Respondents may only represent a specific group
e. Prevalence Bias (Neyman Bias)
i. Exposure occurs long before disease assessment
ii. Patients may die quickly
iii. Prevalence calculated based on survivors
f. Length Time Bias
i. Patients of severe disease die
ii. These patients do not get studied
iii. Eg: HIV+ patient study may show that the disease is asymptomatic (those
affected have died)
g. Lead Time Bias
i. Screening test identifies disease earlier than in the general population
ii. Survival may appear longer than it is
2. Measurement Bias
a. Recall Bias
i. Inaccurate recall of past events by subjects of study
ii. Common in surveys
b. Observer Bias
i. Investigator knows the status of the patient
ii. Avoided by blinding
c. Procedure Bias
i. A group receiving a procedure is more likely to get the care and attention
ii. Avoid by:
1. Blinding of the care team
2. Placebo surgery
d. Hawthorne Effect
i. Patients or care providers change behaviour patterns because they are
being observed
ii. May result in improvement of condition
e. Pygmalion Effect (Observer Expectancy effect)
i. The researcher believes in the efficacy of the treatment
ii. This influences the study outcome
3. Confounding Bias
a. An unmeasured factor confounds the study result
Example
b. Alcoholics and Lung cancer
i. Alcoholics show a higher prevalence
ii. Smoking is more prevalent in alcoholics
iii. Smoking is the true cause of the cancer prevalence
c. Stratified analysis
i. Eliminates Confounding bias
ii. Done by further dividing the sample into 2 as per the confounding factor
Minimizing Bias
1. Randomization
2. Blinding
3. Matching
4. Crossover studies
a. Two groups
b. But each subject is also their own control
c. Can be done in therapy studies
d. First time period:
i. Group A given drug
ii. Group B given placebo
e. Wash out period to prevent carry over of drug effect
f. Second time period:
i. Group A given placebo
ii. Group B given drug
g. Avoids confounding factors entirely
Effect Modification
● Not a kind of bias
● Some third factor alters the effect of exposure
● Eg: Presence of a gene X determines whether a drug A causes DVT
● Gene X is the effect modifier
● Identified (not eliminated) via stratified analysis: the effect of exposure differs between strata
Clinical Trials
1. If you want to test a drug
a. Give the drug to the vulnerable population
b. See if there's an effect
2. This has problems
a. There may be no real effect due to the drug
b. Placebo effect
c. Behaviours changed due to knowledge of receiving drug
3. Clinical trial features
a. Control
i. A group receiving the drug is compared with a group that does not
(placebo only given on blinding. No blinding → give nothing)
ii. Ensures effects not due to chance
b. Randomization
i. Equal distribution of variables
ii. Prevents confounding
iii. Limits selection bias (participants can't choose drug or placebo)
iv. Two groups must be shown not to be significantly different
c. Blinding
i. Control subjects given placebo
ii. Single blind: Subject doesn't know if drug or placebo
iii. Double blind: Subject and provider unaware
iv. Triple blind: above + data analysts don't know which group received drug
4. Clinical trials not used for everything
a. Expensive
b. Long duration
c. New treatment options may emerge
d. Parachute example
i. Some treatments have obvious effects
ii. Unethical to perform an experiment on whether parachutes decrease
mortality on jumping off a plane
5. Data and reporting
Example:
3 year mortality with drug X = 20%
3 year mortality in control = 50%
a. Absolute risk reduction = 50 - 20 = 30%
b. Relative risk reduction → 30/50 = 60% relative reduction
c. Number needed to treat
i. The number of patients to be treated to prevent 1 adverse event
ii. = 1/Absolute risk reduction
6. Meta Analysis
a. Study that puts together many individual studies' data
b. Increases the number of subjects and controls
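The risk-reduction arithmetic from item 5 can be sketched in a few lines. The sketch reads the example's 20% and 50% as adverse-event rates in the drug and control groups (an assumption; the subtraction 50 - 20 only makes sense on that reading), and the variable names are made up.

```python
import math

# Assumed adverse-event rates taken from the example in item 5.
control_event_rate = 0.50
treated_event_rate = 0.20

# Absolute risk reduction: difference in event rates.
arr = control_event_rate - treated_event_rate

# Relative risk reduction: ARR as a fraction of the control event rate.
rrr = arr / control_event_rate

# Number needed to treat = 1 / ARR, rounded up to whole patients.
nnt = math.ceil(1 / arr)
```

With these numbers the ARR is 30%, the RRR is 60%, and about 4 patients need to be treated to prevent 1 adverse event.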
■ Pyramid of evidence
Fin.