BioStats and Epidemiology BNB Notes



Biostatistics and Epidemiology
2023

Notes by Jason Sequeira

Batch of 2020, GSMC and KEM Hospital, Mumbai

Based on videos by Boards and Beyond and content from the internet

Contents
The Gaussian Distribution
Central Tendency
Measures of Dispersion
Standard Deviation - Population
Standard Deviation - Sample
Variance
Z score
The True Mean of the Population
Standard error of the mean
Confidence interval

Hypothesis testing
The P value
Power
Type 1 error (α error)
Type 2 error (β error)

Tests of Significance
The P value
Choosing a test or Data Types
T-test
Analysis of Variance (ANOVA)
Chi-square test
Comparing 2 groups
Significance of an odds ratio

Correlation
Pearson coefficient
Coefficient of Determination

Study Designs
Cross-sectional study
Case Series
Cohort studies
Case-Control Study

Risk Quantification

Sensitivity and Specificity
Incidence
Prevalence
Incidence and Prevalence
Sensitivity and Specificity
Positive Predictive Value
Negative Predictive Value
Accuracy
Precision
ROC Curves
Likelihood ratios

Bias
1. Selection Bias
2. Measurement Bias
3. Confounding Bias

Minimizing Bias
4. Crossover studies
Effect Modification

Clinical Trials

Evidence-Based Medicine

The Gaussian Distribution

Central Tendency
1. Mean: Sum/N
2. Median: Central value
a. Odd elements: central element
b. Even elements: mean of the central 2 elements
3. Mode:
a. Value with the highest frequency
b. The highest point on the graph
● In case the distribution is symmetrical, Mean, Median and Mode are all the same
● Above diagram shows a symmetrical Distribution
● Asymmetrical distributions → skewed

○ Negative Skew: More values lie below the mode


○ Positive Skew: More values lie above the mode

○ Mode is still the highest point, obviously


○ Mean is pulled furthest toward the tail (the direction of the skew)
○ Median lies between the mean and the mode
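A minimal sketch of the three measures, using Python's built-in statistics module on a made-up dataset (values are illustrative only); the single large value skews the data positively, so mean > median > mode, as described above:

```python
# Hedged sketch: mean, median and mode with the standard-library statistics module.
import statistics

values = [2, 3, 3, 4, 5, 9]            # made-up sample; the 9 gives a positive skew

mean = statistics.mean(values)          # Sum/N = 26/6 ≈ 4.33
median = statistics.median(values)      # even count -> mean of the central two = (3 + 4)/2 = 3.5
mode = statistics.mode(values)          # most frequent value = 3

print(mean, median, mode)               # mean > median > mode, as expected with a positive skew
```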

Measures of Dispersion

1. Standard deviation
2. Z score
3. Variance

Standard Deviation - Population

● In this, the data of the entire population is considered
● σ = √[ Σ(xᵢ − μ)² / N ], where μ is the population mean and N the population size
● This is neither practical nor feasible
● A sample of the population is taken instead

Standard Deviation - Sample

● s = √[ Σ(xᵢ − x̄)² / (n − 1) ]
● n − 1 (Bessel's correction) compensates for using the sample mean x̄ in place of the true population mean μ
● Can be thought of as the average distance of each value from the mean
● Differences are squared so that deviations above and below the mean do not cancel out
● Roughly 68% of values lie within ±1 SD, 95% within ±2 SD, and 99.7% within ±3 SD; in practice, treat that 99.7% as covering essentially 100% of the data
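A quick sketch of the n − 1 correction in code (made-up numbers); statistics.stdev divides by n − 1 (sample SD), while pstdev divides by N (population SD), so the sample SD comes out slightly larger:

```python
# Hedged sketch: sample vs population standard deviation.
import statistics

sample = [10, 12, 23, 23, 16, 23, 21, 16]   # made-up sample

s = statistics.stdev(sample)       # sample SD, denominator n - 1  (≈ 5.24)
sigma = statistics.pstdev(sample)  # population SD, denominator N  (≈ 4.90)

print(round(s, 2), round(sigma, 2))
```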



Variance
● Variance in statistics is the average squared distance between the data
points and the mean.
● Because it uses squared units rather than the natural data units, the
interpretation is less intuitive.
● Higher values indicate greater variability, but there is no intuitive interpretation for
specific values.

● Essentially the square of SD

Z score
● Defined for a single data point
● It is the number of SDs that data point lies from the mean: z = (x − x̄) / s
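A one-line illustration of the definition above (same made-up sample as before); the z score is simply the distance from the mean expressed in SD units:

```python
# Hedged sketch: z score of a single data point.
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]   # made-up sample
x = 25                                     # the data point of interest

mean = statistics.mean(data)               # 18.0
s = statistics.stdev(data)                 # sample SD ≈ 5.24; variance is simply s ** 2

z = (x - mean) / s                         # number of SDs that x lies from the mean
print(round(z, 2))                         # ≈ 1.34
```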

The True Mean of the Population


● By calculating the central tendencies of a sample, we are estimating the population mean
● The true mean will lie within a certain range +/- of this value
● The extent of this difference may be represented in 2 ways:
1. Standard error of the mean
2. Confidence intervals

Standard error of the mean


● Measures how far the sample mean is likely to be from the true population mean
● SEM = σ / √n
● σ is the SD of the sample
● n is the sample size
● The standard error is the standard deviation of the sampling distribution of the mean:
● Multiple samples are collected
● Each sample has many individual data points
● Each sample has its own mean and standard deviation
● The means of all the samples can be plotted, giving a sampling distribution
● The standard error is the SD of this distribution

Confidence interval
● The range that we are 95% confident contains the true population mean

95% confidence
If the study were repeated many times, 95% of the intervals calculated this way would contain the true mean
OR
We are 95% confident that the true mean lies within this range

● 95% CI = mean ± Z × SEM = x̄ ± 1.96 × σ/√n
● Use the Z value within which 95% of a normal distribution falls
● It’s 2 as a rough rule of thumb, but it’s actually 1.96

● The more data points you collect, the closer your sample mean is likely to be to the true mean
● Thus, the tighter the range
● Hence the √n in the denominator

Don’t confuse standard deviation with confidence intervals


Standard deviation describes a dataset
Suppose we have ten measurements
● These data points have a mean and standard deviation
● 95% of the data points fall within mean ± 2 SD
Confidence intervals do not describe the data points in the dataset
● They are an inferred range for where the true population mean lies

Example
Range in which 95% of measurements in a dataset fall = Mean +/- 2SD
Range in which true population mean likely falls = Confidence interval of the mean
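A short sketch (made-up measurements) tying the pieces above together; note how the 95% CI of the mean (±1.96 × SEM) is far narrower than the mean ± 2 SD range that covers 95% of the individual data points:

```python
# Hedged sketch: SEM and the 95% confidence interval of the mean.
import math
import statistics

sample = [98, 102, 95, 101, 99, 97, 103, 100, 96, 104]   # made-up measurements

n = len(sample)
mean = statistics.mean(sample)        # 99.5
s = statistics.stdev(sample)          # sample SD (n - 1) ≈ 3.03
sem = s / math.sqrt(n)                # standard error of the mean ≈ 0.96

ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem      # ≈ (97.6, 101.4)
data_low, data_high = mean - 2 * s, mean + 2 * s            # ≈ (93.4, 105.6)

print(f"95% CI of the mean: ({ci_low:.1f}, {ci_high:.1f})")
print(f"Range holding ~95% of data points: ({data_low:.1f}, {data_high:.1f})")
```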

Hypothesis testing
Do study results represent the population reality?
Or
The study found this, is it true universally or did we get this result by chance?

Usually, you compare 2 groups


Eg: A hypothetical enzyme MIzyme is hypothesized to be raised in MI patients

H0 → There is no difference in MIzyme levels between MI patients and normal individuals


H1 → There is a difference

The null hypothesis can be easily rejected if there is a vast difference between the 2 sets of data

The problem comes up when data spreads and overlaps

We’re mathematically calculating probabilities of the 2 groups being different


Depends on:
1. Difference between means

2. Scatter of data
3. Number of subjects

There are four possible outcomes from hypothesis testing


1. H0 rejected when it is actually false → correct conclusion (a real difference is detected)
2. H0 retained when it is actually true → correct conclusion (no difference is found, and none exists)
3. H0 rejected when it is actually true → false positive (Type 1 / α error)
4. H0 retained when it is actually false → false negative (Type 2 / β error)

Power = Chance of detecting difference


α = Chance of seeing a difference that is not real
β = chance of missing a difference that is really there

Power = 1- β

Since 1 − probability of a false negative = probability of a true positive
(this interpretation applies only when H1 is actually true for the population)

The P value
The probability of obtaining results at least this extreme if the null hypothesis were true
Loosely, it represents the probability that the observed difference is a false positive
A value below 0.05 (5%) lets us reject the null hypothesis

Power
Chance of finding a difference when one exists
We always want a high power
Power is a value decided while designing the study
Determined by:
1. Number of subjects
2. Difference in means
3. Scatter of data

You can control the number of subjects to increase power
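A rough sketch of how power moves with sample size, difference between means, and scatter, using a standard normal approximation for a two-sided, two-sample comparison (scipy assumed available; all numbers made up):

```python
# Hedged sketch: approximate power via the normal (z) approximation.
from scipy.stats import norm

def approx_power(diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)                     # 1.96 for alpha = 0.05
    z_effect = diff / (sd * (2 / n_per_group) ** 0.5)     # difference in standard-error units
    return norm.cdf(z_effect - z_alpha)

print(round(approx_power(diff=5, sd=10, n_per_group=30), 2))   # ≈ 0.49
print(round(approx_power(diff=5, sd=10, n_per_group=64), 2))   # ≈ 0.81 — more subjects, more power
```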

Type 1 error (α error)


Probability of a false positive when the null hypothesis is actually true
Due to random chance or
Improper research techniques

Type 2 error (β error)


Probability of a false negative when the null hypothesis should be rejected
May be due to
1. Small difference between groups
2. Low number of data points

Tests of Significance
Is the difference between these two groups significant?
OR
Are the differences we perceive real or a fluke?
OR
What is the probability that these differences are not real?

The P value
The probability of obtaining results at least this extreme if the null hypothesis were true
Loosely, it represents the probability that the observed difference is a false positive
A value below 0.05 (5%) lets us reject the null hypothesis

Choosing a test or Data Types


Test types
1. T-test
2. ANOVA
3. Chi-square test
The choice of test depends on the type of data and the number of groups

T-test - 2 groups, quantitative variables


ANOVA - More than 2 groups, quantitative variables
Chi-square - Qualitative variables

T test
Compares 2 mean values
Outputs a t value
The t value is essentially the Z score of the sample mean within the sampling distribution (the sample means would form a normal distribution), adjusted for the degrees of freedom (n − 1)
We check the probability of obtaining a t value this extreme
If that probability is less than 5% (p < 0.05), it would be very unlikely to get this value by chance
So our original assumption about the population mean (the null hypothesis) is probably wrong

P value is a decimal probability

Increasing the number of patients may increase power and decrease the p-value

Null Hypothesis: There is no difference



This is assumed to, initially, be true


The t-test outputs the probability of getting the data and results we have collected, assuming the null hypothesis is true
If this probability is extremely low, we should almost never have obtained such a result under that assumption
But we HAVE collected this data, so we can say that the original (null) hypothesis was probably incorrect

The t test can be used for:


1. Calculated mean and presumed population mean (One sample T-test)
a. Null hypothesis: The presumed mean is a value we think is correct
b. We are checking if the mean we have calculated is significantly different from the
presumed population mean
c. If it is significantly different, we can reject our original presumption and accept the
calculated mean as the true value

d. t = (x̄ − presumed mean) / (s / √n); note how this is the Z score scaled by √n
e. degrees of freedom = n − 1
2. The same variable for two different groups (are the groups different or not? No question
of direction of this difference. Two tailed T-test)
a. Null hypothesis is that they are the same
b. Run the t-test
c. If statistically significant, they are not the same

3. The same variable for two different groups (comparing the direction of difference. Eg: is
A less than B by this magnitude or more? One tailed T test)
a. Null hypothesis is that they are the same

b. Formula as above
4. Paired T test (the same subjects, the same variable measured before and after)
a. t = (mean of the paired differences) / (SD of the differences / √n)
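A compact sketch of the t-test variants listed above, using scipy.stats on made-up data (all variable names and values are illustrative):

```python
# Hedged sketch: one-sample, two-sample and paired t-tests with scipy.
from scipy import stats

group_a = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]   # made-up measurements
group_b = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]
before  = [140, 152, 138, 145, 150]               # made-up paired readings (same subjects)
after   = [132, 145, 135, 139, 141]

# 1. One-sample: is the mean of group_a different from a presumed population mean of 12.5?
t1, p1 = stats.ttest_1samp(group_a, popmean=12.5)

# 2./3. Two-sample (independent) t-test; for a one-tailed question with a pre-specified
#       direction, the two-sided p value is conventionally halved.
t2, p2 = stats.ttest_ind(group_a, group_b)

# 4. Paired t-test: the same subjects measured before and after.
t3, p3 = stats.ttest_rel(before, after)

print(p1, p2, p3)   # reject H0 wherever p < 0.05
```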

Analysis of Variance (ANOVA)


Compares more than 2 mean values
Outputs an F statistic, from which a p value is obtained

● Null hypothesis: there is no difference


● Alternate hypothesis: they are not the same
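A minimal sketch of a one-way ANOVA across three groups with scipy (made-up data):

```python
# Hedged sketch: one-way ANOVA.
from scipy import stats

diet_a = [2.1, 2.5, 1.9, 2.3, 2.2]   # made-up outcomes for three groups
diet_b = [2.8, 3.0, 2.6, 2.9, 3.1]
diet_c = [2.0, 2.2, 2.1, 1.8, 2.3]

f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)
print(f_stat, p_value)   # p < 0.05 -> at least one group mean differs from the others
```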

Chi square test


Compares categorical data

Examples of categorical data


40% of patients are on blood thinners
Implies the other 60% are not

For the method, refer to Sanyal
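For orientation only, a minimal sketch of a chi-square test on a 2x2 table of categorical counts, using scipy (counts are made up):

```python
# Hedged sketch: chi-square test of independence for categorical data.
from scipy.stats import chi2_contingency

#            on blood thinners, not on blood thinners
observed = [[40, 60],    # group 1 (made-up counts)
            [25, 75]]    # group 2

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)   # p < 0.05 -> the proportions differ significantly between the groups
```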

Comparing 2 groups
If the confidence intervals overlap, the two groups are not significantly different
Eg: if the MIzyme level in the normal group is 10 ± 3 and in the MI group 14 ± 1, the intervals overlap at at least one value (ie 13), and thus there cannot be a significant difference

Significance of an odds ratio


Refer to odds ratios
If the confidence interval of the odds ratio includes 1, it is not significant
Eg: risk of lung cancer among chemical workers studied, risk ratio = 1.4 ± 0.5 (interval 0.9 to 1.9)
The interval includes 1, therefore the increased risk is not significant

Correlation

Pearson coefficient
● A number between -1 and +1
● The sign shows whether the relationship is direct (positive) or inverse (negative)
● The greater the magnitude, the less the scatter of points and the stronger the relationship
● 0 means no relation

Coefficient of Determination
● r2 value is reported rather than r
● Always a positive value, obviously
● Indicates the proportion (%) of the variation in the y value that is explained by changes in the x value
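A small sketch (made-up x and y values) showing r and r² together; the near-perfect inverse relationship gives r close to −1 and r² close to 1:

```python
# Hedged sketch: Pearson r and the coefficient of determination r².
from scipy.stats import pearsonr

hours_exercised = [0, 1, 2, 3, 4, 5, 6]          # made-up x values
resting_hr      = [82, 80, 77, 75, 72, 70, 69]   # made-up y values

r, p_value = pearsonr(hours_exercised, resting_hr)
r_squared = r ** 2

print(round(r, 2), round(r_squared, 2))
# r ≈ -1 (strong inverse relation); r² ≈ 0.99 -> roughly 99% of the variation in y
# is explained by variation in x
```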

Study Designs

Cross-sectional study
1. Patients studied are part of a specific group
a. Medical students
b. Citizens of Mumbai
c. Tall people
2. Frequencies identified
a. Of a risk factor
b. Of a disease
Eg: How many citizens of Mumbai smoke?
3. It’s a snapshot in time
a. Patients are not followed
4. Main outcome is prevalence
a. 50% of Mumbaikars smoke
b. 5% of Mumbaikars have lung cancer
5. May have more than one group
a. One group of men, one of women
6. Things that can’t be measured
a. How much smoking increases the risk of lung cancer
b. Odds of getting LC in smokers vs nonsmokers

Check:
1. Is there a time frame?
2. What data has been collected?
3. Were there selection criteria besides the population under study? → should be absent for a cross-sectional study

Case Series
1. Purely descriptive study, common for a new disease of unknown cause
2. Multiple cases of a condition are analyzed
a. Patient demographics, symptoms and other factors
3. Done to look for etiology or course

Cohort studies
1. Compares (group with exposure) to (group without exposure)
a. Identify two groups by risk factor
b. Follow-up
2. Does Exposure change likelihood of disease?
a. Main outcome is a risk ratio

i. How much does exposure increase risk of disease


Eg:
01. 50% of smokers get lung cancer within 5 years
02. 10% of non-smokers get lung cancer within 5 years
03. RR = 50/10 = 5
3. Can be PROSPECTIVE or RETROSPECTIVE
a. Prospective: Monitor the groups with risk factors over time
b. Retrospective: Monitor risk factor groups and see if they have history of the
disease
i. Eg history of pneumonia in smokers vs non-smokers
4. Problems
a. Do not work for rare diseases
i. Lots of patients would be required for the disease to arise in the groups
exposed or unexposed to the risk factor
ii. Eg in prospective: groups of smokers and nonsmokers, there is a chance
that 0 patients develop lung cancer. It would be easier to identify patients
with lung cancer and then trace them back (case-control)
Check:
1. Time frame
2. 2 categories or subcategories of a specific population
3. Check retrospective or prospective

Case-Control Study
1. Compares (group with disease) to (group without disease)
a. Looks for exposure to risk factors
b. The opposite of a cohort study in that you start with disease instead of risk
c. Better for diseases with low incidence rates
2. Both groups should have both exposed and unexposed individuals
a. One cannot be selective in only having a group of exposed sufferers and
unexposed non-sufferers
b. Both the cases and controls should have both exposed and unexposed
3. Matching
a. Minimize the differences between cases and control
i. Ideally, they should be identical except in the presence or absence of
disease
b. Reduces confounding factors (other factors which may affect the disease)
4. Output is the odds ratio
a. Odds of disease in exposed to odds of disease in unexposed
5. Different from a Randomized drug trial
a. Patients identified by disease

b. Exposure to drug is randomized

Check:
1. Diseased and undiseased

| Case-Control | Cohort |
| --- | --- |
| Patients categorized by disease | Patients categorized by exposure to risk factor |
| Odds ratio (of developing disease) | Relative risk (of developing disease) |

| | Cross-Sectional | Case-Control | Cohort |
| --- | --- | --- | --- |
| Identification | By location or population | By disease | By exposure to risk factor |
| Time period | No time period | Always retrospective | May be prospective or retrospective |
| Output | Prevalence | Odds ratio | Risk ratio |



Risk Quantification
Why is it important?
● Understanding of disease comes from estimating risk
○ Smoking increases carcinoma risk
○ Exercise reduces MI risk
● These things are understood by quantifying risk

1. Data Collection
a. Variables
i. Exposure to risk factor?
ii. Suffered disease?
b. Obtained from
i. Case-Control
ii. Cohort studies
2. Tabulation in a 2x2 table

| | Disease | No disease |
| --- | --- | --- |
| Exposed | A | B |
| Unexposed | C | D |

a. Used for calculation of


i. Risk of disease
ii. Risk ratio
iii. Odds ratio
iv. Attributable risk
v. Number needed to harm
b. Risk of disease in a group
The risk of developing the disease, with and without exposure, is found
i. Risk in exposed group = A / (A + B)
ii. Risk in unexposed group = C / (C + D)
c. Risk Ratio
Risk with exposure ÷ Risk without exposure = [A / (A + B)] / [C / (C + D)]
i. Usually comes from a Cohort study (follows groups with and without the risk factor)
ii. Can be from 0 to ∞
1. RR = 1 → No increased risk from exposure
2. RR > 1 → Exposure increases risk

3. RR < 1 → Exposure is protective


d. Odds Ratio
Odds of exposure among the diseased / Odds of exposure among the nondiseased
i. What are odds? The ratio of the probability of an event occurring to the probability of it not occurring
ii. Comes from a Case-Control study
iii. Odds ratio = (A:C) / (B:D) = AD / BC
iv. Can be from 0 to ∞
1. OR = 1 → Equal exposure among diseased and nondiseased
2. OR > 1 → Greater exposure found among diseased
3. OR < 1 → Lesser exposure found among diseased

| Risk Ratio | Odds Ratio |
| --- | --- |
| Preferred | Less preferred |
| Tells you how much exposure to the risk factor multiplies the risk of disease compared to no exposure | |
| Not valid in Case-Control studies: changes depending on the number of cases taken | Does not change with the number of cases: A:C and B:D stay constant |

e. Rare disease assumption


i. In case of a rare disease, B >> A and D>> C
ii. This means the OR ≈ RR
iii. A case-control study may then be used to estimate the Relative Risk
1. Case-control studies are cheaper
2. The odds ratio on its own is a weaker measure of association
iv. Remember that an RR reported for a Case-Control study is VALID only for a rare disease
f. Attributable Risk
i. Risk in exposed - Risk in unexposed
ii. Additional risk due to exposure
g. Attributable Risk %
i. = (Attributable Risk / Risk in Exposed) × 100
h. Number needed to harm
i. The number of patients who need to be exposed for one case to arise
ii. = 1 / Attributable risk
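A minimal sketch working through the 2x2-table measures above with made-up cohort counts (the counts echo the earlier smoking example: 50% risk in the exposed, 10% in the unexposed):

```python
# Hedged sketch: risk measures from a 2x2 table.
# A = exposed & diseased, B = exposed & healthy, C = unexposed & diseased, D = unexposed & healthy.
A, B, C, D = 50, 50, 10, 90   # made-up counts

risk_exposed   = A / (A + B)                         # 0.50
risk_unexposed = C / (C + D)                         # 0.10

risk_ratio = risk_exposed / risk_unexposed           # 5.0 -> exposure multiplies risk by 5
odds_ratio = (A / C) / (B / D)                       # (A:C)/(B:D) = AD/BC = 9.0

attributable_risk     = risk_exposed - risk_unexposed           # 0.40 extra risk from exposure
attributable_risk_pct = attributable_risk / risk_exposed * 100  # 80%
number_needed_to_harm = 1 / attributable_risk                   # 2.5 exposures per extra case

print(risk_ratio, odds_ratio, attributable_risk_pct, number_needed_to_harm)
```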

Sensitivity and Specificity

Incidence
1. The number of NEW cases developing per unit time
2. The incidence rate is the number of new cases per unit time divided by the number of people at risk

Prevalence
1. The total number of existing (active) cases at a given time

Incidence and Prevalence


1. Chronic disease
a. Prevalence >>> Incidence
b. Eg 5 new cases per year but the 5 of each previous year are still alive
2. Rapidly fatal disease
a. Incidence >>> Prevalence
b. Eg 20 new cases, all previous died
3. New Primary Prevention Program
a. Incidence and Prevalence fall
4. New drug to improve survival
a. Incidence unchanged
b. Prevalence increases

Sensitivity and Specificity

● The diagnostic cut off point for the test determines its sensitivity and specificity

● There will be 4 kinds of results


○ True positive
○ False Positive
○ True Negative
○ False Negative

● Sensitivity
○ How accurately the test determines positivity among all positive patients
○ Sensitivity = True Positives / Total Positives = True Positives / (True Positives + False Negatives)
○ The more to the left you take the cutoff, the fewer the false negatives, better the
sensitivity
○ This is at the expense of specificity

● Specificity
○ How accurately can the test differentiate a negative from a positive patient
○ OR How many negatives were truly negative (differentiated from a positive pt)
○ Specificity = True Negatives / Total Negatives = True Negatives / (True Negatives + False Positives)

○ The more to the right you take the cutoff point, the fewer false positives

● Maximum sensitivity and specificity at extremes


● Equal weightage at the intersection of curves
● Specificity and Sensitivity are based on the test and not on the prevalence of the disease

● Right curve has a higher specificity and sensitivity due to less overlap
● Higher the sensitivity → Better at ruling out the disease (fewer false negatives, cost of
false positives)
● Higher the specificity → Better at ruling in the disease (fewer false positives,
confirmatory)

● For Rare diseases


○ Screen with a sensitive test
■ Very few people are actually positive

■ All positives will truly show up


■ A few false positives will also show up
■ The result of a sensitive test is believable if negative (it rules the disease out)
○ Follow up a positive with a specific (confirmatory) test to rule out false positives

Positive Predictive Value


How many of the positive results are truly positive = True Positives / (True Positives + False Positives)
Dependent on the actual prevalence of the disease
Higher when prevalence is higher

At the extreme end of the normal curve (the most abnormal test values), essentially every positive is a true positive, so the PPV approaches 1

Negative Predictive Value


How many of the negative results are truly negative = True Negatives / (True Negatives + False Negatives)
Dependent on the actual prevalence of the disease
Higher when prevalence is lower
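A short sketch (made-up counts for a hypothetical screen of 1000 people) putting the four formulas above side by side:

```python
# Hedged sketch: sensitivity, specificity, PPV and NPV from the four result types.
TP, FP, FN, TN = 90, 40, 10, 860   # made-up counts

sensitivity = TP / (TP + FN)   # 0.90 -> catches 90% of the diseased
specificity = TN / (TN + FP)   # ≈ 0.96 -> correctly clears ~96% of the healthy
ppv = TP / (TP + FP)           # ≈ 0.69 -> a positive result is right ~69% of the time
npv = TN / (TN + FN)           # ≈ 0.99 -> a negative result is almost always right

print(sensitivity, round(specificity, 2), round(ppv, 2), round(npv, 2))
# With a rarer disease (smaller TP and FN relative to FP and TN), sensitivity and
# specificity stay the same, but PPV falls and NPV rises, as noted above.
```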
30

| Accuracy | Precision |
| --- | --- |
| Aka VALIDITY | Aka RELIABILITY |
| How closely a measurement matches reality | How close repeat measurements are to each other |
| | More precise tests = smaller SD of the normal distribution of repeated results |
| Reduced by systematic error (eg a difference of 10 mmHg due to a wrong cuff size) | Reduced by random measurement error |

ROC Curves
“Receiver Operating Characteristic”
Dependent on the test itself (not on prevalence)

A graph is plotted of sensitivity against 1 − specificity

Sensitivity is the true positive rate
1 − specificity is the false positive rate

The more the curve bends toward the top left, the more sensitivity you can gain with minimal loss of specificity
The greater the AUC (area under the curve), the better the test
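A bare-bones sketch of how a ROC curve is traced: for each candidate cutoff, compute the true positive rate (sensitivity) and the false positive rate (1 − specificity). Scores and disease labels are made up; numpy is assumed available:

```python
# Hedged sketch: tracing ROC points by sweeping the cutoff.
import numpy as np

scores  = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])   # hypothetical test values
disease = np.array([0,   0,   0,   1,   0,   1,   1,   1  ])   # 1 = truly diseased

for cutoff in np.unique(scores):
    positive = scores >= cutoff                  # result called positive at this cutoff
    tpr = positive[disease == 1].mean()          # sensitivity (true positive rate)
    fpr = positive[disease == 0].mean()          # 1 - specificity (false positive rate)
    print(f"cutoff {cutoff:.1f}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")

# Lowering the cutoff raises sensitivity at the cost of more false positives; the area
# under the resulting curve (AUC) summarises how well the test separates the groups.
```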

Likelihood ratios
Skip this shit

Bias

1. Selection Bias
Bias in selection of group or retention of patients

a. Sampling Bias
i. Patients selected for the study are not representative
ii. Study findings may not apply to the population
b. Attrition Bias
i. Seen in prospective studies
ii. Patients lost to follow up unequally
iii. Eg: smokers may die during follow-up, so the morbidity observed in the remaining smoking population appears lower
c. Berkson’s Bias
i. Hospitalized patients are chosen to be either case or control
ii. Usually have more severe symptoms
iii. Also have better access to care
iv. Alter the results of the study
d. Participation Bias / Non-response Bias
i. Optional surveys often see this
ii. Due to convenience sampling
iii. Those who choose not to respond are not included
iv. Respondents may only represent a specific group
e. Prevalence Bias (Neyman Bias)
i. Exposure occurs long before disease assessment
ii. Patients may die quickly
iii. Prevalence calculated based on survivors
f. Length Time Bias
i. Patients with rapidly progressive, severe disease die quickly
ii. These patients do not get studied, so slowly progressive cases are over-represented
iii. Eg: a study of HIV+ patients may suggest the disease is mostly asymptomatic (the severely affected have already died)
g. Lead Time Bias
i. Screening test identifies disease earlier than in the general population
ii. Survival may appear longer than it is
2. Measurement Bias
a. Recall Bias
i. Inaccurate recall of past events by subjects of study
ii. Common in surveys

b. Observer Bias
i. Investigator knows the status of the patient
ii. Avoided by blinding
c. Procedure Bias
i. A group receiving a procedure is more likely to get extra care and attention
ii. Avoid by:
1. Blinding of the care team
2. Placebo surgery
d. Hawthorne Effect
i. Patients or care providers change behaviour patterns because they are
being observed
ii. May result in improvement of condition
e. Pygmalion Effect (Observer Expectancy effect)
i. The researcher believes in the efficacy of the treatment
ii. This influences the study outcome
3. Confounding Bias
a. An unmeasured factor confounds the study result
Example
b. Alcoholics and Lung cancer
i. Alcoholics show a higher prevalence
ii. Smoking is more prevalent in alcoholics
iii. Smoking is the true cause of the cancer prevalence
c. Stratified analysis
i. Eliminates Confounding bias
ii. Done by further dividing the sample into 2 as per the confounding factor

1. Without stratification, the alcoholics showed a high risk ratio due to the increased prevalence of smoking amongst them
2. If the population is divided into smokers and non-smokers, the
following is seen:
a. Among smokers, there is no increased risk due to alcohol
b. Amongst non-smokers, there is no increased risk due to
alcohol
3. Thus, alcohol itself is not causing the increased risk
d. Controlling for confounders (prevention of confounding)
i. Randomization
1. Ensures equal variable distribution in both arms
ii. Matching
1. Done in Case-Control studies
2. Each case subject is matched with a very similar control

Minimizing Bias
1. Randomization
2. Blinding
3. Matching
4. Crossover studies
a. Two groups
b. But each subject is also their own control
c. Can be done in therapy studies
d. First time period:
i. Group A given drug
ii. Group B given placebo
e. Wash out period to prevent carry over of drug effect
f. Second time period:
i. Group A given placebo
ii. Group B given drug
g. Avoids confounding factors entirely

Effect Modification
● Not a kind of bias
● Some third factor alters the effect of exposure
● Eg: Presence of a gene X determines whether a drug A causes DVT
● Gene X is the effect modifier
● Eliminated via stratified analysis

● Stratify as per effect modifier


● Group with effect modifier shows elevated risk ratio on exposure

Clinical Trials
1. If you want to test a drug
a. Give the drug to the vulnerable population
b. See if there's an effect
2. This has problems
a. Any improvement seen may not be a real effect of the drug (eg the natural course of disease)
b. Placebo effect
c. Behaviours change due to knowledge of receiving the drug
3. Clinical trial features
a. Control
i. A group receiving the drug is compared with a group that does not
(placebo only given on blinding. No blinding → give nothing)
ii. Ensures effects not due to chance
b. Randomization
i. Equal distribution of variables
ii. Prevents confounding
iii. Limits selection bias (participants can't choose drug or placebo)
iv. The two groups must be shown to not be significantly different at baseline
c. Blinding
i. Control subjects given placebo
ii. Single blind: Subject doesn't know if drug or placebo
iii. Double blind: Subject and provider unaware
iv. Triple blind: above + data analysts don't know which group received the drug
4. Clinical trials not used for everything
a. Expensive
b. Long duration
c. New treatment options may emerge
d. Parachute example
i. Some treatments have obvious effects
ii. Unethical to perform an experiment on whether parachutes decrease
mortality on jumping off a plane
5. Data and reporting
Example:
3-year mortality with drug X = 20%
3-year mortality with control = 50%
a. Absolute risk reduction = 50% − 20% = 30%
b. Relative risk reduction = 30/50 = 60% relative reduction
c. Number needed to treat
i. The number of patients to be treated to prevent 1 adverse event
ii. = 1 / Absolute risk reduction = 1/0.30 ≈ 3.3 (see the sketch at the end of this section)
6. Meta Analysis
a. Study that puts together many individual studies' data
b. Increases the number of subjects and controls

c. Increases the statistical power


d. Problems
i. Differing selection criteria
ii. Treatment may be given differently
iii. Selection bias in some studies
7. New drug approval
After positive animal (preclinical) studies:
a. Phase 1 trial
i. Clinical trial in healthy volunteers
1. Exception: Cancer therapy
ii. Determines
1. Safety
2. Toxicity
3. Pharmacokinetics
b. Phase 2
i. Small number of sick patients
ii. Determines:
1. Efficacy
2. Dosing
3. Side effects
iii. Placebo controlled and blinded
c. Phase 3
i. Large number of sick patients
ii. Usually multicentered
iii. Mainly determines efficacy vs placebo or standard care
iv. After this phase, drugs are approved by the drug controller
d. Phase 4
i. Post marketing surveillance
ii. Uses
1. Monitors long term effects
2. Different population of patients than those tested on may be using
the drug
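To close the section, a one-screen sketch of the ARR/RRR/NNT arithmetic from the drug X example under "Data and reporting" above:

```python
# Hedged sketch: absolute risk reduction, relative risk reduction, number needed to treat.
event_rate_drug    = 0.20   # 3-year event rate with drug X (from the example above)
event_rate_control = 0.50   # 3-year event rate with control

arr = event_rate_control - event_rate_drug   # absolute risk reduction = 0.30
rrr = arr / event_rate_control               # relative risk reduction = 0.60
nnt = 1 / arr                                # ≈ 3.3 -> treat ~4 patients to prevent one event

print(arr, rrr, round(nnt, 1))
```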

Evidence Based Medicine


1. Care for patient based on best available research
2. Four basic components:
○ Formulate a research question
■ Should be specific
■ Should be answerable from literature
■ PICO model
● Population and characteristics
● Intervention being considered
● Compared to?
● Outcome wanted - 4 kinds
○ Hard outcomes (objective)
■ Easily identifiable and important to patients
■ Death rate
■ Amputation rate
○ Soft outcome (subjective)
■ Improved quality of life
○ Surrogate outcome
■ Predictive of a hard outcome
■ Eg: HbA1C levels predictive of diabetes
complications
■ Advantages
● Easier to measure
■ Disadvantage
● Can lead to erroneous findings
○ Composite outcomes
■ Pool of multiple outcomes
■ Increases statistical power
■ Sometimes one component may drive the composite outcome (risking misrepresentation in advertisement)

○ Identify the best available evidence


■ Types of evidence:
● Primary Resources (earliest)
○ Case reports or Case series
○ Observational studies
○ RCTs
● Systematic reviews and Meta analyses
○ Compilation of primary studies
● Society guidelines (last)
○ Based on primary and secondary research

■ Pyramid of evidence

○ Assess the evidence


■ Internal validity
● Research conducted properly?
● Conclusions correct?
● Is there bias?
● Is the result due to chance or significant?
■ External Validity
● Are study patients similar to real world patients?
● Do we actually perform this intervention in the real world?
● Does it apply to my patient?
○ Apply the evidence in practice
■ Factor in your clinical expertise
■ Factor in your patient wishes

Fin.
