Correlation and Regression
Professor of Biostatistics, Department of Epidemiology; Director, Data Coordinating Center; College of Human Medicine, Michigan State University
Example
A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns.
The following data set provides information on the 15 pregnant mothers who were contacted for this study:

BMI (kg/m²):        20   30   50   45   10   30   40   25   50   20   10   55   60   50   35
Birth weight (kg):  2.7  2.9  3.4  3.0  2.2  3.1  3.3  2.3  3.5  2.5  1.5  3.8  3.7  3.1  2.8
Scatter Diagram
- A scatter diagram is a graphical method for displaying the relationship between two variables.
- A scatter diagram plots pairs of bivariate observations (x, y) on the X-Y plane.
Correlation Coefficient, R
- R is a measure of the strength of the linear association between two variables, x and y.
- Most statistical packages and some hand calculators can calculate R.
- For the data in our Example, R = 0.94.
- R has some unique characteristics:
- R takes values between -1 and +1.
- R = 0 represents no linear relationship between the two variables.
- R > 0 implies a direct linear relationship.
- R < 0 implies an inverse linear relationship.
- The closer R comes to either +1 or -1, the stronger the linear relationship.
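As a minimal sketch of what a statistical package does here, the following Python computes R from its definition, R = Sxy / sqrt(Sxx * Syy), using the 15 pairs listed in the Example. The function name `pearson_r` is illustrative and not from the slides. If the printed value differs from the figure quoted above, the data as listed may not match the original exactly.

```python
from math import sqrt

# Data from the Example: BMI (kg/m^2) and birth weight (kg) for 15 mothers.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def pearson_r(x, y):
    """Pearson correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(bmi, bw)
print(f"R = {r:.3f}")
```

A perfectly linear increasing pair such as x = 1, 2, 3 and y = 2, 4, 6 gives R = +1, and reversing y gives R = -1, which matches the limits described above.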
Coefficient of Determination
- R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1).
- R² measures the proportion of the total variation in y that is explained by x.
- For example, R² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
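To make "proportion of variation explained" concrete, here is a hedged sketch that fits a least-squares line BW = a + b * BMI to the Example data and computes R² as 1 - SSE/SST. The least-squares fit is an assumption about how the explained variation is obtained, and the names `fit_line` and `r_squared` are illustrative. The result depends on the data as listed and need not equal the illustrative 0.8751.

```python
# Data from the Example: BMI (kg/m^2) and birth weight (kg) for 15 mothers.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """1 - SSE/SST: share of the total variation in y explained by the line."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


intercept, slope = fit_line(bmi, bw)
r2 = r_squared(bmi, bw, intercept, slope)
print(f"BW = {intercept:.3f} + {slope:.4f} * BMI, R^2 = {r2:.4f}")
```

For a simple linear regression with one independent variable, this R² equals the square of the correlation coefficient R.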
The following data consist of age (in years) and the presence or absence of evidence of significant coronary heart disease (CHD) in 100 persons. The code sheet for the data is as follows:
Serial No.  Variable name  Variable description    Codes/values
1.          ID             Identification no.      ID number (unique)
2.          AGRP           Age group               1 = 20-29; 2 = 30-34; 3 = 35-39; 4 = 40-44; 5 = 45-49; 6 = 50-54; 7 = 55-59; 8 = 60-69
3.          AGE            Age                     in years
4.          CHD            Coronary heart disease  0 = Absent; 1 = Present
ID    AGRP  AGE  CHD
1     1     20   0
2     1     23   0
3     1     24   0
4     1     25   0
5     1     25   1
6     1     26   0
7     1     26   0
8     1     28   0
...
99    8     65   1
100   8     69   1
CHD        (age category)    Total
Absent     32      25        57
Present     7      36        43
Total      39      61        100
Chi-Square Tests

                              Value       df  Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square            17.610 (b)  1   .000
Continuity Correction (a)     15.919      1   .000
Likelihood Ratio              18.706      1   .000
Fisher's Exact Test                                                  .000                  .000
Linear-by-Linear Association  17.434      1   .000
N of Valid Cases              100

a. Computed only for a 2x2 table.
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 17.16.
- Odds Ratio = 0.14, with 95% confidence interval (0.05, 0.41)
- Relative Risk = 0.30, with 95% confidence interval (0.15, 0.60)
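The sketch below shows one way such measures can be computed from a 2x2 table, using the counts as they appear in the table above (7 present and 32 absent in the first column, 36 present and 25 absent in the second). The slides do not say how the confidence interval was obtained; the log-based (Woolf) interval used here is an assumption, and the function names are illustrative. Because the counts are taken from the table as transcribed, the results need not reproduce the quoted values to the last digit.

```python
from math import exp, log, sqrt

# Counts from the 2x2 table above (rows: CHD present/absent; columns: age category).
present_1, absent_1 = 7, 32    # first column, total 39
present_2, absent_2 = 36, 25   # second column, total 61


def odds_ratio(a, b, c, d):
    """Odds of CHD in column 1 divided by odds of CHD in column 2."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Proportion with CHD in column 1 divided by that in column 2."""
    return (a / (a + b)) / (c / (c + d))


def woolf_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI for the odds ratio on the log scale (assumed method)."""
    log_or = log(odds_ratio(a, b, c, d))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or - z * se), exp(log_or + z * se)


or_value = odds_ratio(present_1, absent_1, present_2, absent_2)
rr_value = relative_risk(present_1, absent_1, present_2, absent_2)
ci_low, ci_high = woolf_ci(present_1, absent_1, present_2, absent_2)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}, {ci_high:.2f}), RR = {rr_value:.2f}")
```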
What about a situation in which you do not want to categorize age?
[Figure: Plot of CHD by AGE. Vertical axis from -0.2 to 1.2; horizontal axis from 10 to 70.]
What we are actually interested in is whether the probability of having CHD increases with age.
Age group   n     CHD absent   CHD present   Mean (proportion) = Present/n
20-29       10     9            1            1/10 = 0.10
30-34       15    13            2            2/15 = 0.13
35-39       12     9            3            3/12 = 0.25
40-44       15    10            5            5/15 = 0.33
45-49       13     7            6            6/13 = 0.46
50-54        8     3            5            5/8 = 0.63
55-59       17     4           13            13/17 = 0.76
60-69       10     2            8            8/10 = 0.80
Total      100    57           43            43/100 = 0.43
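The proportions above can be reproduced with a few lines of Python. This is only an illustrative sketch; the list names are not from the slides, and the groups are indexed by the AGRP codes 1 to 8.

```python
# Counts per age group (AGRP 1..8) from the table above.
n_total = [10, 15, 12, 15, 13, 8, 17, 10]
n_present = [1, 2, 3, 5, 6, 5, 13, 8]

# Proportion with CHD in each group, and overall.
proportions = [present / n for present, n in zip(n_present, n_total)]
overall = sum(n_present) / sum(n_total)

for group, (present, n, prop) in enumerate(zip(n_present, n_total, proportions), start=1):
    print(f"AGRP {group}: {present}/{n} = {prop:.2f}")
print(f"Overall: {sum(n_present)}/{sum(n_total)} = {overall:.2f}")

# Check whether the proportion rises from each age group to the next.
is_increasing = all(a < b for a, b in zip(proportions, proportions[1:]))
print("Proportion increases with age group:", is_increasing)
```

Because the mean of a 0/1 variable is the proportion of ones, these group means estimate the probability of CHD within each age group.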
Logistic Regression
- Logistic regression is used when the outcome variable is categorical.
- The independent variables can be either categorical or continuous.
- The slope coefficient in the logistic regression model is related to the odds ratio (OR).
- A multiple logistic regression model can be used to adjust for the effect of other variables when assessing the association between the exposure (E) and disease (D) variables.
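The link between the slope and the OR can be illustrated with a small sketch. Assuming the model logit(p) = b0 + b1 * x with a single 0/1 independent variable, the fitted slope satisfies exp(b1) = OR. The code below fits this model by Newton-Raphson to the 2x2 CHD counts as transcribed earlier; the 0/1 coding of the two columns, the function name `fit_logistic` and the fixed iteration count are assumptions made for illustration, not the method used in the slides.

```python
from math import exp, log

# Grouped data from the 2x2 table: (x, number of persons, number with CHD).
# x = 1 marks the first column and x = 0 the second (assumed coding).
groups = [(0, 61, 36), (1, 39, 7)]


def fit_logistic(groups, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on grouped binary data."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, n, y in groups:
            p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            g0 += y - n * p
            g1 += (y - n * p) * x
            w = n * p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information) * step = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(groups)
cross_product_or = (7 * 25) / (32 * 36)
print(f"b1 = {b1:.4f}, exp(b1) = {exp(b1):.4f}, cross-product OR = {cross_product_or:.4f}")
```

With a continuous independent variable such as AGE in years, exp(b1) is interpreted as the odds ratio for a one-unit increase in that variable.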