ISYE6414 HW1 Solutions
Part A. ANOVA
Additional Material: ANOVA tutorial
https://datascienceplus.com/one-way-anova-in-r/
Jet lag is a common problem for people traveling across multiple time zones, but people can gradually adjust
to the new time zone since the exposure of their eyes to the shifted light schedule resets the internal
circadian rhythm in a process called “phase shift”. Campbell and Murphy (1998) in a highly controversial
study reported that the human circadian clock can also be reset by only exposing the back of the knee
to light, with some hailing this as a major discovery and others challenging aspects of the experimental
design. The table below is taken from a later experiment by Wright and Czeisler (2002) that re-examined
the phenomenon. The new experiment measured circadian rhythm through the daily cycle of melatonin
production in 22 subjects randomly assigned to one of three light treatments. Subjects were woken from
sleep and for three hours were exposed to bright lights applied to the eyes only, to the knees only or to
neither (control group). The effects of treatment on the circadian rhythm were measured two days later by
the magnitude of phase shift (measured in hours) in each subject’s daily cycle of melatonin production. A
negative measurement indicates a delay in melatonin production, a predicted effect of light treatment, while
a positive number indicates an advance.
Raw data of phase shift, in hours, for the circadian rhythm experiment
Question A1 - 3 pts
Fill in the missing values in the analysis of the variance table. Note: Missing values can be calculated using
the corresponding formulas provided in the lectures, or you can build the data frame in R and generate the
ANOVA table using the aov() function. Either approach will be accepted.
Source       Df   Sum of Squares   Mean Squares   F-statistic   p-value
Treatments    2        7.224          3.6122          7.29
Error        19        9.415          0.4955
TOTAL        21       16.639
$$\mathrm{Df}_{\text{Treatments}} = k - 1 = 2$$
$$\mathrm{Df}_{\text{Error}} = N - k = 22 - 3 = 19$$
$$\mathrm{Df}_{\text{Total}} = \mathrm{Df}_{\text{Treatments}} + \mathrm{Df}_{\text{Error}} = (k - 1) + (N - k) = N - 1 = 21$$
$$SSTR = MSTR \times (k - 1) = 3.6122 \times 2 = 7.224$$
$$SST = SSE + SSTR = 9.415 + 7.224 = 16.639$$
$$MSE = SSE / (N - k) = 9.415 / 19 = 0.4955$$
$$F = MSTR / MSE = 3.6122 / 0.4955 = 7.29$$
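The arithmetic above can be reproduced in a few lines of R. This is a minimal sketch that starts from the quantities quoted in the solution (N, k, MSTR and SSE); the variable names are illustrative and not part of the assignment.

```r
# Quantities taken from the solution above.
N    <- 22       # total number of subjects
k    <- 3        # number of treatment groups
mstr <- 3.6122   # mean square for treatments
sse  <- 9.415    # error sum of squares

# Degrees of freedom.
df_treat <- k - 1
df_error <- N - k
df_total <- df_treat + df_error

# Remaining table entries.
sstr   <- mstr * df_treat   # treatment sum of squares
sst    <- sstr + sse        # total sum of squares
mse    <- sse / df_error    # mean squared error
f_stat <- mstr / mse        # F-statistic

anova_table <- data.frame(
  Source = c("Treatments", "Error", "TOTAL"),
  Df     = c(df_treat, df_error, df_total),
  SumSq  = c(sstr, sse, sst),
  MeanSq = c(mstr, mse, NA),
  F      = c(f_stat, NA, NA)
)
print(anova_table, digits = 4)
```

With the raw data available, aov() would produce the same table directly, as the question notes.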
Question A2 - 3 pts
Use µ1, µ2, and µ3 as notation for the three mean parameters and define these parameters clearly based
on the context of the topic above (i.e., explain what µ1, µ2, and µ3 mean in words in the context of this
problem). Find the estimates of these parameters.
• µ1: true mean phase shift for subjects in the Control group. Its estimate, µ̂1, is -0.3088
• µ2: true mean phase shift for subjects in the Knees group. Its estimate, µ̂2, is -0.3357
• µ3: true mean phase shift for subjects in the Eyes group. Its estimate, µ̂3, is -1.5514
Question A3 - 5 pts
Use the ANOVA table in Question A1 to write the:
H0: µ1 = µ2 = µ3
HA: At least two of the means are not equal (µ1 ≠ µ2 and/or µ1 ≠ µ3 and/or µ2 ≠ µ3)
c. 1 pts Fill in the blanks for the degrees of freedom of the ANOVA F-test statistic:
F(k − 1, N − k) = F(2, 19)
e. 1 pts According to the results of the ANOVA F-test, does light treatment affect phase shift? Use an
α-level of 0.05.
We reject the null hypothesis that all three means are equal because the p-value is much smaller than 0.05.
Therefore, the mean of the phase shift is not the same for all three treatment groups, and we conclude that
light treatment does affect phase shift.
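The decision can be checked numerically from the F-statistic and its degrees of freedom. The sketch below uses the rounded F-statistic from Question A1 and assumes nothing beyond base R; the variable names are illustrative.

```r
# Values from the ANOVA table in Question A1.
f_stat   <- 7.29
df_treat <- 2
df_error <- 19
alpha    <- 0.05

# Upper-tail probability of the F(2, 19) distribution at the observed statistic.
p_value <- pf(f_stat, df_treat, df_error, lower.tail = FALSE)

# Equivalent check: compare the statistic with the critical value at level alpha.
f_crit <- qf(1 - alpha, df_treat, df_error)

reject_h0 <- p_value < alpha
cat("p-value:", signif(p_value, 3), "\n")
cat("critical value:", round(f_crit, 3), "\n")
cat("reject H0:", reject_h0, "\n")
```

Both routes lead to the same decision: the p-value falls below α exactly when the F-statistic exceeds the critical value.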
Part B. Simple Linear Regression
We are going to use regression analysis to estimate the performance of CPUs based on the maximum number
of channels in the CPU. This data set comes from the UCI Machine Learning Repository.
The data file includes the following columns:
The data is in the file “machine.csv”. To read the data in R, save the file in your working directory (make
sure you have changed the directory if different from the R working directory) and read the data using the
R function read.csv().
# Import libraries
library(ggplot2)
library(ggpubr)
library(car)
a. 3 pts Use a scatter plot to describe the relationship between CPU performance and the maximum
number of channels. Describe the general trend (direction and form). Include plots and R-code used.
[Figure: scatter plot "Maximum Channels vs CPU Performance"; x-axis: Maximum Channels, y-axis: CPU Performance]
There seems to be a positive, linear relationship of moderate strength between CPU performance and the
maximum number of channels: CPU performance generally increases as the maximum number of channels
increases. The variance of CPU performance also appears to increase as the maximum number of channels
increases.
b. 3 pts What is the value of the correlation coefficient between performance and chmax? Please interpret
the strength of the correlation based on the correlation coefficient.
## [1] 0.6052093
The correlation coefficient of 0.6052093 suggests that we have a moderate positive linear relationship between
chmax and performance.
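As a quick consistency check, in simple linear regression the squared correlation between the predictor and the response equals the model's R-squared. A minimal sketch using the correlation reported above (the variable names are illustrative):

```r
# Correlation between chmax and performance, as reported above.
r <- 0.6052093

# In simple linear regression, R-squared is the square of this correlation.
r_squared <- r^2
cat("Squared correlation:", round(r_squared, 4), "\n")
```

The result should agree, up to rounding, with the Multiple R-squared that summary() reports for the untransformed fit of performance on chmax.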
c. 2 pts Based on this exploratory analysis, would you recommend a simple linear regression model for
the relationship?
I would recommend attempting a simple linear regression model because it is easy to interpret, but we are
likely going to want to attempt a Box-Cox transformation to reduce the heteroskedasticity.
Note: Any other logical answer is acceptable for full credit.
d. 1 pts Based on the analysis above, would you pursue a transformation of the data?
Yes, I would recommend transforming the data using a Box-Cox transformation because of the
heteroskedasticity in CPU performance as maximum channels increases.
Note: Any other logical answer is acceptable for full credit.
Fit a linear regression model, named model1, to evaluate the relationship between performance and the
maximum number of channels. Do not transform the data. The function you should use in R is lm(), with
the call shown in the summary output below: lm(performance ~ chmax, data = data).
a. 3 pts What are the model parameters and what are their estimates?
summary(model1)
##
## Call:
## lm(formula = performance ~ chmax, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486.47 -42.20 -22.20 20.31 867.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2252 10.8587 3.428 0.000733 ***
## chmax 3.7441 0.3423 10.938 < 2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 128.3 on 207 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.3632
## F-statistic: 119.6 on 1 and 207 DF, p-value: < 2.2e-16
# Estimated error variance: the square of the residual standard error
sigsq = summary(model1)$sigma^2
b. 2 pts Write down the equation for the simple linear regression model.
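Taking the estimates from the summary(model1) output above, the model and its fitted line can be sketched as follows. Here ε denotes the error term, under the usual assumptions of mean zero and constant variance σ²; the variance estimate is the square of the reported residual standard error.

```latex
\text{performance}_i = \beta_0 + \beta_1 \,\text{chmax}_i + \varepsilon_i

\widehat{\text{performance}} = 37.2252 + 3.7441 \times \text{chmax},
\qquad \hat{\sigma}^2 = 128.3^2 \approx 16461
```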
c. 2 pts Interpret the estimated value of the β1 parameter in the context of the problem.
A one-unit increase in the maximum number of channels is associated with an increase of 3.7441 units in
CPU performance, on average.
d. 2 pts Find a 95% confidence interval for the β1 parameter. Is β1 statistically significant at this level?
confint(model1)['chmax',]
## 2.5 % 97.5 %
## 3.069251 4.418926
The 95% confidence interval has a lower bound of 3.069251 and an upper bound of 4.418926. Given that the
confidence interval does not include zero, β1 is statistically significant at this level.
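The interval returned by confint() can be reproduced by hand as estimate ± t-quantile × standard error. The sketch below uses the rounded values from the summary(model1) output, so it should match the interval above only up to rounding; the variable names are illustrative.

```r
# Values read from the summary(model1) output.
beta1_hat  <- 3.7441   # slope estimate for chmax
se_beta1   <- 0.3423   # its standard error
df_resid   <- 207      # residual degrees of freedom
conf_level <- 0.95

# Two-sided critical value of the t distribution.
t_crit <- qt(1 - (1 - conf_level) / 2, df_resid)

# Confidence interval: estimate plus/minus margin of error.
ci <- beta1_hat + c(-1, 1) * t_crit * se_beta1
names(ci) <- c("lower", "upper")
print(round(ci, 4))

# The slope is significant at this level if the interval excludes zero.
excludes_zero <- ci["lower"] > 0 || ci["upper"] < 0
cat("Interval excludes zero:", excludes_zero, "\n")
```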
e. 2 pts Is β1 statistically significantly positive at an α-level of 0.01? What is the approximate p-value
of this test?
• H0 : β1 ≤ 0
• HA : β1 > 0
We need to conduct a one-sided t-test on β1. We can extract the t-value from the summary table of model1
and the degrees of freedom from model1. We then evaluate the upper-tail probability of the t distribution,
since we are testing whether β1 is positive.
## [1] 1.423882e-22
The p-value is 1.423882 × 10⁻²², which is approximately equal to zero. Since this value is less than the
α-level of 0.01, we conclude that β1 is statistically significantly positive.
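A minimal sketch of the upper-tail calculation described above, using the rounded t-value and the residual degrees of freedom from the summary(model1) output. Because the t-value is rounded, the result will be close to, but not exactly, the p-value printed above.

```r
# Values read from the summary(model1) output.
t_value  <- 10.938   # t statistic for chmax
df_resid <- 207      # residual degrees of freedom
alpha    <- 0.01

# With the fitted model available, the same quantities could be extracted as
#   summary(model1)$coefficients["chmax", "t value"]  and  model1$df.residual

# One-sided test of H0: beta1 <= 0 against HA: beta1 > 0 uses the upper tail.
p_one_sided <- pt(t_value, df_resid, lower.tail = FALSE)

cat("One-sided p-value:", format(p_one_sided, digits = 4), "\n")
cat("Significantly positive at alpha = 0.01:", p_one_sided < alpha, "\n")
```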
Create and interpret the following graphs with respect to the assumptions of the linear regression model.
In other words, comment on whether there are any apparent departures from the assumptions of the linear
regression model. Make sure that you state the model assumptions and assess each one. Each graph may be
used to assess one or more model assumptions.
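For orientation, the three residual diagnostics used in this question can be drawn with base R as sketched below. The data here are simulated stand-ins (not the machine.csv data), generated so that the error spread grows with the predictor; the names x, y and fit are illustrative.

```r
# Simulated stand-in data with non-constant error variance (NOT the CPU data).
set.seed(123)
n <- 100
x <- runif(n, 0, 50)
y <- 20 + 3 * x + rnorm(n, sd = 1 + 0.5 * x)

fit <- lm(y ~ x)
res <- residuals(fit)   # estimated errors
fv  <- fitted(fit)      # fitted values

op <- par(mfrow = c(1, 3))
# Residuals vs fitted: checks linearity/mean zero, constant variance, uncorrelated errors.
plot(fv, res, xlab = "Fitted Values", ylab = "Residuals", main = "Residual Plot")
abline(h = 0, lty = 2)
# Histogram and normal Q-Q plot: check the normality assumption.
hist(res, xlab = "Residuals", main = "Histogram of Residuals")
qqnorm(res, main = "Q-Q Plot of Residuals")
qqline(res)
par(op)
```

A funnel shape in the first panel points to non-constant variance, and strong departures from the reference line in the Q-Q plot point to non-normal errors.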
a. 2 pts Scatterplot of the data with chmax on the x-axis and performance on the y-axis
[Figure: scatter plot "Maximum Channels vs CPU Performance"; x-axis: Maximum Channels, y-axis: CPU Performance]
Model Assumption(s) it checks: Linearity/Mean Zero, Independence (“Uncorrelated errors”) and
Constant Variance.
Note: Any one of the assumptions above is acceptable for full credit.
Interpretation: The points at large values of chmax are not evenly distributed around the model line.
For instance, between 40 and 150 maximum channels nearly all of the data points fall below the
model line. This suggests that the linearity assumption may not hold.
b. 3 pts Residual plot - a plot of the residuals, ε̂i, versus the fitted values, ŷi
[Figure: "Residual Plot"; x-axis: Fitted Values, y-axis: Residuals]
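# Histogram of the model1 residuals
# (qplot() is deprecated as of ggplot2 3.4.0, so the plot is built with ggplot())
hp = ggplot(data.frame(resid = model1$residuals), aes(x = resid)) +
  geom_histogram(binwidth = 100, fill = "blue", alpha = 0.2) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Count")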
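# Q-Q plot of the model1 residuals.
# The opening ggplot() call is a reconstruction; qqp is a hypothetical name.
qqp = ggplot(data.frame(resid = model1$residuals), aes(sample = resid)) +
  stat_qq(alpha = 0.2, color = 'darkorange') +
  stat_qq_line() +
  ggtitle("Q-Q Plot of Residuals")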
[Figure: histogram of residuals (x-axis: Residuals, y-axis: Count) and Q-Q plot of residuals]
a. 2 pts Use a Box-Cox transformation (boxCox() in the car package or boxcox() in the MASS package)
to find the optimal λ value rounded to the nearest 0.5. What transformation of the response,
if any, does it suggest?
[Figure: Box-Cox profile log-likelihood over λ from −2 to 2, with the 95% confidence level marked]
## Optimal lambda: 0
The optimal lambda value is zero, suggesting that the log of the response may improve normality and/or
constant variance.
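To make the procedure concrete, the sketch below computes a Box-Cox profile log-likelihood by hand in base R and rounds the maximizing λ to the nearest 0.5. It is an illustration of the idea behind boxCox()/boxcox(), not the code used for the answer above: the data are simulated stand-ins (not machine.csv), and boxcox_loglik is a hypothetical helper name.

```r
# Simulated stand-in data: a strictly positive response that is log-linear in x.
set.seed(1)
n <- 200
x <- runif(n, 0, 5)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))

# Profile log-likelihood (up to an additive constant) for one value of lambda.
boxcox_loglik <- function(lambda, y, x) {
  n <- length(y)
  # Box-Cox transform; lambda = 0 corresponds to the log transform.
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(z ~ x))^2)
  # Gaussian log-likelihood term plus the Jacobian of the transformation.
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.1)
loglik  <- sapply(lambdas, boxcox_loglik, y = y, x = x)

lambda_hat  <- lambdas[which.max(loglik)]   # grid value maximizing the profile
lambda_half <- round(lambda_hat * 2) / 2    # rounded to the nearest 0.5

cat("Grid maximizer:", lambda_hat, " rounded to nearest 0.5:", lambda_half, "\n")
```

A rounded value of 0 corresponds to a log transform of the response, 0.5 to a square root, and 1 to no transformation.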
b. 2 pts Create a linear regression model, named model2, that uses the log transformed performance as
the response, and the log transformed chmax as the predictor.
e. 2 pts Compare the R-squared values of model1 and model2. Did the transformation improve the
explanatory power of the model?
r2m1 = summary(model1)$r.squared
r2m2 = summary(model2)$r.squared
## R-squared of model1 is: 0.3662783
The R² value of model1 is 0.366, and the R² value of model2 is 0.410. This indicates that there is an
improvement in the explanatory power of the model.
c. 4 pts Similar to Question B3, assess and interpret all model assumptions of model2. A model is
considered a good fit if all assumptions hold. Based on your interpretation of the model assumptions,
is model2 a good fit?
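# Histogram of the model2 residuals
# (qplot() is deprecated as of ggplot2 3.4.0, so the plot is built with ggplot())
hp2 = ggplot(data.frame(resid = model2$residuals), aes(x = resid)) +
  geom_histogram(binwidth = 0.6, fill = "blue", alpha = 0.2) +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Count")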
[Figure: four diagnostic panels for model2: scatter plot "Maximum Channels vs Performance" (x-axis: Maximum Channels, y-axis: Performance), "Residual Plot" (x-axis: Fitted Values, y-axis: Residuals), histogram of residuals, and Q-Q plot of residuals]
CPU performance appears to be evenly distributed around the model line for all maximum channel values.
This suggests that the linearity/mean zero assumption holds.
The spread of the residuals appears roughly constant across the fitted values in the residual plot. This
suggests that the constant variance assumption holds.
There also does not appear to be any clear pattern or clustering in the residuals. This suggests that the
errors are uncorrelated. We cannot definitively state that the errors are independent because the data came
from an observational study.
Both the histogram and the quantile-quantile plot of the residuals suggest that the normality assumption
holds.
All model assumptions appear to hold using the log-transformed data, so model2 is a good fit.
Suppose we are interested in predicting CPU performance when chmax = 128. Please make a prediction
using both model1 and model2 and provide the 95% prediction interval of each prediction on the original
scale of the response, performance. What observations can you make about the result in the context of the
problem?
newcpu = data.frame(chmax=128)
cat("model1:\n")
## model1:
cat("model2:\n")
## model2:
When there are a maximum of 128 channels in CPU, model1 predicts a CPU performance of 516.4685 with
a lower bound of 252.2519 and an upper bound of 780.6851 for the 95% prediction interval, while model2
predicts a CPU performance of 277.723 with a lower bound of 55.17907 and an upper bound of 1397.813 for
the 95% prediction interval.
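The model1 point prediction can be checked by hand from the rounded coefficient estimates in the summary(model1) output. The small difference from 516.4685 comes from rounding the coefficients.

```latex
\hat{y} = 37.2252 + 3.7441 \times 128 = 37.2252 + 479.2448 = 516.4700 \approx 516.47
```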
We can see that model2, which uses the log transformation, has a wider prediction interval than model1, the
model using the untransformed data. We can also see that the predicted value is much lower using model2
than model1. That said, each model's predicted value falls within the other model's prediction interval.
Based on the goodness of fit assessments, model1’s prediction interval is likely to be inaccurate. Hence,
model2’s prediction interval seems to be much more reliable than model1’s. However, we might need to
split our data set into training and testing sets and calculate prediction accuracy measurements in order to
further evaluate the prediction accuracy of the models.
Note: Any other logical answer is acceptable for full credit.
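The mechanics of getting both intervals onto the original scale can be sketched as follows. This uses simulated stand-in data rather than machine.csv, so the numbers will not match those above; the names sim, fit_raw and fit_log are illustrative, and the stand-in predictor is kept strictly positive so that its log is defined.

```r
# Simulated stand-in data (NOT the CPU data): log-linear relationship.
set.seed(42)
n <- 150
chmax <- runif(n, 1, 150)
performance <- exp(2 + 0.6 * log(chmax) + rnorm(n, sd = 0.5))
sim <- data.frame(chmax, performance)

fit_raw <- lm(performance ~ chmax, data = sim)            # untransformed model
fit_log <- lm(log(performance) ~ log(chmax), data = sim)  # log-log model

newcpu <- data.frame(chmax = 128)

# Prediction interval on the original scale from the untransformed model.
pi_raw <- predict(fit_raw, newcpu, interval = "prediction", level = 0.95)

# The log-log model predicts log(performance); exponentiate the point
# prediction and both bounds to return to the original scale.
pi_log <- exp(predict(fit_log, newcpu, interval = "prediction", level = 0.95))

print(pi_raw)
print(pi_log)
```

Exponentiating the bounds preserves the 95% coverage of the interval, and the back-transformed interval is asymmetric around the point prediction and always positive.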
1. 2 pts Using data2, create a boxplot of performance and vendor, with performance on the vertical axis.
Interpret the plots.
[Figure: "Box Plot of Vendor vs Performance"; x-axis: vendor (honeywell, hp, nas), y-axis: performance]
The box plot above suggests that CPU performance differs between the vendors. The vendor nas appears
to have CPUs with higher performance than either honeywell or hp.
2. 3 pts Perform an ANOVA F-test on the means of the three vendors. Using an α-level of 0.05, can we
reject the null hypothesis that the means of the three vendors are equal? Please interpret.
The p-value of the F-test is 0.00553, which is less than the α-level of 0.05. We reject the null hypothesis that
the mean CPU performance of all three vendors is equal, and conclude that at least two of the means differ
statistically significantly from each other.
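A one-way ANOVA of this kind can be sketched with aov(). The example below runs on simulated stand-in data (not data2), so its p-value is unrelated to the 0.00553 reported above; the names sim and fit are illustrative.

```r
# Simulated stand-in data (NOT data2): three vendors, 12 machines each.
set.seed(7)
sim <- data.frame(
  vendor = factor(rep(c("honeywell", "hp", "nas"), each = 12)),
  performance = c(rnorm(12, mean = 60,  sd = 30),
                  rnorm(12, mean = 50,  sd = 30),
                  rnorm(12, mean = 150, sd = 30))
)

# One-way ANOVA: does mean performance differ across vendors?
fit <- aov(performance ~ vendor, data = sim)
anova_tab <- summary(fit)[[1]]
print(anova_tab)

# Extract the F-test p-value (first row is the vendor term) and compare with alpha.
p_value <- anova_tab[["Pr(>F)"]][1]
cat("Reject H0 at alpha = 0.05:", p_value < 0.05, "\n")
```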
3. 3 pts Perform a Tukey pairwise comparison between the three vendors (TukeyHSD()). Using an α-level
of 0.05, which means are statistically significantly different from each other?
TukeyHSD(model3, "vendor", conf.level = 0.95 )
nas-honeywell and nas-hp are the two pairs of vendors whose means differ statistically significantly
at the 0.05 level, since the p-values of these pairwise comparisons are smaller than the α-level
of 0.05. In fact, both confidence intervals lie entirely on the positive side and do not include zero. In the
context of the problem, we can conclude that the mean CPU performance of nas is significantly higher than
the mean CPU performance of the other two vendors, honeywell and hp.
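Reading significant pairs off a TukeyHSD() result can be sketched as below. The example is self-contained and uses simulated stand-in data (not data2), so the specific differences and p-values are illustrative only; sim and fit are hypothetical names.

```r
# Simulated stand-in data (NOT data2): three vendors, 12 machines each.
set.seed(7)
sim <- data.frame(
  vendor = factor(rep(c("honeywell", "hp", "nas"), each = 12)),
  performance = c(rnorm(12, mean = 60,  sd = 30),
                  rnorm(12, mean = 50,  sd = 30),
                  rnorm(12, mean = 150, sd = 30))
)
fit <- aov(performance ~ vendor, data = sim)

# Tukey pairwise comparisons with a 95% family-wise confidence level.
tuk <- TukeyHSD(fit, "vendor", conf.level = 0.95)
pairs <- tuk$vendor   # matrix with columns diff, lwr, upr, "p adj"
print(pairs)

# A pair differs significantly when its adjusted p-value is below alpha,
# which coincides with its confidence interval excluding zero.
sig_by_p  <- pairs[, "p adj"] < 0.05
sig_by_ci <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0
cat("Significant pairs:", rownames(pairs)[sig_by_p], "\n")
```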