3 Polynomial regression
3.1 Introduction
3.2 R example
3.3 Worked example
3.4 Further reading and exercises
4 Multiple Regression
4.1 Introduction
4.2 Estimation of β — the normal equations
4.3 Properties of Least Squares Estimators
4.4 Testing the Adequacy of Different Models
4.4.1 F-test for the Overall Significance of Regression
4.4.2 F-test for the deletion of a subset of variables
4.5 Prediction
4.6 Selecting a Regression Equation
4.6.1 Problems with choosing the incorrect regression equation
4.6.2 Automatic methods for selecting a regression model
4.6.3 Statistics for Evaluating Regressions
4.7 Outliers and influence
4.8 A check for independence
4.9 Worked examples
4.10 Further reading and exercises
Preface
In this course we are going to extend the ideas of Regression which you met in the first
year. We begin by describing some of the ideas behind any statistical analysis. We then
present a number of useful results. In Chapter 2 we re-examine simple linear regression and
prove the results about it that you have seen. We extend this to polynomial regression in
Chapter 3 and multiple regression in Chapter 4. In Chapter 5 we consider simple designed
experiments, including the analysis of the so-called one-way model, which is an extension of
the two-sample t-test you have met before. We also look at randomised block designs, which
are an extension of the matched-pairs t-test.
Additional reading and exercises are suggested in Linear Statistical Methods: An Applied
Approach, 2nd Edition, by B L Bowerman and R T O'Connell (PWS-Kent), hereafter referred
to as BOC. Other books are as follows:
D C Montgomery and E A Peck, Introduction to Linear Regression Analysis (2nd ed.), Wiley
J Neter, W Wasserman and M H Kutner, Applied Linear Statistical Models (3rd ed.), Irwin
D C Montgomery, Design and Analysis of Experiments (3rd ed.), Wiley
Chapter 1
In this chapter we are going to discuss some of the ideas of Statistical Modelling. We start
with a real life problem for which we have some data. We think of a statistical model as a
mathematical representation of the variables we have measured. This model usually involves
some parameters. We may then try to estimate the values of these parameters. We may
wish to test hypotheses about the parameters. We may wish to use the model to predict
what would happen in the future in a similar situation. In order to test hypotheses or make
predictions we usually have to make some assumptions. Part of the modelling process is to
test these assumptions. Having found an adequate model we must compare its predictions,
etc. with reality to check that it does seem to give reasonable answers.
We can illustrate these ideas using a simple example. Suppose that we are interested in
some items which are manufactured in batches. The size of the batch and the time to make
the batch in man hours are recorded, see Table 1.1. We can plot these points on a scatter
diagram, see Figure 1.1.
From the scatter diagram we can see that a straight line model would be appropriate,
Y = α + βx + error, where α represents the time to set up the equipment to manufacture
a batch and β represents the time taken to make each item. You know that the estimate of
this line is $\hat\alpha + \hat\beta x$, where
$$\hat\alpha = \bar y - \hat\beta\bar x, \qquad \hat\beta = \frac{S_{xy}}{S_{xx}}.$$
We shall derive these results in Chapter 2. It follows that for our data $\hat\alpha = 10$ and $\hat\beta = 2$.
How can we check the assumptions in our model? You have seen before that we can
calculate the residuals $e_i = y_i - (\hat\alpha + \hat\beta x_i)$. Plots of the residuals can be made to check
various assumptions. Note that we should not make predictions from our model outside the
range of data on which it is fitted. For example, in this case batches of size more than eighty
might cause the machinery to overheat and the straight line relationship between y and x
would no longer hold.
x (batch size) y (man-hours)
30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 69
70 148
60 132
Table 1.1: Data on batch size and time to make each batch
[Figure 1.1: scatter diagram of y (man-hours) against x (batch size).]
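The straight-line fit above can be reproduced in R; a minimal sketch using the data of Table 1.1 (the object names are ours):

# Batch-size data from Table 1.1
x <- c(30, 20, 60, 80, 40, 50, 60, 30, 70, 60)        # batch size
y <- c(73, 50, 128, 170, 87, 108, 135, 69, 148, 132)  # man-hours

plot(x, y)              # scatter diagram, cf. Figure 1.1
batch.fit <- lm(y ~ x)  # least squares fit of Y = alpha + beta*x + error
coef(batch.fit)         # gives alpha-hat = 10, beta-hat = 2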
Chapter 2
Yi = α + βxi + εi for i = 1, 2, . . . , n.
Note that the xi ’s are constants in the model, the values of which are either of a controlled
variable (e.g. batch size) or regarded as constant because they are measured with virtually
no error.
I shall follow, as far as possible, the convention that capital letters are used for random
variables and small letters for observed values of the random variable. Thus I write the
model above with a Yi denoting that it is a random variable. This rule is generally followed,
but an exception is made for quantities represented by Greek letters, where both the random
variable and its observed value are represented by the same small letters. Thus εi is a random
variable.
As we discussed in Chapter 1 we have to make some assumptions about the errors εi .
The minimum assumption necessary to justify use of the least squares estimates is that they
have expectation zero, constant variance σ 2 and that they are uncorrelated. In order to
make any statistical inferences about our model, we have to make an assumption about the
form of the distribution of the errors. Consequently we usually assume that the errors are
independent, normal random variables.
The two assumptions together imply that the errors are independent with distribution
εi ∼ N (0, σ 2 ).
Equivalently, using the model, we can say that the observations are independent with dis-
tribution
Yi ∼ N (α + βxi , σ 2 ). (2.1)
The least squares estimates of α and β are obtained by minimising the sum of squared deviations between the observations $y_i$
and the fitted straight line $\alpha + \beta x_i$, given by
$$S = \sum_{i=1}^{n}[y_i - \alpha - \beta x_i]^2.$$
To find the least squares estimators we must minimise S. To do this we find the solutions
to the equations
∂S/∂α = 0 and ∂S/∂β = 0.
Now
$$\frac{\partial S}{\partial\alpha} = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i), \qquad \frac{\partial S}{\partial\beta} = -2\sum_{i=1}^{n}x_i(y_i - \alpha - \beta x_i).$$
Setting these derivatives to zero gives
$$\sum y_i = n\hat\alpha + \hat\beta\sum x_i, \qquad (2.2)$$
$$\sum x_iy_i = \hat\alpha\sum x_i + \hat\beta\sum x_i^2. \qquad (2.3)$$
These equations are called the normal equations. They are two linear equations in two
unknowns and have a unique solution provided $\sum(x_i - \bar x)^2 > 0$. If we multiply (2.3) by $n$
and (2.2) by $\sum x_i$ and subtract we obtain the estimate for β as
$$\hat\beta = \frac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum x_i^2 - (\sum x_i)^2}. \qquad (2.4)$$
We may express (2.4) in other useful ways. For calculating $\hat\beta$ with a calculator we usually use
$$\hat\beta = \frac{\sum x_iy_i - (\sum x_i\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}.$$
The quantity S is the sum of squared distances about the line. The minimum value of S,
denoted by $S_{\min}$, is given by
$$S_{\min} = \sum_{i=1}^{n}(y_i - \hat\alpha - \hat\beta x_i)^2.$$
Substituting for $\hat\alpha = \bar y - \hat\beta\bar x$ gives
$$S_{\min} = \sum_{i=1}^{n}[(y_i - \bar y) - \hat\beta(x_i - \bar x)]^2 = \sum_{i=1}^{n}(y_i - \bar y)^2 - 2\hat\beta\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) + \hat\beta^2\sum_{i=1}^{n}(x_i - \bar x)^2.$$
Substituting for $\hat\beta = S_{xy}/S_{xx}$ gives
$$S_{\min} = S_{yy} - \frac{[S_{xy}]^2}{S_{xx}},$$
where $S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2$.
Note that in this definition $\hat\theta$ is a random variable. We must distinguish between $\hat\theta$ as an
estimate and $\hat\theta$ as an estimator: as a function of the observed values $y_i$ it is an estimate; as a
function of the random variables $Y_i$ it is an estimator.
The parameter estimator $\hat\beta$ can be written as
$$\hat\beta = \sum_{i=1}^{n}a_iY_i, \quad\text{where } a_i = \frac{x_i - \bar x}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{x_i - \bar x}{S_{xx}}. \qquad (2.5)$$
We have assumed that $Y_1, Y_2, \ldots, Y_n$ are normally distributed and hence, using the result that
any linear combination of normal random variables is also a normal random variable, $\hat\beta$ is also
normally distributed. We now derive the mean and variance of $\hat\beta$ using the representation
(2.5).
$$E[\hat\beta] = E\Big[\sum_{i=1}^{n}a_iY_i\Big] = \sum_{i=1}^{n}a_iE[Y_i] = \sum_{i=1}^{n}a_i(\alpha + \beta x_i) = \alpha\sum_{i=1}^{n}a_i + \beta\sum_{i=1}^{n}a_ix_i,$$
but $\sum a_i = 0$ and $\sum a_ix_i = 1$ since $\sum(x_i - \bar x)x_i = S_{xx}$, so $E[\hat\beta] = \beta$. Thus $\hat\beta$ is unbiased for β.
Also
$$\mathrm{Var}[\hat\beta] = \mathrm{Var}\Big[\sum_{i=1}^{n}a_iY_i\Big] = \sum_{i=1}^{n}a_i^2\,\mathrm{Var}[Y_i] \quad\text{since the $Y_i$'s are independent}$$
$$= \sum_{i=1}^{n}\sigma^2(x_i - \bar x)^2/[S_{xx}]^2 = \sigma^2/S_{xx}.$$
Note that
$$\hat\alpha = \bar Y - \hat\beta\bar x = \frac{1}{n}\sum_{i=1}^{n}Y_i - \bar x\sum_{i=1}^{n}a_iY_i = \sum_{i=1}^{n}\Big(\frac{1}{n} - a_i\bar x\Big)Y_i,$$
where $a_i = (x_i - \bar x)/S_{xx}$. Thus the parameter estimator $\hat\alpha = \bar Y - \hat\beta\bar x$ is also a linear
combination of the $Y_i$'s and hence $\hat\alpha$ is normally distributed.
We can also find the mean and variance of $\hat\alpha$.
$$E[\hat\alpha] = E[\bar Y] - E[\hat\beta\bar x] = \alpha + \beta\bar x - \bar xE[\hat\beta] = \alpha + \beta\bar x - \beta\bar x = \alpha.$$
Thus $\hat\alpha$ is also unbiased. Its variance is given by
$$\mathrm{Var}[\hat\alpha] = \mathrm{Var}\Big[\sum_{i=1}^{n}\Big(\frac{1}{n} - a_i\bar x\Big)Y_i\Big] = \sigma^2\sum_{i=1}^{n}\Big[\frac{1}{n^2} - \frac{2(x_i - \bar x)\bar x}{nS_{xx}} + \frac{(x_i - \bar x)^2\bar x^2}{S_{xx}^2}\Big]$$
$$= \sigma^2\Big[\frac{1}{n} - 0 + \frac{\bar x^2}{S_{xx}}\Big] = \sigma^2\Big[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\Big].$$
2.4 Fitted Values and Residuals
The fitted values $\hat\mu_i$, for $i = 1, 2, \ldots, n$, are the estimated values of $E[Y_i] = \alpha + \beta x_i$, so
$$\hat\mu_i = \hat\alpha + \hat\beta x_i.$$
The (crude) residuals are
$$e_i = Y_i - (\hat\alpha + \hat\beta x_i) = Y_i - \bar Y - \hat\beta(x_i - \bar x).$$
Note that
$$\sum_{i=1}^{n}e_i = \sum_{i=1}^{n}(Y_i - \bar Y) - \hat\beta\sum_{i=1}^{n}(x_i - \bar x) = 0$$
and
$$E[e_i] = E[Y_i - \hat\alpha - \hat\beta x_i] = E[Y_i] - E[\hat\alpha] - x_iE[\hat\beta] = \alpha + \beta x_i - \alpha - \beta x_i = 0.$$
Now we know that $\mathrm{Var}[\varepsilon_i] = \sigma^2$ and $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$, whereas it can be shown that
$\mathrm{Var}[e_i] = [1 - 1/n - (x_i - \bar x)^2/S_{xx}]\sigma^2$. So the crude residuals $e_i$ do not quite mimic
the properties of the $\varepsilon_i$.
Then a normalised residual bigger than 2 is surprising and a normalised residual bigger than
3 is very surprising, as we have assumed that the εi are normal with variance σ 2 .
4. i (serial order), e.g. time: a systematic pattern indicates serial correlation, which contradicts $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$.
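Such residual plots are easily produced in R; a sketch, continuing with the hypothetical batch.fit object:

r <- rstandard(batch.fit)          # standardised residuals
plot(fitted(batch.fit), r)         # check constant variance
plot(x, r)                         # check linearity in x
qqnorm(r); qqline(r)               # check normality
plot(seq_along(r), r, type = "b")  # check serial (time) order, as in item 4 above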
2.5 Estimating σ 2
To find $s^2$, an unbiased estimate of $\sigma^2$, we need to calculate $E[S_{\min}]$. Now
$$S_{\min} = \sum(Y_i - \hat\alpha - \hat\beta x_i)^2 = \sum e_i^2.$$
Thus
$$E[S_{\min}] = E\Big[\sum e_i^2\Big] = \sum E[e_i^2] = \sum\mathrm{Var}[e_i] \quad\text{since } E[e_i] = 0$$
$$= \sum\big[1 - 1/n - (x_i - \bar x)^2/S_{xx}\big]\sigma^2 = \sigma^2\Big[n - 1 - \sum(x_i - \bar x)^2/S_{xx}\Big] = \sigma^2(n - 2).$$
Thus we see that $S_{\min}/(n-2)$ is an unbiased estimate of $\sigma^2$. We therefore define the residual
mean square as
$$s^2 = \frac{S_{\min}}{n - 2}.$$
For calculation purposes we usually use
$$S_{\min} = S_{yy} - \frac{[S_{xy}]^2}{S_{xx}}.$$
This expresses the variability between the residuals as the variability among the original
observations less the variability of the fitted values. We can rewrite the expression as
$$S_{yy} = \hat\beta^2S_{xx} + S_{\min}$$
or
Total sum of squares is equal to the sum of squares due to the regression plus
the residual sum of squares.
The total sum of squares
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2$$
is the residual sum of squares for the null model
$$Y_i = \mu + \varepsilon_i, \qquad i = 1, 2, \ldots, n.$$
Thus we are fitting the same constant to each observation. For the null model the fitted
values are $\hat\mu = \bar y$ and hence the residual sum of squares for this model is
$$\sum_{i=1}^{n}(y_i - \bar y)^2 = S_{yy}.$$
Similarly for α, it follows that
$$\frac{\hat\alpha - \alpha}{\sigma\big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{xx}}\big)^{1/2}} \sim N(0, 1).$$
In particular, to test $H_0: \beta = 0$ we use the statistic
$$\frac{\hat\beta}{s/[S_{xx}]^{1/2}},$$
which has a t distribution with n−2 degrees of freedom under $H_0$. Because of the relationship between
a t distribution and an F distribution, namely that if $T \sim t_\nu$ then $T^2 \sim F^1_\nu$, we also have
that
$$\frac{\hat\beta^2S_{xx}}{s^2} \sim F^1_{n-2}.$$
This may be rewritten as the ANOVA table:

Source       SS           d.f.    MS           F
Regression   β̂²Sxx        1       β̂²Sxx        β̂²Sxx/s²
Residual     Smin         n − 2   s²
Total        Syy          n − 1
2.7 Prediction
Suppose that we wish to predict a new value of Y for a given value of $x = x_0$, say. The linear
predictor is
$$\hat Y = \hat\alpha + \hat\beta x_0 = \bar Y + \hat\beta(x_0 - \bar x).$$
There are two cases.
1. We wish to predict the expected response (or average response) at $x_0$.
2. We wish to predict an individual response at $x_0$.
Consider first case 1. Now
$$E[\hat\alpha + \hat\beta x_0] = E[\hat\alpha] + x_0E[\hat\beta] = \alpha + \beta x_0.$$
Also
$$\mathrm{Var}[\hat\alpha + \hat\beta x_0] = \mathrm{Var}[\bar Y + \hat\beta(x_0 - \bar x)] = \mathrm{Var}[\bar Y] + 2\,\mathrm{Cov}[\bar Y, \hat\beta(x_0 - \bar x)] + (x_0 - \bar x)^2\mathrm{Var}[\hat\beta],$$
and since $\mathrm{Cov}[\bar Y, \hat\beta] = 0$, $\mathrm{Var}[\bar Y] = \sigma^2/n$ and $\mathrm{Var}[\hat\beta] = \sigma^2/S_{xx}$,
$$\hat\alpha + \hat\beta x_0 \sim N\!\left(\alpha + \beta x_0,\ \sigma^2\Big(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\Big)\right).$$
Therefore the 100(1 − α)% confidence interval for the predicted mean in case 1 is
$$\hat\alpha + \hat\beta x_0 \pm t_{n-2}\big(\tfrac{\alpha}{2}\big)\,s\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)^{1/2}.$$
For an individual in case 2 we have to take account of not only the variation of the
fitted line about the true line, but also the variation of the individual about the line. There
is therefore an extra variation of $\sigma^2$ to be taken into account. Therefore the 100(1 − α)%
confidence interval for an individual is
$$\hat\alpha + \hat\beta x_0 \pm t_{n-2}\big(\tfrac{\alpha}{2}\big)\,s\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)^{1/2}.$$
Predictions are more precise, i.e. the variance is smaller, the more of the following are true.
1. $s^2$ is small
2. $n$ is large
3. $(x_0 - \bar x)^2$ is small, i.e. $x_0$ is close to $\bar x$
4. $S_{xx}$ is large.
2.8 R example
A fire insurance company wants to relate the amount of fire damage in major residential fires
to the distance between the residence and the nearest fire station. The study was conducted
in a large suburb of a major city; a sample of fifteen recent fires in the suburb was selected.
The amount of damage, y, and the distance, x, between the fire and the nearest fire station
are given in the following table.
x y x y
3.4 26.2 2.6 19.6
1.8 17.8 4.3 31.3
4.6 31.3 2.1 24.0
2.3 23.1 1.1 17.3
3.1 27.5 6.1 43.2
5.5 36.0 4.8 36.4
0.7 14.1 3.8 26.1
3.0 22.3
These data were read into R and analysed using the regression commands. A preliminary
plot of the data indicated that a straight line model was appropriate. The output in the
session window was as follows.
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.4682 -1.4705 -0.1311 1.7915 3.3915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2779 1.4203 7.237 6.59e-06 ***
x 4.9193 0.3927 12.525 1.25e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
aov(formula = fire.fit)
Terms:
x Residuals
Sum of Squares 841.7664 69.7510
Deg. of Freedom 1 13
In this example the intercept or constant term represents the minimum damage caused
by a fire however quickly the fire brigade take to arrive. The x term represents the extra
damage caused by the fact that the distance from the nearest fire station is highly correlated
with the time it takes the fire brigade to arrive at the fire. We see from both the ANOVA
table and the T value for x that there is overwhelming evidence that the damage is related
to the distance from the fire station. Plots of standardised residuals versus fitted values
and x (not shown here) and the normal plot give us no cause to suspect that the model
assumptions are not reasonable.
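The intervals of Section 2.7 can be obtained from this fitted model with predict(); a sketch, assuming the fitted object is called fire.fit as in the output above (the new x value is an arbitrary illustration):

new <- data.frame(x = 3.5)                        # a distance within the range of the data
predict(fire.fit, new, interval = "confidence")   # interval for the mean damage at x = 3.5
predict(fire.fit, new, interval = "prediction")   # wider interval for an individual fire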
Solution. For the centred model $Y_i = \alpha + \beta(x_i - \bar x) + \varepsilon_i$:
(a) The normal equations are
$$\sum y_i = n\hat\alpha + \hat\beta\sum(x_i - \bar x), \qquad \sum y_i(x_i - \bar x) = \hat\alpha\sum(x_i - \bar x) + \hat\beta\sum(x_i - \bar x)^2,$$
which imply that $\hat\alpha = \bar y$, since $\sum(x_i - \bar x) = 0$, and $\hat\beta = S_{xy}/S_{xx}$.
(b) Now
$$E[\hat\alpha] = E\Big[\frac{1}{n}\sum Y_i\Big] = \frac{1}{n}\sum E[Y_i] = \frac{1}{n}\sum\{\alpha + \beta(x_i - \bar x)\} = \alpha$$
and
$$E[\hat\beta] = \frac{\sum(x_i - \bar x)E[Y_i]}{S_{xx}} = \frac{\sum(x_i - \bar x)\alpha}{S_{xx}} + \frac{\sum(x_i - \bar x)^2\beta}{S_{xx}} = \beta.$$
Also
$$\mathrm{Var}[\hat\alpha] = \mathrm{Var}\Big[\frac{1}{n}\sum Y_i\Big] = \frac{1}{n^2}\sum\mathrm{Var}[Y_i] = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$
and
$$\mathrm{Var}[\hat\beta] = \frac{\sum(x_i - \bar x)^2\mathrm{Var}[Y_i]}{S_{xx}^2} = \frac{\sum(x_i - \bar x)^2\sigma^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}.$$
An alternative way of solving this problem would be to write
Chapter 3
Polynomial regression
3.1 Introduction
As mentioned in Chapter 2, if the plot of standardised residuals versus x shows a systematic
pattern with, for example, mainly negative residuals for small and large values of x and
mainly positive residuals for moderate values of x this may indicate that higher powers of x
should be added to the model. We thus obtain a polynomial regression model. For example
a quadratic regression model is of the form
Yi = α + βxi + γx2i + εi , i = 1, . . . , n.
It may be best to centre the xi ’s by subtracting x̄, but this is not necessary. We may derive
the least squares estimates of α, β and γ as follows.
We have to minimise $S = \sum(y_i - \alpha - \beta x_i - \gamma x_i^2)^2$. If we differentiate S with respect to
α, β and γ and set the derivatives equal to zero, we see that the normal equations are
$$\sum y_i = n\hat\alpha + \hat\beta\sum x_i + \hat\gamma\sum x_i^2 \qquad (3.1)$$
$$\sum x_iy_i = \hat\alpha\sum x_i + \hat\beta\sum x_i^2 + \hat\gamma\sum x_i^3 \qquad (3.2)$$
$$\sum x_i^2y_i = \hat\alpha\sum x_i^2 + \hat\beta\sum x_i^3 + \hat\gamma\sum x_i^4. \qquad (3.3)$$
From (3.1),
$$\hat\alpha = \bar y - \hat\beta\bar x - \hat\gamma\,\frac{\sum x_i^2}{n}.$$
Substituting for $\hat\alpha$ in (3.2) and (3.3) we have
$$S_{xy} = \hat\beta S_{xx} + \hat\gamma S_{x,x^2}$$
$$S_{x^2,y} = \hat\beta S_{x,x^2} + \hat\gamma S_{x^2,x^2}$$
and hence
$$\hat\gamma = \frac{S_{x^2,y}S_{xx} - S_{xy}S_{x,x^2}}{S_{x^2,x^2}S_{xx} - S_{x,x^2}^2}.$$
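In R a quadratic regression can be fitted without forming these normal equations by hand; a minimal sketch with hypothetical vectors x and y:

quad.fit <- lm(y ~ x + I(x^2))   # gamma-hat is the coefficient of I(x^2)
summary(quad.fit)
# poly(x, 2) fits an equivalent model using orthogonal polynomials, which is one
# way of centring the x's to reduce the near-collinearity of x and x^2
quad.fit2 <- lm(y ~ poly(x, 2))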
3.2 R example
An analyst for a cafeteria chain wishes to investigate the relationship between the number
of self-service coffee dispensers in a cafeteria and sales of coffee. Fourteen cafeterias that
are similar in respects such as volume of business, type of clientele and location are chosen
for the experiment. The number of dispensers varies from zero (coffee is only dispensed by
a waitress) to six and is assigned randomly to each cafeteria. The results were as follows,
(sales are measured in hundreds of gallons of coffee sold).
The analyst believes that the relationship between sales and number of self-service dispensers
is quadratic in the range of observations. Sales should increase with the number of dispensers,
but if the space is cluttered with dispensers this increase slows down.
Fit a quadratic regression to these data and check the assumptions of the model. Try
also a cubic term in the model and test if this is necessary.
The data were read into R. The original variables were called ‘sales’ and ‘disp’. Variables
‘dispsq’ and ‘dispcub’ representing the square and cube of the number of dispensers were
created using the calculator. The output in the session window resulting from fitting the
quadratic model was as follows.
Call:
lm(formula = sales ~ disp + dispsq)
Residuals:
Min 1Q Median 3Q Max
-10.6643 -5.6622 -0.4655 5.5000 10.6679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 502.5560 4.8500 103.619 < 2e-16 ***
disp 80.3857 3.7861 21.232 2.81e-10 ***
dispsq -4.2488 0.6063 -7.008 2.25e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
aov(formula = disp.fit)
Terms:
disp dispsq Residuals
Sum of Squares 168740.64 3032.80 679.22
Deg. of Freedom 1 1 11
The plot of residuals versus fitted values indicates no evidence against homogeneity of
variance. The normal plot and test gives no evidence against the normality assumption. A
quadratic model fits very well. There is clear evidence (T = −7.01) that the quadratic term
is needed. The model should not, however, be used outside the range where it was fitted for
prediction purposes.
Call:
lm(formula = sales ~ disp + dispsq + dispcub)
Residuals:
Min 1Q Median 3Q Max
-12.7429 -3.8464 -0.2369 6.6857 8.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 500.3060 5.3586 93.365 4.86e-16 ***
disp 87.8857 8.4630 10.385 1.12e-06 ***
dispsq -7.6238 3.4590 -2.204 0.0521 .
dispcub 0.3750 0.3784 0.991 0.3450
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The cubic term is not needed in the model: it has t = 0.99 on 10 d.f., or equivalently F = 61/62 on 1 and 10 d.f.
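The same test of the cubic term can be carried out in R as a comparison of nested models; a sketch using the document's variable names (the object names quad.fit and cubic.fit are ours):

quad.fit  <- lm(sales ~ disp + dispsq)
cubic.fit <- lm(sales ~ disp + dispsq + dispcub)
anova(quad.fit, cubic.fit)   # F on 1 and 10 d.f., equal to the square of the t value above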
(a) Assuming a simple linear regression model find the least squares estimates of α and β.
(b) Find the standard errors of the estimates.
(c) Find a 95% confidence interval for the mean value of Y when x = 56.
(d) On analysing these data a researcher tried fitting successively the null model, a sim-
ple linear regression model and a model also including a quadratic term in x. The
resulting residual sums of squares were 23379, 952.25 and 821.46. Write down an ap-
propriate analysis of variance table and test the hypothesis that the quadratic term is
unnecessary.
$$\sum x_i^2 = 31488, \qquad \sum y_i^2 = 897639.$$
Thus we have
$$S_{xx} = 3840, \quad S_{yy} = 23378.917, \quad S_{xy} = 9280$$
and
$$\hat\beta = 2.4167, \qquad \hat\alpha = 153.917.$$
Source SS d.f. MS F
Linear Regression 22427. 1 22427. 245.71
Quadratic Regression 130.8 1 130.8 1.433
Residual 821.46 9 91.273
Total 23379. 11
From tables $F^1_9(0.1) = 3.36$, so there is no evidence against the hypothesis that the quadratic
term is not needed.
Chapter 4
Multiple Regression
4.1 Introduction
In this chapter we are going to study regression problems with several possible explanatory
variables.
Suppose we have some observations on a variable Y and two other variables x and z
which we think are related to Y . We could first fit a simple linear regression model to Y
and x,
Y = α + βx + ε.
Now suppose we plot the residuals against z, and suppose that there seems to be a linear
relation between them. This would suggest a model
Y = α + βx + γz + ε.
This is called a multiple regression model with two explanatory variables. If there were other
possible explanatory variables, we could again find the residuals
$$e_i = y_i - \hat\alpha - \hat\beta x_i - \hat\gamma z_i$$
and plot them against each new variable in turn. In general, with $p-1$ explanatory variables the multiple regression model is
$$Y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_{p-1}x_{p-1,i} + \varepsilon_i, \qquad i = 1, \ldots, n.$$
This model has p unknown parameters β0 , β1 , . . . , βp−1 . We make assumptions about the
errors similar to those for the simple linear regression model, namely that the εi are inde-
pendent and normally distributed with mean zero and constant variance σ 2 .
It is often convenient to write the multiple regression model in matrix form as
$$\begin{pmatrix}Y_1\\ Y_2\\ \vdots\\ Y_n\end{pmatrix} = \begin{pmatrix}1 & x_{11} & x_{21} & \cdots & x_{p-1,1}\\ 1 & x_{12} & x_{22} & \cdots & x_{p-1,2}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{1n} & x_{2n} & \cdots & x_{p-1,n}\end{pmatrix}\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}\end{pmatrix} + \begin{pmatrix}\varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\end{pmatrix}.$$
We often write this matrix formulation as
Y = Xβ + ε (4.1)
where Y is an (n × 1) vector of observations, X is an (n × p) matrix called the design
matrix, β is a (p × 1) vector of unknown parameters and ε is an (n × 1) vector of errors.
This formulation (4.1) is usually called the General Linear Model.
The ith row of X [call it xTi ] is the explanatory data for the ith observation, (i = 1, . . . , n).
The jth column of X is the vector of observations for the explanatory variable xj−1 , for
j = 1, . . . , p, (where x0 is the “constant explanatory variable” which is always equal to 1).
All the models we have considered so far can be written as a General Linear Model.
1. The null model (p = 1).
Yi = µ + εi for i = 1, . . . , n
is equivalent to
Y = 1µ + ε
where 1 is an (n × 1) vector of 1’s.
2. Simple linear regression (p = 2).
Yi = α + βxi + εi for i = 1, . . . , n
can be written as a General Linear Model with
$$X = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \qquad \beta = \begin{pmatrix}\alpha\\ \beta\end{pmatrix}.$$
4. Multiple regression with two explanatory variables (p = 3).
We shall only consider the case when the rank of X that is the number of linearly inde-
pendent columns is equal to p. This means that no column (i.e. no explanatory variable) can
be expressed as a linear combination of the other columns (i.e. other explanatory variables).
This condition is satisfied for many data sets in multiple regression problems.
The error assumptions given above can be written
$$E[\varepsilon_i] = 0, \quad \mathrm{Var}[\varepsilon_i] = \sigma^2 \text{ for } i = 1, \ldots, n, \quad \mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0 \text{ for } i \neq j.$$
Thus we obtain the following
E[Yi ] = xTi β
which can be written
E[Y ] = Xβ.
Also
$$\mathrm{Var}[Y_i] = \sigma^2 \text{ for } i = 1, \ldots, n \quad\text{and}\quad \mathrm{Cov}[Y_i, Y_j] = 0 \text{ for } i \neq j,$$
which can be written
$$\mathrm{Var}[Y] = \sigma^2I_n,$$
where $\mathrm{Var}[Y]$ is defined to be the matrix whose $(i,j)$th element is $\mathrm{Cov}[Y_i, Y_j]$. Under the normality assumption,
$$Y_i \sim N(x_i^T\beta, \sigma^2) \quad\text{for } i = 1, \ldots, n$$
and the $Y_i$'s are independent.
[Figure 4.1: geometric representation of least squares; the fitted value vector $\hat\mu = X\hat\beta$ is the orthogonal projection OQ of the observation vector onto the column space of X.]
The least squares estimate $\hat\beta$ minimises $S = (y - X\beta)^T(y - X\beta)$. Differentiating with respect to β, setting the derivative equal to zero and rearranging, we see that
$$X^Ty = X^TX\hat\beta.$$
This is a system of p equations in p unknowns, which are called the normal equations.
As rank(X) = p, it is possible to show that rank$(X^TX) = p$, so the (p × p) matrix
$X^TX$ is non-singular. It follows that the unique solution to the normal equations is given by
$$\hat\beta = (X^TX)^{-1}X^Ty.$$
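A sketch of this matrix computation in R, using small illustrative vectors of our own:

# Solve the normal equations t(X) %*% X %*% beta = t(X) %*% y directly
X <- cbind(1, x1 = c(1, 2, 3, 4, 5), x2 = c(2, 1, 4, 3, 6))   # example design matrix
y <- c(3, 4, 8, 9, 13)                                        # example response
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat
coef(lm(y ~ X[, -1]))   # lm() gives the same estimates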
As an example we shall show that we obtain the same estimates as in Chapter 2 for
simple linear regression if we use the matrix formulation.
$$X = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \qquad X^Ty = \begin{pmatrix}1 & 1 & \cdots & 1\\ x_1 & x_2 & \cdots & x_n\end{pmatrix}\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} = \begin{pmatrix}\sum y_i\\ \sum x_iy_i\end{pmatrix}, \qquad X^TX = \begin{pmatrix}n & \sum x_i\\ \sum x_i & \sum x_i^2\end{pmatrix}.$$
The determinant of $X^TX$ is given by
$$|X^TX| = n\sum x_i^2 - \Big(\sum x_i\Big)^2 = nS_{x,x},$$
so that
$$(X^TX)^{-1} = \frac{1}{nS_{x,x}}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix},$$
and hence $\hat\beta = (X^TX)^{-1}X^Ty$ gives $\hat\beta = (n\sum x_iy_i - \sum x_i\sum y_i)/(nS_{x,x})$ and $\hat\alpha = \bar y - \hat\beta\bar x$,
which is the same result as we obtained in Chapter 2.
The fitted values in Figure 4.1 are given by
$$\hat\mu = X\hat\beta = \vec{OQ}.$$
As before, the
total sum of squares is equal to sum of squares due to regression plus residual
sum of squares.
We can illustrate this for simple linear regression. The fitted values are
$$\hat\mu_i = x_i^T\hat\beta = (1\ \ x_i)\begin{pmatrix}\hat\alpha\\ \hat\beta\end{pmatrix} = \hat\alpha + \hat\beta x_i,$$
in agreement with our results in Chapter 2.
To summarise, for the General Linear Model
$$y = X\beta + \varepsilon$$
the least squares estimates are given by
$$\hat\beta = (X^TX)^{-1}X^Ty,$$
the fitted values by
$$\hat\mu = X\hat\beta,$$
the residuals by
$$e = y - \hat\mu$$
and the residual sum of squares by
$$S_{\min} = e^Te = y^Ty - \hat\beta^TX^Ty.$$
Theorem 4.1 The least squares estimator $\hat\beta$ is an unbiased estimator of β with variance-covariance matrix
$$\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}.$$
We can check the model assumptions by plotting the standardised residuals against:
1. Fitted values to check homogeneity of variance.
2. Normal order statistics to check normality.
3. Explanatory variables in the model to check for non-linear terms.
4. Possible explanatory variables not in the model to check if they might be included.
5. The serial order to check for correlated errors.
Note that s2 is unbiased only when the model is adequate, i.e. when all the relevant
explanatory variables are included – otherwise it is upwardly biased.
We can check that the theorems give us the same results as we found in Chapter 2 for
simple linear regression. We have seen earlier that
$$(X^TX)^{-1} = \frac{1}{nS_{x,x}}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix}.$$
Now, by Theorem 4.1, $\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$. Thus
$$\mathrm{Var}[\hat\alpha] = \frac{\sigma^2\sum x_i^2}{nS_{x,x}},$$
which, by writing $\sum x_i^2 = \sum x_i^2 - n\bar x^2 + n\bar x^2$, can be written as $\sigma^2\big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{x,x}}\big)$, as before. Also
$$\mathrm{Cov}[\hat\alpha, \hat\beta] = \frac{-\sigma^2\sum x_i}{nS_{x,x}} = \frac{-\sigma^2\bar x}{S_{x,x}}$$
and
$$\mathrm{Var}[\hat\beta] = \frac{\sigma^2}{S_{x,x}}.$$
The quantity $v_j$ is given by
$$v_j = x_j^T(X^TX)^{-1}x_j = \frac{1}{nS_{x,x}}\,(1,\ x_j)\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix}\begin{pmatrix}1\\ x_j\end{pmatrix}.$$
We shall leave it as an exercise to show that this simplifies to
$$v_j = \frac{1}{n} + \frac{(x_j - \bar x)^2}{S_{x,x}}.$$
4.4 Testing the Adequacy of Different Models
4.4.1 F-test for the Overall Significance of Regression
Suppose we wish to test the hypothesis
H0 : β1 = β2 = . . . = βp−1 = 0
(i.e. all coefficients except β0 are zero) versus
H1 : at least one of the coefficients is non-zero.
Under $H_0$, the model reduces to the null model
$$Y = \mathbf{1}\mu + \varepsilon,$$
i.e. in testing $H_0$ we are asking if there is sufficient evidence to reject the null model.
Consider the statistic $F^\star$ defined by
$$F^\star = \frac{(\text{regression SS})/(p-1)}{s^2}, \qquad\text{where } s^2 = S_{\min}/(n-p),$$
and so
$$F^\star = \frac{\text{unbiased estimate of }\sigma^2\ \text{(if $H_0$ is true)}}{\text{unbiased estimate of }\sigma^2\ \text{(always)}}.$$
Thus if $H_0$ is true, $F^\star \approx 1$, and large values of $F^\star$ indicate departures from $H_0$. Under $H_0$,
$F^\star$ has an F distribution with degrees of freedom $p-1$ and $n-p$, and a test at the α level
for $H_0$ is given by rejecting $H_0$ (i.e. "the overall regression is significant") if
$$F^\star > F^{p-1}_{n-p}(\alpha).$$
The ANOVA (Analysis of Variance) table is

Source               d.f.    SS                       MS = SS/d.f.   F*
Overall regression   p − 1   SS(x1, x2, ..., xp−1)    SS/(p − 1)     MS/s²
Residual             n − p   Smin (RSS)               s²
Total                n − 1   S(y,y) (TSS)
Suppose instead we wish to test whether a subset $x_q, \ldots, x_{p-1}$ of the explanatory variables can be deleted from the model. The extra sum of squares due to these variables is
$$SS(x_q, \ldots, x_{p-1} \mid x_1, \ldots, x_{q-1}) = SS(x_1, \ldots, x_{p-1}) - SS(x_1, \ldots, x_{q-1}) = RSS_r - RSS_f,$$
the difference between the regression SS for the full model (f) and that for the reduced model (r), or equivalently between the RSS under the reduced model and the RSS under the full model.
To determine whether the change in sum of squares is significant, we must test the hypothesis
H0 : βq = βq+1 = . . . = βp−1 = 0
versus
H1 : at least one of these is non-zero.
It can be shown that, if $H_0$ is true,
$$F^\star = \frac{SS(x_q, \ldots, x_{p-1} \mid x_1, \ldots, x_{q-1})/(p-q)}{s^2} \sim F^{p-q}_{n-p}.$$
In the ANOVA table we use the notation xq , . . . , xp−1 |x1 , . . . , xq−1 to denote that this is the
effect of the variables xq , . . . , xp−1 given that the variables x1 , . . . , xq−1 are already included
in the model.
Note that the F-test for the inclusion of a single variable $x_{p-1}$ (this is the case q = p − 1)
can also be performed by an equivalent t-test where
$$t = \frac{\hat\beta_{p-1}}{\mathrm{se}(\hat\beta_{p-1})},$$
where $\mathrm{se}(\hat\beta_{p-1})$ is the estimated standard error of $\hat\beta_{p-1}$. We compare the value t with $t_{n-p}$
for a two-sided test of $H_0: \beta_{p-1} = 0$. In fact $F^\star = t^2$.
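In R the subset F-test is simply a comparison of the reduced and full models; a sketch with hypothetical model objects and variable names:

full    <- lm(y ~ x1 + x2 + x3 + x4)   # full model
reduced <- lm(y ~ x1 + x2)             # model with x3 and x4 deleted
anova(reduced, full)   # F = [(RSS_r - RSS_f)/(p - q)] / s^2 on (p - q, n - p) d.f.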
4.5 Prediction
As for simple linear regression, once the model is accepted we can use it to predict
1. Mean response,
2. Individual response.
Suppose we are trying to predict when the values of the explanatory variables are given by the vector $x_0$. The linear predictor is
$$\hat Y = x_0^T\hat\beta.$$
The predicted value $\hat Y$ has mean $x_0^T\beta$ and variance $\sigma^2x_0^T(X^TX)^{-1}x_0$, which is estimated
by $s^2x_0^T(X^TX)^{-1}x_0$.
Under the normality assumption, it can be shown that $\hat Y$ is normal and independent of
$s^2$. The interval
$$x_0^T\hat\beta \pm s\,t_{n-p}\big(\tfrac{\alpha}{2}\big)\,[x_0^T(X^TX)^{-1}x_0]^{1/2}$$
is a 100(1 − α)% confidence interval for the mean response. As before, when we are trying to
predict an individual observation there is greater uncertainty and the corresponding interval
in this case is
$$x_0^T\hat\beta \pm s\,t_{n-p}\big(\tfrac{\alpha}{2}\big)\,[1 + x_0^T(X^TX)^{-1}x_0]^{1/2}.$$
squares. A large model is expensive to monitor and use for prediction since a large number of
explanatory variables must be measured for each observation. If we fit too many variables the
model may be invalid in the desired range of prediction. The decrease in bias is insufficient
to compensate for the increase in variance (for estimation or prediction) causing an increase
in mean square error (MSE), which is defined by
$$\mathrm{MSE}[\hat\theta] = \mathrm{Var}[\hat\theta] + \big(\mathrm{bias}[\hat\theta]\big)^2.$$
One automatic method for selecting a regression model is backwards elimination, which proceeds as follows.
1. Fit the multiple regression model with all explanatory variables. (Note that, if some
variables are close to being linear combinations of others, a high level of computational
accuracy is required to ensure that X T X, a near singular matrix, can be inverted to
give sensible results.)
2. Calculate the F (or t) statistic for the exclusion of each explanatory variable in turn, and find the variable with the lowest value.
3. Compare the lowest value of the statistic with the predetermined significance level (say
α = 0.05) and omit it if necessary. (Note that the lowest value of the F statistic will
correspond with the variable with lowest value of |estimate|/standard error).
4. We now have a new model with one fewer variables than we started with. We return
to step 2 and calculate the statistic of the exclusion of each variable still in the model.
A procedure which works in the opposite way to backwards elimination is called Stepwise
regression or Modified ‘forward regression’.
1. Start with the null model containing no explanatory variables.
2. Introduce the explanatory variable with the highest F or t-value for inclusion.
3. After each inclusion, test the other variables in the current model to see whether they
can be omitted. Use say α = 0.10 for omission and α = 0.05 for inclusion.
A difficulty is that, at each stage, the estimate of σ 2 used may be upwardly biased and
hence not reliable. This can be overcome by using s2 based on all the variables, if this is
computationally feasible.
Another possibility is to leave out step 3 i.e. never omit a variable once it has been
included.
Backwards elimination and Stepwise regression are illustrated in a worked example in
Section 4.9.
1. Consider the residual mean square s². As the number of regressors increases, s² steadily
decreases to a relatively stable minimum and thereafter behaves irregularly. Thus a plot
of s² versus the number of unknown parameters may indicate the appropriate number
of explanatory variables.
2. The coefficient of determination
$$R^2 = 1 - S_{\min}/S_{(y,y)}$$
always increases as variables are added, so it cannot by itself choose between models. A modified version is
$$\bar R^2 = 1 - (n-1)\frac{s^2}{S_{(y,y)}}.$$
With this definition, an extra variable included only increases $\bar R^2$ if its F-value is
greater than 1. However maximising $\bar R^2$ is equivalent to minimising $s^2$, so it does not
add anything.
3. Mallow's $C_k$ statistic, $C_k = RSS_k/s^2 - (n - 2k)$, where $RSS_k$ is the residual sum of
squares of a submodel with k parameters and $s^2$ is the residual mean square from the
full model. If the submodel is adequate, $C_k$ should be close to k.
4. Compare these models using s2 and Mallow’s Ck .
5. Write a final verbal report drawing conclusions and making recommendations
Using a best subsets regression routine in R (for example the regsubsets function in the leaps package) it is possible to find the variables corresponding to the smallest residual sum of squares for 1, 2, ... regressors, together with the
values of s², Ck, R² and R̄². In deciding which model to adopt we do not necessarily take
the one with the minimum value of s² or Ck. Check whether Ck is close to k. Also fit the model
and see if some terms are not significant. The point is that there is really no right answer.
A number of different models will usually fit the data well and give similar predictions.
Deciding which to use is as much an art as a science. If in doubt it is probably better to
choose a model with fewer variables.
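One convenient automatic search in R is step(), which adds or drops terms using AIC rather than fixed F-to-enter and F-to-remove levels; a sketch with a hypothetical data frame and variable names:

dat  <- data.frame(y, x1, x2, x3, x4)              # hypothetical data frame
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
step(full, direction = "backward")                 # backwards elimination
step(lm(y ~ 1, data = dat), scope = ~ x1 + x2 + x3 + x4,
     direction = "both")                           # stepwise search from the null model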
The leverage of the ith observation is $v_i = x_i^T(X^TX)^{-1}x_i$. The leverages sum to p, so we would expect $v_i \approx p/n$. Hence if some $v_i$ are larger than $2p/n$ these observations
have quite high leverage, and if any $v_i$ are larger than $3p/n$ then they have very high leverage.
Example. Suppose we consider a simple linear regression model with $x_i = i$ for $i = 1, 2, \ldots, 10$. Then
$$X = \begin{pmatrix}1 & 1\\ 1 & 2\\ \vdots & \vdots\\ 1 & 10\end{pmatrix}, \qquad X^TX = \begin{pmatrix}10 & 55\\ 55 & 385\end{pmatrix}, \qquad (X^TX)^{-1} = \frac{1}{825}\begin{pmatrix}385 & -55\\ -55 & 10\end{pmatrix},$$
and
$$v_i = (385 - 110x_i + 10x_i^2)/825.$$
So for $i = 1, \ldots, 10$ the values of $v_i$ are 0.346, 0.248, 0.176, 0.127, 0.103, 0.103, 0.127, ..., 0.346. Now
suppose we move the point $x_{10}$ from 10 to 20. Then
$$X^TX = \begin{pmatrix}10 & 65\\ 65 & 685\end{pmatrix}, \qquad (X^TX)^{-1} = \frac{1}{2625}\begin{pmatrix}685 & -65\\ -65 & 10\end{pmatrix}.$$
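The leverages in this example can be verified in R; a small sketch:

x <- 1:10
X <- cbind(1, x)
v <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # leverages v_i
round(v, 3)                                    # compare with the values computed above
# for a fitted lm object the same quantities are returned by hatvalues(fit)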
A measure of the influence of the ith observation is Cook's distance
$$D_i = \frac{(\hat\beta - \hat\beta_{(i)})^TX^TX(\hat\beta - \hat\beta_{(i)})}{p\,s^2},$$
where $\hat\beta_{(i)}$ is the estimate of β omitting the ith observation. This can be thought of as a scaled distance between $\hat\beta$ and $\hat\beta_{(i)}$. This would suggest that we need to fit n + 1 regressions, one
using the full data set and n using the reduced data sets. This is not the case,
since we can rewrite $D_i$ as
$$D_i = \frac{1}{p}\,\frac{v_i}{(1 - v_i)}\,t_i^2,$$
where $t_i = e_i/[s^2(1 - v_i)]^{1/2}$ is the ith standardised residual.
An observation with a large value of $D_i$ compared to the other observations is influential
in determining the parameters of the model. If this observation is omitted, the estimates of
the parameters will change greatly. To decide what is 'large', a useful cut-off point is the
50% point of an F distribution with p and n − p degrees of freedom, $F^p_{n-p}(0.50)$. This value can be found
using R.
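These quantities are available directly in R; a sketch for a hypothetical fitted object fit:

D  <- cooks.distance(fit)
ti <- rstandard(fit)                       # standardised residuals t_i
vi <- hatvalues(fit)                       # leverages v_i
p  <- length(coef(fit))
all.equal(D, ti^2 * vi / (p * (1 - vi)))   # checks the rewritten form of D_i
D[D > qf(0.5, p, df.residual(fit))]        # observations beyond the F cut-off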
The Durbin-Watson statistic is
$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2},$$
where $e_1, \ldots, e_n$ are the time ordered crude residuals. To test the null hypothesis $H_0$ that the errors are independent versus
$H_1$ that the errors are positively autocorrelated the form of the test is as follows:
If d < dL,α reject H0 at the α level
If d > dU,α don’t reject H0 at the α level
If dL,α < d < dU,α the test is inconclusive.
The values dL,α and dU,α may be found in suitable tables for example in BOC.
Example. Consider the steam data of Practical 2. This is an example of a data set collected
in time order. If we fit the model with the explanatory variables X and FA, the value of the
Durbin-Watson statistic is 1.85. From the tables in BOC for n = 25, k − 1 = 2, dL,0.05 = 1.21
and dU,0.05 = 1.55, so we don't reject the hypothesis of independence at the 5% level.
To test H0 versus an alternative that the errors are negatively autocorrelated the form
of the test is:
If 4 − d < dL,α reject H0 at the α level
If 4 − d > dU,α don’t reject H0 at the α level
If dL,α < 4 − d < dU,α the test is inconclusive.
To test H0 versus a general alternative that the errors are autocorrelated the form of the test is:
If d < dL,α/2 or 4 − d < dL,α/2 reject H0 at the α level
If d > dU,α/2 or 4 − d > dU,α/2 don’t reject H0 at the α level
Otherwise the test is inconclusive.
Note that this test relies heavily on the underlying data being normally distributed. In
practice positive autocorrelation is much more common than negative autocorrelation.
For data which is collected in time order, for example many economic data sets, it is
important to check that the errors are not autocorrelated. Otherwise a large value of R2 and
significant regressors may be found when there is really no relationship.
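The Durbin-Watson test is available in R through add-on packages; a sketch assuming the lmtest package is installed and using hypothetical names for the steam data (the variable names X and FA are the document's):

library(lmtest)                              # assumed installed; provides dwtest()
steam.fit <- lm(Y ~ X + FA, data = steam)    # hypothetical response and data frame names
dwtest(steam.fit)                            # H1: positive autocorrelation
dwtest(steam.fit, alternative = "two.sided") # general alternative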
For the model $Y_i = \beta_1x_i + \beta_2x_i^2 + \varepsilon_i$ (no intercept), the design matrix and parameter vector are
$$X = \begin{pmatrix}x_1 & x_1^2\\ x_2 & x_2^2\\ \vdots & \vdots\\ x_n & x_n^2\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}.$$
It follows that
$$X^TX = \begin{pmatrix}\sum x^2 & \sum x^3\\ \sum x^3 & \sum x^4\end{pmatrix} \qquad\text{and}\qquad (X^TX)^{-1} = \frac{1}{\Delta}\begin{pmatrix}\sum x^4 & -\sum x^3\\ -\sum x^3 & \sum x^2\end{pmatrix},$$
where $\Delta = \sum x^2\sum x^4 - (\sum x^3)^2$. Also
$$X^TY = \begin{pmatrix}\sum xY\\ \sum x^2Y\end{pmatrix},$$
so that
$$\hat\beta = \begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix} = (X^TX)^{-1}X^TY = \frac{1}{\Delta}\begin{pmatrix}\sum x^4\sum xY - \sum x^3\sum x^2Y\\ \sum x^2\sum x^2Y - \sum x^3\sum xY\end{pmatrix}.$$
We know that $\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$, so
$$\mathrm{Var}[\hat\beta_1] = \sigma^2\sum x^4\big/\Delta \qquad\text{and}\qquad \mathrm{Cov}[\hat\beta_1, \hat\beta_2] = -\sigma^2\sum x^3\big/\Delta.$$
$$E[\hat\beta_1] = E\left[\frac{\sum x^4\sum xY - \sum x^3\sum x^2Y}{\Delta}\right] = \Big[\sum x^4\sum x(\beta_1x + \beta_2x^2) - \sum x^3\sum x^2(\beta_1x + \beta_2x^2)\Big]\Big/\Delta$$
$$= \Big[\beta_1\Big(\sum x^4\sum x^2 - \big(\sum x^3\big)^2\Big) + \beta_2\Big(\sum x^4\sum x^3 - \sum x^3\sum x^4\Big)\Big]\Big/\Delta = \beta_1.$$
(d) Compare the ‘best’ regressions on two and three variables using Mallow’s Ck statistic.
Solution. There are 54 observations and the full model has 4 explanatory variables, so
$$s^2 = \frac{0.1098}{49} = 0.00224.$$
(a) For backward elimination we test if we can exclude the variable whose exclusion in-
creases the residual sum of squares least. Thus the first candidate for exclusion
is X4 . The F value is (0.1099 − 0.1098)/s2 = 0.044 and, as this is less than 1,
we can certainly exclude X4 . Once X4 has been excluded then excluding the vari-
able X1 produces the smallest increase in residual sum of squares. The F value is
(0.7431 − 0.1099)/s2 = 282.6, so there is clear evidence against the hypothesis that X1
should be excluded and hence our final model is X1 + X2 + X3 .
(b) For stepwise regression we start with the null model and test if we should include the
variable whose inclusion reduces the residual sum of squares most.
The first candidate for inclusion is X4 . The F value is (3.973−1.878)/s2 = 934.9, so we
must include X4 . Next X3 produces the biggest reduction in residual sum of squares
from the model with just X4 ; the F value is (1.878 − 1.245)/s2 = 282.5, so we must
also include X3 . Then X2 produces the biggest reduction in residual sum of squares
from X3 + X4 ; the F value is (1.245 − 0.465)/s2 = 348.1, so we have to include X2 .
We must now consider whether, with X2 included, X3 or X4 could be omitted. The F
values are (1.392 − 0.465)/s2 = 413.7 and (0.7431 − 0.465)/s2 = 124.1, so we cannot
exclude either. Now can we include X1 ? The F value is (0.465 − 0.1098)/s2 = 158.5,
so we must include X1 . Now we can exclude X4 as for backward elimination but no
other variables, so the final model is X1 + X2 + X3 .
(c) If we use forward fitting but never exclude variables then our final model will be
X1 + X 2 + X3 + X 4 .
Although X4 is the single most informative variable, it is unnecessary when the others
are included in the model. This illustrates why forward fitting is not to be recom-
mended.
(d) The best model on two variables is X2 + X3, with k = 3. The value of Mallow's Ck is
given by
$$C_3 = \frac{0.7431}{s^2} - (54 - 6) = 283.6.$$
Similarly, for three variables the best model is X1 + X2 + X3, with
$$C_4 = \frac{0.1099}{s^2} - (54 - 8) = 3.04.$$
Clearly the second model is to be preferred.
Chapter 5
5.1 Introduction
In this chapter we discuss experiments whose main aim is to study and compare the effects
of treatments (diets, varieties, doses) by measuring response (yield, weight gain) on plots or
units (points, subjects, patients). In general the units are often grouped into blocks (groups,
sets, strata) of ‘similar’ plots or units.
For the analysis we use the general theory of the linear model (Chapter 4). We assume
that we have a response Y which is continuous (e.g. a normal random variable). We also
have possible explanatory variables which are discrete or qualitative called factors. Here we
consider treatment factors whose values are assigned by the experimenter. The values or
categories of a factor are called levels, thus the levels of a treatment factor are labels for the
different treatments.
The one-way model is
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, \ldots, t,\ \ j = 1, \ldots, r_i,$$
where $Y_{ij}$ is the response of the jth unit receiving the ith treatment, µ is an overall mean effect, αi is the effect due
to the ith treatment and εij is random error. We have the usual least-squares assumptions
that the εij ’s have zero mean and constant variance σ 2 and are uncorrelated. In addition,
we usually make the normality assumption that the εij ’s are independent normal random
variables. Together these imply that
Yij ∼ N (µ + αi , σ 2 )
and the Yij ’s are independent. We may note the following:
1. If treatments act multiplicatively on the response, so that Yij = µαi εij and the variance
of Yij is non-constant, then by taking logs of the response, we may obtain a valid one-
way model.
2. Units, even those given the same treatment, are assumed to be uncorrelated.
Writing the one-way model as a General Linear Model, we find
$$X^TX = \begin{pmatrix}n & r_1 & r_2 & \cdots & r_t\\ r_1 & r_1 & 0 & \cdots & 0\\ r_2 & 0 & r_2 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ r_t & 0 & 0 & \cdots & r_t\end{pmatrix}, \qquad X^Ty = \begin{pmatrix}\sum_i\sum_jy_{ij}\\ \sum_jy_{1j}\\ \sum_jy_{2j}\\ \vdots\\ \sum_jy_{tj}\end{pmatrix} = \begin{pmatrix}G\\ T_1\\ T_2\\ \vdots\\ T_t\end{pmatrix},$$
where G is the grand total and Ti is the total for the ith treatment.
The normal equations are therefore (note that we could also have derived the normal
equations by direct differentiation)
$$n\hat\mu + \sum_{i=1}^{t}r_i\hat\alpha_i = \sum_i\sum_jy_{ij} = G \qquad (5.1)$$
$$r_i\hat\mu + r_i\hat\alpha_i = \sum_jy_{ij} = T_i \quad\text{for } i = 1, 2, \ldots, t. \qquad (5.2)$$
The equations are not independent, since summing (5.2) over $i = 1, \ldots, t$ gives (5.1); so there are only t independent
equations for t + 1 unknowns and there are infinitely many solutions. Does this matter?
No, because the fitted values are always the same whatever solution is chosen, and hence the
residuals and residual sum of squares are the same.
In order to solve the equations we have to add another equation or constraint on the
estimates. We consider three main ways of doing this.
1. We take the estimate of µ equal to the overall mean,
$$\hat\mu = \bar y_{\cdot\cdot} = \frac{G}{n} = \frac{\sum\sum y_{ij}}{n}.$$
This implies that
$$\hat\alpha_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot},$$
where $\bar y_{i\cdot} = T_i/r_i$ and $\bar y_{\cdot\cdot} = G/n$.
2. We take the estimate of µ equal to zero, $\hat\mu = 0$. This implies that
$$\hat\alpha_i = T_i/r_i = \bar y_{i\cdot}.$$
3. We take $\hat\alpha_1 = 0$. It follows that $\hat\mu = \bar y_{1\cdot}$ and hence that $\hat\alpha_i = \bar y_{i\cdot} - \bar y_{1\cdot}$ for $i = 2, \ldots, t$.
For each of the three solutions the fitted values (estimated means) are
$$\hat\mu + \hat\alpha_i = \bar y_{i\cdot}$$
for a unit in the ith treatment group. So the fitted values are identical, as are the residuals,
which are observed value minus fitted value:
$$e_{ij} = y_{ij} - \bar y_{i\cdot},$$
and hence so are the residual sum of squares and $s^2$. The residual sum of squares is given by
$$S_{\min} = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2.$$
The total sum of squares
$$S_{y,y} = \sum_i\sum_j(y_{ij} - \bar y_{\cdot\cdot})^2$$
is the residual sum of squares after fitting the null model $Y_{ij} = \mu + \varepsilon_{ij}$, i.e. a model
with no treatment effects where $\alpha_1 = \alpha_2 = \cdots = \alpha_t$. It can be decomposed as
$$S_{y,y} = \sum_i\sum_j\big[(y_{ij} - \bar y_{i\cdot}) + (\bar y_{i\cdot} - \bar y_{\cdot\cdot})\big]^2 = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2 + \sum_ir_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2,$$
since $\sum_{j=1}^{r_i}(y_{ij} - \bar y_{i\cdot}) = 0$. Hence
$$S_{y,y} = S_{\min} + \text{Sum of Squares between treatments},$$
where $S_{\min} = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2$ is the residual sum of squares under the full model
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
and the Sum of Squares between treatments $= \sum_ir_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$ is the difference in residual sums
of squares between the null model and the full model, i.e. the extra SS due to the treatment
effects, or the increase in residual sum of squares when we assume $H_0: \alpha_1 = \cdots = \alpha_t$ to be
true. This is the extra SS due to $H_0$. In order to test $H_0$, note that
$$E[S_{\min}] = E\Big[\sum_i\sum_j(Y_{ij} - \bar Y_{i\cdot})^2\Big] = \sum_iE\Big[\sum_j(Y_{ij} - \bar Y_{i\cdot})^2\Big] = \sum_i(r_i - 1)\sigma^2 = (n - t)\sigma^2,$$
and hence
$$s^2 = \frac{S_{\min}}{n - t}$$
is an unbiased estimate of $\sigma^2$. Note that the denominator is the number of observations
minus the number of independent parameters, or the number of observations minus the rank
of the design matrix.
Under the null model $Y_{ij} = \mu + \varepsilon_{ij}$, all treatments have the same effect. If $H_0$ is true,
$$E[S_{y,y}] = (n - 1)\sigma^2$$
and hence $E[\text{Treatment SS}] = E[S_{y,y}] - E[S_{\min}] = (t - 1)\sigma^2$. Hence if $H_0$ is true
$$\frac{\text{Treatment SS}}{t - 1}$$
is an unbiased estimate of $\sigma^2$. Under the normality assumption it can be shown that
$$\frac{\text{Treatment SS}}{\sigma^2} \sim \chi^2_{t-1}$$
under $H_0$, and
$$\frac{S_{\min}}{\sigma^2} \sim \chi^2_{n-t}$$
(whether $H_0$ is true or not), and that they are independent. Hence
$$F^\star = \frac{\text{Treatment SS}/(t-1)}{s^2} \sim F^{t-1}_{n-t}$$
under $H_0$. So we reject $H_0$ (i.e. conclude that there is sufficient evidence to suggest treatment
differences) if $F^\star > F^{t-1}_{n-t}(\alpha)$.
We can set out the calculations in an Analysis of Variance table. Note that, by writing
$\bar y_{i\cdot} = T_i/r_i$ and $\bar y_{\cdot\cdot} = G/n$ and expanding the square, it is straightforward to show that the
sum of squares between treatments can be written as $\sum_iT_i^2/r_i - G^2/n$.

Source       d.f.    SS                  MS           F
Treatments   t − 1   Σ Ti²/ri − G²/n     SS/(t − 1)   MS/s²
Residual     n − t   by subtraction      s²
Total        n − 1   Σij yij² − G²/n
Example. The following data (the blood data) give the responses of animals receiving four different diets:

DIET
A B C D
62 63 68 56
60 67 66 62
63 71 71 60
59 64 67 61
65 68 63
66 68 64
63
59
244 396 408 488
Solution. The grand total $G = 1536$ and $\sum y_{ij}^2 = 98644$. The ANOVA table for these data
is as follows.
Source d.f. SS MS F
Diets 3 228 76 13.6
Residual 20 112 5.6
Total 23 340
From tables we find $F^3_{20}(0.001) = 8.10$ and therefore there is very strong evidence for differences between the treatments, which in this case are the diets.
The following table gives the fitted values and residuals.
DIET
A B C D
61 +1 66 −3 68 +0 61 −5
61 −1 66 +1 68 −2 61 +1
61 +2 66 +5 68 +3 61 −1
61 −2 66 −2 68 −1 61 +0
66 −1 68 +0 61 +2
66 +0 68 +0 61 +3
61 +2
61 −2
Note that from the ANOVA table s2 = 5.6, so that s = 2.4. The largest absolute crude
residual is 5, just over 2s. In a sample of size 24 this is not at all surprising.
We could plot the residuals (or the standardised residuals) against both the fitted values
and the treatment group number to check for homogeneity of variance.
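A sketch of this one-way analysis in R, entering the diet data by hand (the object names are ours):

y    <- c(62, 60, 63, 59,                      # diet A
          63, 67, 71, 64, 65, 66,              # diet B
          68, 66, 71, 67, 68, 68,              # diet C
          56, 62, 60, 61, 63, 64, 63, 59)      # diet D
diet <- factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))

blood.aov <- aov(y ~ diet)
summary(blood.aov)                           # F = 13.6 on 3 and 20 d.f., as in the table above
plot(fitted(blood.aov), residuals(blood.aov))  # check homogeneity of variance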
As a more formal test we can use Bartlett’s test for Homogeneity of Variance. Bartlett’s
test is used to determine whether the variances within each treatment are in fact equal. It
assumes that the units allocated to different treatments are independent random samples
from normal populations, which is implied by the assumptions made earlier. Let $s_i^2$ be the
sample variance in the ith treatment group, so that
$$s_i^2 = \frac{\sum_j(y_{ij} - \bar y_{i\cdot})^2}{r_i - 1}.$$
The test statistic is
$$B = \frac{(n-t)\ln s^2 - \sum_i(r_i - 1)\ln s_i^2}{1 + \dfrac{1}{3(t-1)}\Big[\sum_i\dfrac{1}{r_i - 1} - \dfrac{1}{n-t}\Big]}.$$
For large n, under the null hypothesis $H_0$ of equal variances, B has approximately a $\chi^2_{t-1}$ distribution.
Hence we reject $H_0$ if $B > \chi^2_{t-1}(\alpha)$.
As an example we consider the blood data.
ȳi· 61 66 68 61
s2i 3.33 8.00 2.80 6.86
ri 4 6 6 8
Here n = 24 and t = 4. We find that B = 1.67. Comparing with the tables of $\chi^2_3$, we find
the P value lies between 0.5 and 0.75, so there is no evidence against $H_0$.
Strictly for the χ2 approximation to be appropriate, the ri should all be greater than or
equal to 5.
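Bartlett's test is built into R; a one-line sketch continuing with the y and diet objects above:

bartlett.test(y ~ diet)   # Bartlett's K-squared on 3 d.f.; compare with B = 1.67 above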
5.4 Testing specific comparisons
We now consider the problem of testing specific comparisons between treatments. These
comparisons must have been planned before the experiment was performed. If not use the
method(s) of multiple comparisons, see section 6.6.
For example, to compare diets A and B we test $H_0: \alpha_1 = \alpha_2$ using
$$t^\star = \frac{\bar y_{1\cdot} - \bar y_{2\cdot}}{s\,(1/r_1 + 1/r_2)^{1/2}} = \frac{61 - 66}{\sqrt{5.6(1/4 + 1/6)}} = -3.27.$$
Now $n - t = 20$ and hence the P value lies between 0.002 and 0.005. There is very strong
evidence against $H_0$.
In this example a 100(1 − α)% confidence interval for $\alpha_1 - \alpha_2$ is given by
$$-5 \pm t_{20}(\alpha/2)\sqrt{5.6(1/4 + 1/6)}.$$
To compare the fourth treatment with the average of the other three we test $H_0: \alpha_1 + \alpha_2 + \alpha_3 - 3\alpha_4 = 0$. Now
$$\mathrm{Var}[\bar Y_{1\cdot} + \bar Y_{2\cdot} + \bar Y_{3\cdot} - 3\bar Y_{4\cdot}] = \mathrm{Var}[\bar Y_{1\cdot}] + \mathrm{Var}[\bar Y_{2\cdot}] + \mathrm{Var}[\bar Y_{3\cdot}] + 9\,\mathrm{Var}[\bar Y_{4\cdot}] = \frac{\sigma^2}{r_1} + \frac{\sigma^2}{r_2} + \frac{\sigma^2}{r_3} + \frac{9\sigma^2}{r_4}.$$
Thus we see that the test statistic is given by
$$t^\star = \frac{\bar y_{1\cdot} + \bar y_{2\cdot} + \bar y_{3\cdot} - 3\bar y_{4\cdot}}{s\,(1/r_1 + 1/r_2 + 1/r_3 + 9/r_4)^{1/2}}.$$
If H0 is true, t? has a t-distribution with n − t degrees of freedom.
As an example suppose we wish to compare treatment D with the rest for the blood data.
We see that
$$t^\star = \frac{61 + 66 + 68 - 3\times 61}{\sqrt{5.6(1/4 + 1/6 + 1/6 + 9/8)}} = 3.88$$
and from tables the P value is less than 0.001. There is thus strong evidence for a difference.
Note that we cannot test the hypothesis that $\alpha_1 = 60$, say. But we could test the hypothesis
that $\mu + \alpha_1 = 60$, since $\hat\mu + \hat\alpha_1$ is the same whatever solution we choose.
A contrast is a linear combination $\sum a_i\alpha_i$ of the treatment effects whose coefficients satisfy $\sum a_i = 0$.
The coefficients can be applied to the treatment totals $T_i$, the means $\bar y_{i\cdot}$ or the estimates $\hat\alpha_i$ of the
treatment effects.
As an example we consider the Caffeine data — see practical 7. There are three treat-
ments (t = 3). For a paired comparison to test treatment 2 (100mg caffeine) against treat-
ment 3 (200mg caffeine)
aT = ( 0 1 −1 )
We estimate the average difference $(\alpha_2 - \alpha_3)$ in effect between treatments 2 and 3 by
$$\hat\alpha_2 - \hat\alpha_3 = \bar y_{2\cdot} - \bar y_{3\cdot}.$$
Note that
$$E\Big[\sum a_i\hat\alpha_i\Big] = \sum a_i\alpha_i,$$
so our estimate is unbiased, and
$$\mathrm{Var}\Big[\sum a_i\hat\alpha_i\Big] = \mathrm{Var}\Big[\sum a_i\bar y_{i\cdot}\Big] = \sum a_i^2\,\mathrm{Var}[\bar y_{i\cdot}] = \frac{\sigma^2}{r}\sum a_i^2.$$
Thus to test the null hypothesis $H_0: \alpha_2 - \alpha_3 = 0$, that there is no difference in effect between
the 100mg and 200mg doses, using the usual t-test we obtain
$$t^\star = \frac{(\hat\alpha_2 - \hat\alpha_3) - 0}{\sqrt{2s^2/r}} \sim t_{n-t} \quad\text{under } H_0,$$
or equivalently
$$F^\star = t^{\star2} = \frac{(\hat\alpha_2 - \hat\alpha_3)^2\,r/2}{s^2} \sim F^1_{n-t} \quad\text{under } H_0.$$
The numerator is called the SS due to the contrast a.
For a grouped comparison, to test the effect of treatment 1 (no caffeine) against the mean
effect of treatments 2 and 3 (with caffeine) we use
$$b^T = (\,2 \ \ -1 \ \ -1\,).$$
Thus
$$b^T\alpha = 2\alpha_1 - \alpha_2 - \alpha_3 = 2\Big(\alpha_1 - \frac{\alpha_2 + \alpha_3}{2}\Big),$$
which can be estimated by
$$b^T\hat\alpha = 2\hat\alpha_1 - \hat\alpha_2 - \hat\alpha_3 = 2\bar y_{1\cdot} - \bar y_{2\cdot} - \bar y_{3\cdot}.$$
In general the sum of squares due to a contrast a is
$$\frac{r\,(\sum a_i\hat\alpha_i)^2}{\sum a_i^2}.$$
For the caffeine data we have considered two contrasts a and b and in the practical we
find that the sums of squares associated with these contrasts adds to the treatment sum of
squares. We can further examine the relationship between these two contrasts. We find that
$\mathrm{Cov}[\hat\alpha_2 - \hat\alpha_3,\ 2\hat\alpha_1 - \hat\alpha_2 - \hat\alpha_3] = 0$. The two contrasts are independent. Thus we have what is
called an orthogonal decomposition of the treatment SS into two independent contrasts, each
testing a specific hypothesis of interest.
Two contrasts a and b are orthogonal if $\sum_{i=1}^{t}a_ib_i = 0$.
For two orthogonal contrasts the sums of squares are independent. This is a special case
of a more general theorem which we shall not prove.
As an example we will consider some data from an investigation of the effects of dust
on tuberculosis. Two laboratories each used 16 rats. The set of rats in each laboratory
was divided at random into two groups A and B. The animals in group A were kept in an
atmosphere containing a known percentage of dust while those in group B were kept in a
dust free atmosphere. After three months, the animals were killed and their lung weights
measured. The results were as follows
Laboratory 1 A 5.44 5.36 5.60 6.46 6.75 6.03 4.15 4.44
B 5.12 3.80 4.96 6.43 5.03 5.08 3.22 4.42
Laboratory 2 A 5.79 5.57 6.52 4.78 5.91 7.02 6.06 6.38
B 4.20 4.06 5.81 3.63 2.80 5.10 3.64 4.53
There are 4 treatments (t = 4) and 8 rats were given each treatment (r = 8).
The two treatment factors are
1. Laboratory (1 or 2)
2. Atmosphere (A, dusty, or B, dust free)
each with two levels. Hence there are four possible combinations of factors, giving 4 treatments. Thus the treatments have a 2 × 2 factorial structure.
1 : Lab 1 - Atmosphere A
2 : Lab 1 - Atmosphere B
3 : Lab 2 - Atmosphere A
4 : Lab 2 - Atmosphere B
We can look for a set of t − 1 = 3 mutually orthogonal contrasts. One possibility is
aT1 = ( 1 −1 0 0 ), the effect of atmosphere in lab 1,
aT2 = ( 0 0 1 −1 ), the effect of atmosphere in lab 2,
aT3 = ( 1 1 −1 −1 ), the overall effect of laboratory, called the main effect of the
factor laboratory.
These are three orthogonal contrasts so the sums of squares form an orthogonal decom-
position of the treatment sum of squares.
A further set of orthogonal contrasts consists of
bT1 = ( 1 1 −1 −1 ), the main effect of laboratory,
bT2 = ( 1 −1 1 −1 ), the main effect of atmosphere,
bT3 = ( 1 −1 −1 1 ), the interaction between the factors atmosphere and laboratory.
This is the difference in effects of atmosphere between the two labs or the difference in the
effect of laboratory between the two atmospheres.
Interaction is an effect which is not present when the two factors are considered separately;
it is a joint effect of the two factors. To test for interaction, the null hypothesis is
H0 : α1 − α2 − α3 + α4 = 0
i.e. α1 − α2 = α3 − α4 .
If H0 is rejected then we conclude that atmosphere and laboratory interact, i.e. they
cannot be considered separately.
With the special 2 × 2 factorial structure we can use R to find the contrast
sums of squares. (See also Practical 7.)
We see that only the main effect of atmosphere is significant. There is no evidence of an
interaction and also no evidence for a difference between the two laboratories.
1. The Least Significant Difference (LSD),
$$\mathrm{LSD} = t_\nu(\alpha/2)\,s\sqrt{2/r},$$
where $\nu = n - t$ and $s\sqrt{2/r}$ is the estimated standard error of the difference between two
treatment means.
This gives a lower bound for a significant difference between pairs, thus if a preplanned
pair of means differs by more than LSD then there is evidence to support the hypothesis
that the two treatment effects are different. Note that there are t(t − 1)/2 pairs of
treatments, and even if there are no differences in treatment effects, approximately 5%
of differences between pairs will exceed LSD. We should use LSD sparingly!
2. Tukey's T-Method
This method gives simultaneous 100(1 − α)% confidence intervals for all contrasts:
$$\sum_{i=1}^{t}a_i\bar y_{i\cdot} \pm s\,q_{[\alpha]}(t, \nu)\sum_{i=1}^{t}\frac{|a_i|}{2\sqrt{r_i}},$$
where $q_{[\alpha]}(t, \nu)$ is the α percentage point of the distribution of the 'studentised range'
$(\max_i\bar Y_{i\cdot} - \min_i\bar Y_{i\cdot})/\sqrt{s^2/r}$, tabulated in BOC pages 1002-3, where $\nu = n - t$ and t is the
number of treatments.
For paired comparisons with an equi-replicate design the two methods give us the following intervals:
$$\text{Tukey:}\quad \bar y_{i\cdot} - \bar y_{j\cdot} \pm \frac{s}{\sqrt r}\,q_{[\alpha]}(t, n-t)$$
$$\text{LSD:}\quad \bar y_{i\cdot} - \bar y_{j\cdot} \pm s\sqrt{\frac{2}{r}}\;t_{n-t}(\alpha/2).$$
A method of illustrating the comparisons is to write the means in order of magnitude on
a scaled line and underline pairs not significantly different from each other.
R gives the Tukey confidence intervals for all pairs of treatments. See Practical 6 for
sample output.
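A sketch of the Tukey intervals in R, continuing with the blood.aov object (our name) from the one-way analysis above:

TukeyHSD(blood.aov, conf.level = 0.95)  # simultaneous 95% intervals for all pairs of diets
plot(TukeyHSD(blood.aov))               # graphical display of the paired comparisons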
The randomised block model is
$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \qquad i = 1, \ldots, t,\ j = 1, \ldots, b,$$
where $\alpha_i$ is the effect of the ith treatment and $\beta_j$ the effect of the jth block. We assume each
treatment affects the mean response only (i.e. not the variance) and that the effect of a
particular treatment is the same within each block, so there is no treatment-block interaction.
To estimate the unknown parameters µ, $\alpha_i$ ($i = 1, \ldots, t$) and $\beta_j$ ($j = 1, \ldots, b$) by least
squares, we must minimise
$$S = \sum_{i=1}^{t}\sum_{j=1}^{b}(Y_{ij} - \mu - \alpha_i - \beta_j)^2.$$
Rather than writing the model as a general linear model, we shall differentiate S directly.
Setting $\partial S/\partial\mu = 0$ we see that
$$\sum_i\sum_j(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $G = \sum_i\sum_jY_{ij}$,
$$bt\hat\mu + b\sum_i\hat\alpha_i + t\sum_j\hat\beta_j = G. \qquad (5.3)$$
Similarly, setting $\partial S/\partial\alpha_i = 0$ we see that
$$\sum_j(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $T_i = \sum_jY_{ij}$,
$$b\hat\mu + b\hat\alpha_i + \sum_j\hat\beta_j = T_i, \qquad (5.4)$$
and setting $\partial S/\partial\beta_j = 0$ we see that
$$\sum_i(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $B_j = \sum_iY_{ij}$,
$$t\hat\mu + \sum_i\hat\alpha_i + t\hat\beta_j = B_j. \qquad (5.5)$$
Note that (5.3) is equal to both (5.4) summed over i and (5.5) summed over j. There are
thus t + b + 1 equations, but only t + b − 1 independent equations. So there are an infinite
number of solutions. We need two extra equations to have a unique solution.
We define
$$\bar y_{\cdot\cdot} = \frac{G}{tb}, \qquad \bar y_{i\cdot} = \frac{T_i}{b}, \qquad \bar y_{\cdot j} = \frac{B_j}{t}.$$
One possible solution is to assume that $\sum_{i=1}^{t}\hat\alpha_i = 0$ and $\sum_{j=1}^{b}\hat\beta_j = 0$, and hence
$$\hat\mu = \bar y_{\cdot\cdot},$$
α̂i = ȳi· − ȳ··
β̂j = ȳ·j − ȳ··
The fitted values (estimated means) are
$$\hat\mu + \hat\alpha_i + \hat\beta_j = \bar y_{i\cdot} + \bar y_{\cdot j} - \bar y_{\cdot\cdot}.$$
Another possible solution is to take $\hat\alpha_1 = \hat\beta_1 = 0$. Since the fitted values from any solution are always the same, it follows that
$$\hat\mu = \bar y_{1\cdot} + \bar y_{\cdot 1} - \bar y_{\cdot\cdot}$$
and hence
$$\hat\alpha_i = \bar y_{i\cdot} - \bar y_{1\cdot}, \quad i = 2, \ldots, t, \qquad \hat\beta_j = \bar y_{\cdot j} - \bar y_{\cdot 1}, \quad j = 2, \ldots, b.$$
For all solutions the residual sum of squares is
$$S_{\min} = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \hat\mu - \hat\alpha_i - \hat\beta_j)^2 = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar y_{i\cdot} - \bar y_{\cdot j} + \bar y_{\cdot\cdot})^2,$$
and it can be shown that
$$S_{y,y} = S_{\min} + b\sum_{i=1}^{t}(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + t\sum_{j=1}^{b}(\bar y_{\cdot j} - \bar y_{\cdot\cdot})^2,$$
or in words the total sum of squares is equal to the residual sum of squares plus the treatment
sum of squares plus the block sum of squares. This is the basic Analysis of Variance identity
for this model and may be set out in an analysis of variance table. For calculation purposes
we note that
$$S_{y,y} = \sum_i\sum_jy_{ij}^2 - \frac{G^2}{bt}, \qquad SS(T) = \frac{\sum_iT_i^2}{b} - \frac{G^2}{bt}, \qquad SS(B) = \frac{\sum_jB_j^2}{t} - \frac{G^2}{bt},$$
where SS(T) is the sum of squares due to treatments and SS(B) is the sum of squares due
to blocks. The residual sum of squares can be found by subtraction. Under the normality
assumption we can test the effects of treatments and blocks independently.
In order to test for no treatment differences,
$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_t,$$
we use the statistic
$$F = \frac{SS(T)/(t-1)}{S_{\min}/(t-1)(b-1)} \sim F^{t-1}_{(b-1)(t-1)}$$
under the null hypothesis $H_0$. Similarly, to test for block effects,
$$H_0': \beta_1 = \beta_2 = \cdots = \beta_b,$$
we use the statistic
$$F = \frac{SS(B)/(b-1)}{S_{\min}/(t-1)(b-1)} \sim F^{b-1}_{(b-1)(t-1)}$$
under the null hypothesis $H_0'$.
Orthogonal contrasts can be used as before for preplanned comparisons of both treat-
ments and blocks when either of these two tests show evidence for differences. Otherwise
use the methods of multiple comparisons (eg Tukey).
As an example consider a randomised block experiment using eight litters of five rats
each. This was conducted to test the effects of five diets on their weight gains in the four
weeks after weaning. Genetic variability in weight gains was thus eliminated by using litters
as blocks. The following weight gains were obtained:
Litter Diet (Treatment) Total
(Block) A B C D E
1 57.0 64.8 70.7 68.3 76.0 336.8
2 55.0 66.6 59.4 67.1 74.5 322.6
3 62.1 69.5 64.5 69.1 76.5 341.7
4 74.5 61.1 74.0 72.7 86.6 368.9
5 86.7 91.8 78.5 90.6 94.7 442.3
6 42.0 51.8 55.8 44.3 43.2 237.1
7 71.9 69.2 63.0 53.8 61.1 319.0
8 51.5 48.6 48.1 40.9 54.4 243.5
Total 500.7 523.4 514.0 506.8 567.0 2611.9
The sums of squares are given by
$$S_{y,y} = 57.0^2 + 64.8^2 + \cdots + 54.4^2 - (2611.9)^2/40 = 7584.1,$$
$$SS(B) = (336.8^2 + \cdots + 243.5^2)/5 - (2611.9)^2/40 = 6099.5,$$
$$SS(T) = (500.7^2 + \cdots + 567.0^2)/8 - (2611.9)^2/40 = 346.9,$$
so that the ANOVA table is
Source               d.f.   SS       MS      F       P
Litters (Blocks)      7     6099.5   871.4   21.46   P < .001
Diets (Treatments)    4      346.9    86.7    2.14   .05 < P < .10
Residual             28     1137.7    40.6
Total                39     7584.1
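A sketch of this randomised block analysis in R, entering the weight gains by rows of the table above (the object names are ours):

rats <- data.frame(
  gain = c(57.0, 64.8, 70.7, 68.3, 76.0,
           55.0, 66.6, 59.4, 67.1, 74.5,
           62.1, 69.5, 64.5, 69.1, 76.5,
           74.5, 61.1, 74.0, 72.7, 86.6,
           86.7, 91.8, 78.5, 90.6, 94.7,
           42.0, 51.8, 55.8, 44.3, 43.2,
           71.9, 69.2, 63.0, 53.8, 61.1,
           51.5, 48.6, 48.1, 40.9, 54.4),
  litter = factor(rep(1:8, each = 5)),
  diet   = factor(rep(c("A", "B", "C", "D", "E"), times = 8)))

rats.aov <- aov(gain ~ litter + diet, data = rats)
summary(rats.aov)   # blocks (litters) and treatments (diets) as in the ANOVA table above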
The two-way model with interaction is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \qquad i = 1, \ldots, a,\ j = 1, \ldots, b,\ k = 1, \ldots, r,$$
and we use the following notation:

              Overall    ith level of A   jth level of B   ith level of A + jth level of B
No. of obs.   n = abr    br               ar               r
Total         G          Ai               Bj               Tij
Mean          ȳ...       ȳi..             ȳ.j.             ȳij.

To find the least squares estimates we must minimise
$$S = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij})^2.$$
Setting $\partial S/\partial\mu = 0$ we see that
$$\sum_i\sum_j\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$abr\hat\mu + br\sum_i\hat\alpha_i + ar\sum_j\hat\beta_j + r\sum_i\sum_j\widehat{(\alpha\beta)}_{ij} = G. \qquad (5.6)$$
Similarly, setting $\partial S/\partial\alpha_i = 0$ we see that
$$\sum_j\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$br\hat\mu + br\hat\alpha_i + r\sum_j\hat\beta_j + r\sum_j\widehat{(\alpha\beta)}_{ij} = A_i, \qquad (5.7)$$
and setting $\partial S/\partial\beta_j = 0$ we see that
$$\sum_i\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$ar\hat\mu + r\sum_i\hat\alpha_i + ar\hat\beta_j + r\sum_i\widehat{(\alpha\beta)}_{ij} = B_j. \qquad (5.8)$$
Finally, setting $\partial S/\partial(\alpha\beta)_{ij} = 0$ we see that
$$\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$r\hat\mu + r\hat\alpha_i + r\hat\beta_j + r\widehat{(\alpha\beta)}_{ij} = T_{ij}. \qquad (5.9)$$
Since (5.6) $= \sum_i\sum_j$(5.9), (5.7) $= \sum_j$(5.9) and (5.8) $= \sum_i$(5.9), although there are
$1 + a + b + ab$ parameters there are only ab independent equations. We can solve the equations
uniquely by putting $\hat\mu = \hat\alpha_i = \hat\beta_j = 0$, which from (5.9) implies that $\widehat{(\alpha\beta)}_{ij} = T_{ij}/r = \bar y_{ij\cdot}$ for
$i = 1, \ldots, a$, $j = 1, \ldots, b$. This is just one solution, but every solution gives the same fitted
values, residuals and residual sum of squares.
The fitted values are $\hat\mu + \hat\alpha_i + \hat\beta_j + \widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot}$ for any unit assigned the ijth treatment,
i.e. $\bar y_{ij\cdot}$ is just the 'cell' mean. The residuals are $y_{ijk} - \bar y_{ij\cdot}$. Hence the residual sum of squares
is
$$S_{\min} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \bar y_{ij\cdot})^2.$$
This represents the sum of squares within the treatment groups. There are ab groups, each
with r observations, so $S_{\min}$ has $ab(r-1)$ degrees of freedom.
Another possible solution is obtained by requiring $\sum_i\hat\alpha_i = \sum_j\hat\beta_j = \sum_i\widehat{(\alpha\beta)}_{ij} = \sum_j\widehat{(\alpha\beta)}_{ij} = 0$, giving
$\hat\mu = \bar y_{\cdot\cdot\cdot}$, $\hat\alpha_i = \bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}$, $\hat\beta_j = \bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}$ and $\widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}$; again
$\hat\mu + \hat\alpha_i + \hat\beta_j + \widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot}$.
Hence, noting that the cross-product terms vanish, the ANOVA identity is
$$S_{y,y} = S_{\min} + r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2 + br\sum_{i=1}^{a}(\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2 + ar\sum_{j=1}^{b}(\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,$$
or in words, total sum of squares is equal to the residual sum of squares plus the interaction
sum of squares plus the sum of squares due to the main effect of factor A plus the sum of
squares due to the main effect of factor B.
In deciding whether we can simplify the model, we always begin by testing the interaction.
We test the null hypothesis of no interaction, $H_0: (\alpha\beta)_{ij} = 0$ for all i, j, using
$$F = \frac{SS(A*B)/(a-1)(b-1)}{s^2} \sim F^{(a-1)(b-1)}_{ab(r-1)} \quad\text{under } H_0,$$
where $s^2 = S_{\min}/ab(r-1)$.
If $H_0$ is rejected, finish the analysis presenting the table of means (after adequacy checks).
If $H_0$ is not rejected (i.e. the data show no evidence of interaction) test
$$H_0': \text{the } \alpha_i \text{ are equal (no main effect of factor A)}$$
and
$$H_0'': \text{the } \beta_j \text{ are equal (no main effect of factor B)}$$
using the statistics
$$F_1 = \frac{SS(A)/(a-1)}{s^2} \sim F^{a-1}_{ab(r-1)} \quad\text{under } H_0'$$
and
$$F_2 = \frac{SS(B)/(b-1)}{s^2} \sim F^{b-1}_{ab(r-1)} \quad\text{under } H_0''.$$
The total sum of squares may also be written as
$$S_{y,y} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \bar y_{ij\cdot})^2 + r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar y_{ij\cdot} - \bar y_{\cdot\cdot\cdot})^2,$$
which is the sum of the residual sum of squares and the treatment SS, on $ab(r-1)$ and $ab - 1$ degrees of freedom
respectively.
The results can be presented in an ANOVA table. For calculation purposes we proceed
as follows. First calculate the total sum of squares as $\sum_i\sum_j\sum_ky_{ijk}^2 - G^2/n$. Next calculate
the treatment sum of squares as if it were a one-way model, that is, ignoring the two-way
structure:
$$SS(T) = \frac{\sum_i\sum_jT_{ij}^2}{r} - \frac{G^2}{n}.$$
Next calculate the sum of squares due to factor A as
$$SS(A) = \frac{\sum_iA_i^2}{br} - \frac{G^2}{n}$$
and the sum of squares due to factor B as
$$SS(B) = \frac{\sum_jB_j^2}{ar} - \frac{G^2}{n}.$$
The interaction sum of squares is given by $SS(A*B) = SS(T) - SS(A) - SS(B)$. Finally
the residual sum of squares is given by the total sum of squares minus the treatment sum of
squares.
The ANOVA table is

Source               df              SS         MS                        F
Between treatments   ab − 1          SS(T)      SS(T)/(ab − 1)            MS(T)/s²
  Factor A           a − 1           SS(A)      SS(A)/(a − 1)             MS(A)/s²
  Factor B           b − 1           SS(B)      SS(B)/(b − 1)             MS(B)/s²
  Interaction        (a − 1)(b − 1)  SS(A*B)    SS(A*B)/[(a − 1)(b − 1)]  MS(A*B)/s²
Residual             ab(r − 1)       RSS        s²
Total                abr − 1         TSS
As an example we shall analyse the poisons data. Following the scheme for calculation
set out above, we find that the ANOVA table is given by

Source               df   SS       MS       F
Between treatments   11   2.2044   0.2004    9.01
  Antidotes           3   0.9212   0.3071   13.806
  Poisons             2   1.0330   0.5165   23.222
  Interaction         6   0.2502   0.0417    1.875
Residual             36   0.8007   0.0222
Total                47   3.0051

The interaction is not significant ($F^6_{36}(0.1) = 1.945$), so we go on to test the main effects of
Antidotes and Poisons. There is strong evidence (P < 0.001) for differences in both Poisons
and Antidotes. Further analysis could be by orthogonal contrasts or multiple comparisons.
The analysis for this set of data can be carried out in R.
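A sketch of the commands, assuming the data frame is called poisons with a response time and factors poison and antidote (the names are ours):

poisons.aov <- aov(time ~ poison * antidote, data = poisons)
summary(poisons.aov)                         # main effects and the poison:antidote interaction
model.tables(poisons.aov, type = "means")    # table of cell and marginal means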
(a) Show that
$$\sum_{i=1}^{n}(x_i - \bar x)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2/2n.$$
(b) In the equi-replicate one-way model to compare t treatments using r replicates per
treatment, show that the between-treatments mean square ($M_t$) and the residual mean
square ($s^2$) can be written in the forms
$$M_t = \frac{r}{2t(t-1)}\sum_{i=1}^{t}\sum_{j=1}^{t}(\bar Y_{i\cdot} - \bar Y_{j\cdot})^2, \qquad s^2 = \frac{1}{2tr(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}\sum_{k=1}^{r}(Y_{ij} - Y_{ik})^2.$$
Solution.
(a) We can write
63