3 Polynomial regression
3.1 Introduction
3.2 R example
3.3 Worked example
3.4 Further reading and exercises
4 Multiple Regression
4.1 Introduction
4.2 Estimation of β — the normal equations
4.3 Properties of Least Squares Estimators
4.4 Testing the Adequacy of Different Models
4.4.1 F-test for the Overall Significance of Regression
4.4.2 F-test for the deletion of a subset of variables
4.5 Prediction
4.6 Selecting a Regression Equation
4.6.1 Problems with choosing the incorrect regression equation
4.6.2 Automatic methods for selecting a regression model
4.6.3 Statistics for Evaluating Regressions
4.7 Outliers and influence
4.8 A check for independence
4.9 Worked examples
4.10 Further reading and exercises
Preface
In this course we are going to extend the ideas of Regression which you met in the first
year. We begin by describing some of the ideas behind any statistical analysis. We then
present a number of useful results. In Chapter 2 we re-examine simple linear regression and
prove the results about it that you have seen. We extend this to polynomial regression in
Chapter 3 and multiple regression in Chapter 4. In Chapter 5 we consider simple designed
experiments, including the analysis of the so-called one-way model, which is an extension of
the two-sample t-test you have met before. We also look at randomised block designs, which
are an extension of the matched-pairs t-test.
Additional reading and exercises are suggested in Linear Statistical Methods: An Applied
Approach, 2nd Edition, by B L Bowerman and R T O'Connell (PWS-Kent), hereafter referred
to as BOC. Other books are as follows:
D C Montgomery and E A Peck, Introduction to Linear Regression Analysis (2nd ed.), Wiley
J Neter, W Wasserman and M H Kutner, Applied Linear Statistical Models (3rd ed.), Irwin
D C Montgomery, Design and Analysis of Experiments (3rd ed.), Wiley
Chapter 1
In this chapter we are going to discuss some of the ideas of Statistical Modelling. We start
with a real life problem for which we have some data. We think of a statistical model as a
mathematical representation of the variables we have measured. This model usually involves
some parameters. We may then try to estimate the values of these parameters. We may
wish to test hypotheses about the parameters. We may wish to use the model to predict
what would happen in the future in a similar situation. In order to test hypotheses or make
predictions we usually have to make some assumptions. Part of the modelling process is to
test these assumptions. Having found an adequate model we must compare its predictions,
etc. with reality to check that it does seem to give reasonable answers.
We can illustrate these ideas using a simple example. Suppose that we are interested in
some items which are manufactured in batches. The size of the batch and the time to make
the batch in man hours are recorded, see Table 1.1. We can plot these points on a scatter
diagram, see Figure 1.1.
From the scatter diagram we can see that a straight line model would be appropriate,
Y = α + βx + error, where α represents the time to set up the equipment to manufacture
a batch and β represents the time taken to make each item. You know that the estimate of
this line is $\hat\alpha + \hat\beta x$, where
$$\hat\alpha = \bar y - \hat\beta\bar x, \qquad \hat\beta = \frac{S_{xy}}{S_{xx}}.$$
We shall derive these results in Chapter 2. It follows that for our data $\hat\alpha = 10$ and $\hat\beta = 2$.
How can we check the assumptions in our model? You have seen before that we can
calculate the residuals $e_i = y_i - (\hat\alpha + \hat\beta x_i)$. Plots of the residuals can be made to check
various assumptions. Note that we should not make predictions from our model outside the
range of data on which it is fitted. For example, in this case batches of size more than eighty
might cause the machinery to overheat and the straight line relationship between y and x
would no longer hold.
x (batch size) y (man-hours)
30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 69
70 148
60 132
Table 1.1: Data on batch size and time to make each batch
[Figure 1.1: scatter diagram of y (man-hours) against x (batch size).]
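The straight-line fit above can be reproduced in R; a minimal sketch using the data of Table 1.1 (the object names are ours):

# Batch-size data from Table 1.1
x <- c(30, 20, 60, 80, 40, 50, 60, 30, 70, 60)        # batch size
y <- c(73, 50, 128, 170, 87, 108, 135, 69, 148, 132)  # man-hours

plot(x, y)              # scatter diagram, cf. Figure 1.1
batch.fit <- lm(y ~ x)  # least squares fit of Y = alpha + beta*x + error
coef(batch.fit)         # gives alpha-hat = 10, beta-hat = 2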
Chapter 2
Yi = α + βxi + εi for i = 1, 2, . . . , n.
Note that the xi ’s are constants in the model, the values of which are either of a controlled
variable (e.g. batch size) or regarded as constant because they are measured with virtually
no error.
I shall follow, as far as possible, the convention that capital letters are used for random
variables and small letters for observed values of the random variable. Thus I write the
model above with a Yi denoting that it is a random variable. This rule is generally followed,
but an exception is made for quantities represented by Greek letters, where both the random
variable and its observed value are represented by the same small letters. Thus εi is a random
variable.
As we discussed in Chapter 1 we have to make some assumptions about the errors εi .
The minimum assumption necessary to justify use of the least squares estimates is that they
have expectation zero, constant variance σ 2 and that they are uncorrelated. In order to
make any statistical inferences about our model, we have to make an assumption about the
form of the distribution of the errors. Consequently we usually assume that the errors are
independent, normal random variables.
The two assumptions together imply that the errors are independent with distribution
εi ∼ N (0, σ 2 ).
Equivalently, using the model, we can say that the observations are independent with dis-
tribution
Yi ∼ N (α + βxi , σ 2 ). (2.1)
The least squares estimates of α and β are obtained by minimising the sum of squared deviations between the observations $y_i$
and the fitted straight line $\alpha + \beta x_i$, given by
$$S = \sum_{i=1}^{n}[y_i - \alpha - \beta x_i]^2.$$
To find the least squares estimators we must minimise S. To do this we find the solutions
to the equations
∂S/∂α = 0 and ∂S/∂β = 0.
Now
$$\frac{\partial S}{\partial\alpha} = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i), \qquad \frac{\partial S}{\partial\beta} = -2\sum_{i=1}^{n}x_i(y_i - \alpha - \beta x_i).$$
Setting these derivatives to zero gives
$$\sum y_i = n\hat\alpha + \hat\beta\sum x_i, \qquad (2.2)$$
$$\sum x_iy_i = \hat\alpha\sum x_i + \hat\beta\sum x_i^2. \qquad (2.3)$$
These equations are called the normal equations. They are two linear equations in two
unknowns and have a unique solution provided $\sum(x_i - \bar x)^2 > 0$. If we multiply (2.3) by $n$
and (2.2) by $\sum x_i$ and subtract we obtain the estimate for β as
$$\hat\beta = \frac{n\sum x_iy_i - \sum x_i\sum y_i}{n\sum x_i^2 - (\sum x_i)^2}. \qquad (2.4)$$
We may express (2.4) in other useful ways. For calculating $\hat\beta$ with a calculator we usually use
$$\hat\beta = \frac{\sum x_iy_i - (\sum x_i\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}.$$
The quantity S is the sum of squared distances about the line. The minimum value of S,
denoted by $S_{\min}$, is given by
$$S_{\min} = \sum_{i=1}^{n}(y_i - \hat\alpha - \hat\beta x_i)^2.$$
Substituting for $\hat\alpha = \bar y - \hat\beta\bar x$ gives
$$S_{\min} = \sum_{i=1}^{n}[(y_i - \bar y) - \hat\beta(x_i - \bar x)]^2 = \sum_{i=1}^{n}(y_i - \bar y)^2 - 2\hat\beta\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) + \hat\beta^2\sum_{i=1}^{n}(x_i - \bar x)^2.$$
Substituting for $\hat\beta = S_{xy}/S_{xx}$ gives
$$S_{\min} = S_{yy} - \frac{[S_{xy}]^2}{S_{xx}},$$
where $S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2$.
Note that in this definition $\hat\theta$ is a random variable. We must distinguish between $\hat\theta$ as an
estimate and $\hat\theta$ as an estimator: as a function of the observed values $y_i$ it is an estimate; as a
function of the random variables $Y_i$ it is an estimator.
The parameter estimator $\hat\beta$ can be written as
$$\hat\beta = \sum_{i=1}^{n}a_iY_i, \quad\text{where } a_i = \frac{x_i - \bar x}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{x_i - \bar x}{S_{xx}}. \qquad (2.5)$$
We have assumed that $Y_1, Y_2, \ldots, Y_n$ are normally distributed and hence, using the result that
any linear combination of normal random variables is also a normal random variable, $\hat\beta$ is also
normally distributed. We now derive the mean and variance of $\hat\beta$ using the representation
(2.5).
$$E[\hat\beta] = E\Big[\sum_{i=1}^{n}a_iY_i\Big] = \sum_{i=1}^{n}a_iE[Y_i] = \sum_{i=1}^{n}a_i(\alpha + \beta x_i) = \alpha\sum_{i=1}^{n}a_i + \beta\sum_{i=1}^{n}a_ix_i,$$
but $\sum a_i = 0$ and $\sum a_ix_i = 1$ since $\sum(x_i - \bar x)x_i = S_{xx}$, so $E[\hat\beta] = \beta$. Thus $\hat\beta$ is unbiased for β.
Also
$$\mathrm{Var}[\hat\beta] = \mathrm{Var}\Big[\sum_{i=1}^{n}a_iY_i\Big] = \sum_{i=1}^{n}a_i^2\,\mathrm{Var}[Y_i] \quad\text{since the $Y_i$'s are independent}$$
$$= \sum_{i=1}^{n}\sigma^2(x_i - \bar x)^2/[S_{xx}]^2 = \sigma^2/S_{xx}.$$
Note that
$$\hat\alpha = \bar Y - \hat\beta\bar x = \frac{1}{n}\sum_{i=1}^{n}Y_i - \bar x\sum_{i=1}^{n}a_iY_i = \sum_{i=1}^{n}\Big(\frac{1}{n} - a_i\bar x\Big)Y_i,$$
where $a_i = (x_i - \bar x)/S_{xx}$. Thus the parameter estimator $\hat\alpha = \bar Y - \hat\beta\bar x$ is also a linear
combination of the $Y_i$'s and hence $\hat\alpha$ is normally distributed.
We can also find the mean and variance of $\hat\alpha$.
$$E[\hat\alpha] = E[\bar Y] - E[\hat\beta\bar x] = \alpha + \beta\bar x - \bar xE[\hat\beta] = \alpha + \beta\bar x - \beta\bar x = \alpha.$$
Thus $\hat\alpha$ is also unbiased. Its variance is given by
$$\mathrm{Var}[\hat\alpha] = \mathrm{Var}\Big[\sum_{i=1}^{n}\Big(\frac{1}{n} - a_i\bar x\Big)Y_i\Big] = \sigma^2\sum_{i=1}^{n}\Big[\frac{1}{n^2} - \frac{2(x_i - \bar x)\bar x}{nS_{xx}} + \frac{(x_i - \bar x)^2\bar x^2}{S_{xx}^2}\Big]$$
$$= \sigma^2\Big[\frac{1}{n} - 0 + \frac{\bar x^2}{S_{xx}}\Big] = \sigma^2\Big[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\Big].$$
2.4 Fitted Values and Residuals
The fitted values $\hat\mu_i$, for $i = 1, 2, \ldots, n$, are the estimated values of $E[Y_i] = \alpha + \beta x_i$, so
$$\hat\mu_i = \hat\alpha + \hat\beta x_i.$$
The (crude) residuals are
$$e_i = Y_i - (\hat\alpha + \hat\beta x_i) = Y_i - \bar Y - \hat\beta(x_i - \bar x).$$
Note that
$$\sum_{i=1}^{n}e_i = \sum_{i=1}^{n}(Y_i - \bar Y) - \hat\beta\sum_{i=1}^{n}(x_i - \bar x) = 0$$
and
$$E[e_i] = E[Y_i - \hat\alpha - \hat\beta x_i] = E[Y_i] - E[\hat\alpha] - x_iE[\hat\beta] = \alpha + \beta x_i - \alpha - \beta x_i = 0.$$
Now we know that $\mathrm{Var}[\varepsilon_i] = \sigma^2$ and $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$, whereas it can be shown that
$\mathrm{Var}[e_i] = [1 - 1/n - (x_i - \bar x)^2/S_{xx}]\sigma^2$. So the crude residuals $e_i$ do not quite mimic
the properties of the $\varepsilon_i$.
Then a normalised residual bigger than 2 is surprising and a normalised residual bigger than
3 is very surprising, as we have assumed that the εi are normal with variance σ 2 .
4. i (serial order), e.g. time: a systematic pattern indicates serial correlation, which contradicts $\mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$.
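Such residual plots are easily produced in R; a sketch, continuing with the hypothetical batch.fit object:

r <- rstandard(batch.fit)          # standardised residuals
plot(fitted(batch.fit), r)         # check constant variance
plot(x, r)                         # check linearity in x
qqnorm(r); qqline(r)               # check normality
plot(seq_along(r), r, type = "b")  # check serial (time) order, as in item 4 above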
2.5 Estimating σ 2
To find $s^2$, an unbiased estimate of $\sigma^2$, we need to calculate $E[S_{\min}]$. Now
$$S_{\min} = \sum(Y_i - \hat\alpha - \hat\beta x_i)^2 = \sum e_i^2.$$
Thus
$$E[S_{\min}] = E\Big[\sum e_i^2\Big] = \sum E[e_i^2] = \sum\mathrm{Var}[e_i] \quad\text{since } E[e_i] = 0$$
$$= \sum\big[1 - 1/n - (x_i - \bar x)^2/S_{xx}\big]\sigma^2 = \sigma^2\Big[n - 1 - \sum(x_i - \bar x)^2/S_{xx}\Big] = \sigma^2(n - 2).$$
Thus we see that $S_{\min}/(n-2)$ is an unbiased estimate of $\sigma^2$. We therefore define the residual
mean square as
$$s^2 = \frac{S_{\min}}{n - 2}.$$
For calculation purposes we usually use
$$S_{\min} = S_{yy} - \frac{[S_{xy}]^2}{S_{xx}}.$$
This expresses the variability between the residuals as the variability among the original
observations less the variability of the fitted values. We can rewrite the expression as
$$S_{yy} = \hat\beta^2S_{xx} + S_{\min}$$
or
Total sum of squares is equal to the sum of squares due to the regression plus
the residual sum of squares.
The total sum of squares
$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar y)^2$$
is the residual sum of squares for the null model
$$Y_i = \mu + \varepsilon_i, \qquad i = 1, 2, \ldots, n.$$
Thus we are fitting the same constant to each observation. For the null model the fitted
values are $\hat\mu = \bar y$ and hence the residual sum of squares for this model is
$$\sum_{i=1}^{n}(y_i - \bar y)^2 = S_{yy}.$$
Similarly for α, it follows that
$$\frac{\hat\alpha - \alpha}{\sigma\big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{xx}}\big)^{1/2}} \sim N(0, 1).$$
In particular, to test $H_0: \beta = 0$ we use the statistic
$$\frac{\hat\beta}{s/[S_{xx}]^{1/2}},$$
which has a t distribution with n−2 degrees of freedom under $H_0$. Because of the relationship between
a t distribution and an F distribution, namely that if $T \sim t_\nu$ then $T^2 \sim F^1_\nu$, we also have
that
$$\frac{\hat\beta^2S_{xx}}{s^2} \sim F^1_{n-2}.$$
This may be rewritten as the ANOVA table:

Source       SS           d.f.    MS           F
Regression   β̂²Sxx        1       β̂²Sxx        β̂²Sxx/s²
Residual     Smin         n − 2   s²
Total        Syy          n − 1
2.7 Prediction
Suppose that we wish to predict a new value of Y for a given value of $x = x_0$, say. The linear
predictor is
$$\hat Y = \hat\alpha + \hat\beta x_0 = \bar Y + \hat\beta(x_0 - \bar x).$$
There are two cases.
1. We wish to predict the expected response (or average response) at $x_0$.
2. We wish to predict an individual response at $x_0$.
Consider first case 1. Now
$$E[\hat\alpha + \hat\beta x_0] = E[\hat\alpha] + x_0E[\hat\beta] = \alpha + \beta x_0.$$
Also
$$\mathrm{Var}[\hat\alpha + \hat\beta x_0] = \mathrm{Var}[\bar Y + \hat\beta(x_0 - \bar x)] = \mathrm{Var}[\bar Y] + 2\,\mathrm{Cov}[\bar Y, \hat\beta(x_0 - \bar x)] + (x_0 - \bar x)^2\mathrm{Var}[\hat\beta],$$
and since $\mathrm{Cov}[\bar Y, \hat\beta] = 0$, $\mathrm{Var}[\bar Y] = \sigma^2/n$ and $\mathrm{Var}[\hat\beta] = \sigma^2/S_{xx}$,
$$\hat\alpha + \hat\beta x_0 \sim N\!\left(\alpha + \beta x_0,\ \sigma^2\Big(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\Big)\right).$$
Therefore the 100(1 − α)% confidence interval for the predicted mean in case 1 is
$$\hat\alpha + \hat\beta x_0 \pm t_{n-2}\big(\tfrac{\alpha}{2}\big)\,s\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)^{1/2}.$$
For an individual in case 2 we have to take account of not only the variation of the
fitted line about the true line, but also the variation of the individual about the line. There
is therefore an extra variation of $\sigma^2$ to be taken into account. Therefore the 100(1 − α)%
confidence interval for an individual is
$$\hat\alpha + \hat\beta x_0 \pm t_{n-2}\big(\tfrac{\alpha}{2}\big)\,s\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)^{1/2}.$$
Predictions are more precise, i.e. the variance is smaller, the more of the following are true.
1. $s^2$ is small
2. $n$ is large
3. $(x_0 - \bar x)^2$ is small, i.e. $x_0$ is close to $\bar x$
4. $S_{xx}$ is large.
2.8 R example
A fire insurance company wants to relate the amount of fire damage in major residential fires
to the distance between the residence and the nearest fire station. The study was conducted
in a large suburb of a major city; a sample of fifteen recent fires in the suburb was selected.
The amount of damage, y, and the distance, x, between the fire and the nearest fire station
are given in the following table.
x y x y
3.4 26.2 2.6 19.6
1.8 17.8 4.3 31.3
4.6 31.3 2.1 24.0
2.3 23.1 1.1 17.3
3.1 27.5 6.1 43.2
5.5 36.0 4.8 36.4
0.7 14.1 3.8 26.1
3.0 22.3
These data were read into R and analysed using the regression commands. A preliminary
plot of the data indicated that a straight line model was appropriate. The output in the
session window was as follows.
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.4682 -1.4705 -0.1311 1.7915 3.3915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2779 1.4203 7.237 6.59e-06 ***
x 4.9193 0.3927 12.525 1.25e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
aov(formula = fire.fit)
Terms:
x Residuals
Sum of Squares 841.7664 69.7510
Deg. of Freedom 1 13
In this example the intercept or constant term represents the minimum damage caused
by a fire however quickly the fire brigade take to arrive. The x term represents the extra
damage caused by the fact that the distance from the nearest fire station is highly correlated
with the time it takes the fire brigade to arrive at the fire. We see from both the ANOVA
table and the T value for x that there is overwhelming evidence that the damage is related
to the distance from the fire station. Plots of standardised residuals versus fitted values
and x (not shown here) and the normal plot give us no cause to suspect that the model
assumptions are not reasonable.
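The intervals of Section 2.7 can be obtained from this fitted model with predict(); a sketch, assuming the fitted object is called fire.fit as in the output above (the new x value is an arbitrary illustration):

new <- data.frame(x = 3.5)                        # a distance within the range of the data
predict(fire.fit, new, interval = "confidence")   # interval for the mean damage at x = 3.5
predict(fire.fit, new, interval = "prediction")   # wider interval for an individual fire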
Solution. For the centred model $Y_i = \alpha + \beta(x_i - \bar x) + \varepsilon_i$:
(a) The normal equations are
$$\sum y_i = n\hat\alpha + \hat\beta\sum(x_i - \bar x), \qquad \sum y_i(x_i - \bar x) = \hat\alpha\sum(x_i - \bar x) + \hat\beta\sum(x_i - \bar x)^2,$$
which imply that $\hat\alpha = \bar y$, since $\sum(x_i - \bar x) = 0$, and $\hat\beta = S_{xy}/S_{xx}$.
(b) Now
$$E[\hat\alpha] = E\Big[\frac{1}{n}\sum Y_i\Big] = \frac{1}{n}\sum E[Y_i] = \frac{1}{n}\sum\{\alpha + \beta(x_i - \bar x)\} = \alpha$$
and
$$E[\hat\beta] = \frac{\sum(x_i - \bar x)E[Y_i]}{S_{xx}} = \frac{\sum(x_i - \bar x)\alpha}{S_{xx}} + \frac{\sum(x_i - \bar x)^2\beta}{S_{xx}} = \beta.$$
Also
$$\mathrm{Var}[\hat\alpha] = \mathrm{Var}\Big[\frac{1}{n}\sum Y_i\Big] = \frac{1}{n^2}\sum\mathrm{Var}[Y_i] = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$
and
$$\mathrm{Var}[\hat\beta] = \frac{\sum(x_i - \bar x)^2\mathrm{Var}[Y_i]}{S_{xx}^2} = \frac{\sum(x_i - \bar x)^2\sigma^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}.$$
An alternative way of solving this problem would be to write
Chapter 3
Polynomial regression
3.1 Introduction
As mentioned in Chapter 2, if the plot of standardised residuals versus x shows a systematic
pattern with, for example, mainly negative residuals for small and large values of x and
mainly positive residuals for moderate values of x this may indicate that higher powers of x
should be added to the model. We thus obtain a polynomial regression model. For example
a quadratic regression model is of the form
Yi = α + βxi + γx2i + εi , i = 1, . . . , n.
It may be best to centre the xi ’s by subtracting x̄, but this is not necessary. We may derive
the least squares estimates of α, β and γ as follows.
We have to minimise $S = \sum(y_i - \alpha - \beta x_i - \gamma x_i^2)^2$. If we differentiate S with respect to
α, β and γ and set the derivatives equal to zero, we see that the normal equations are
$$\sum y_i = n\hat\alpha + \hat\beta\sum x_i + \hat\gamma\sum x_i^2 \qquad (3.1)$$
$$\sum x_iy_i = \hat\alpha\sum x_i + \hat\beta\sum x_i^2 + \hat\gamma\sum x_i^3 \qquad (3.2)$$
$$\sum x_i^2y_i = \hat\alpha\sum x_i^2 + \hat\beta\sum x_i^3 + \hat\gamma\sum x_i^4. \qquad (3.3)$$
From (3.1),
$$\hat\alpha = \bar y - \hat\beta\bar x - \hat\gamma\,\frac{\sum x_i^2}{n}.$$
Substituting for $\hat\alpha$ in (3.2) and (3.3) we have
$$S_{xy} = \hat\beta S_{xx} + \hat\gamma S_{x,x^2}$$
$$S_{x^2,y} = \hat\beta S_{x,x^2} + \hat\gamma S_{x^2,x^2}$$
and hence
$$\hat\gamma = \frac{S_{x^2,y}S_{xx} - S_{xy}S_{x,x^2}}{S_{x^2,x^2}S_{xx} - S_{x,x^2}^2}.$$
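In R a quadratic regression can be fitted without forming these normal equations by hand; a minimal sketch with hypothetical vectors x and y:

quad.fit <- lm(y ~ x + I(x^2))   # gamma-hat is the coefficient of I(x^2)
summary(quad.fit)
# poly(x, 2) fits an equivalent model using orthogonal polynomials, which is one
# way of centring the x's to reduce the near-collinearity of x and x^2
quad.fit2 <- lm(y ~ poly(x, 2))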
3.2 R example
An analyst for a cafeteria chain wishes to investigate the relationship between the number
of self-service coffee dispensers in a cafeteria and sales of coffee. Fourteen cafeterias that
are similar in respects such as volume of business, type of clientele and location are chosen
for the experiment. The number of dispensers varies from zero (coffee is only dispensed by
a waitress) to six and is assigned randomly to each cafeteria. The results were as follows,
(sales are measured in hundreds of gallons of coffee sold).
The analyst believes that the relationship between sales and number of self-service dispensers
is quadratic in the range of observations. Sales should increase with the number of dispensers,
but if the space is cluttered with dispensers this increase slows down.
Fit a quadratic regression to these data and check the assumptions of the model. Try
also a cubic term in the model and test if this is necessary.
The data were read into R. The original variables were called ‘sales’ and ‘disp’. Variables
‘dispsq’ and ‘dispcub’ representing the square and cube of the number of dispensers were
created using the calculator. The output in the session window resulting from fitting the
quadratic model was as follows.
Call:
lm(formula = sales ~ disp + dispsq)
Residuals:
Min 1Q Median 3Q Max
-10.6643 -5.6622 -0.4655 5.5000 10.6679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 502.5560 4.8500 103.619 < 2e-16 ***
disp 80.3857 3.7861 21.232 2.81e-10 ***
dispsq -4.2488 0.6063 -7.008 2.25e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
aov(formula = disp.fit)
Terms:
disp dispsq Residuals
Sum of Squares 168740.64 3032.80 679.22
Deg. of Freedom 1 1 11
The plot of residuals versus fitted values indicates no evidence against homogeneity of
variance. The normal plot and test gives no evidence against the normality assumption. A
quadratic model fits very well. There is clear evidence (T = −7.01) that the quadratic term
is needed. The model should not, however, be used outside the range where it was fitted for
prediction purposes.
Call:
lm(formula = sales ~ disp + dispsq + dispcub)
Residuals:
Min 1Q Median 3Q Max
-12.7429 -3.8464 -0.2369 6.6857 8.4179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 500.3060 5.3586 93.365 4.86e-16 ***
disp 87.8857 8.4630 10.385 1.12e-06 ***
dispsq -7.6238 3.4590 -2.204 0.0521 .
dispcub 0.3750 0.3784 0.991 0.3450
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The cubic term is not needed in the model: it has t = 0.99 on 10 d.f., or equivalently F = 61/62 on 1 and 10 d.f.
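The same test of the cubic term can be carried out in R as a comparison of nested models; a sketch using the document's variable names (the object names quad.fit and cubic.fit are ours):

quad.fit  <- lm(sales ~ disp + dispsq)
cubic.fit <- lm(sales ~ disp + dispsq + dispcub)
anova(quad.fit, cubic.fit)   # F on 1 and 10 d.f., equal to the square of the t value above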
(a) Assuming a simple linear regression model find the least squares estimates of α and β.
(b) Find the standard errors of the estimates.
(c) Find a 95% confidence interval for the mean value of Y when x = 56.
(d) On analysing these data a researcher tried fitting successively the null model, a sim-
ple linear regression model and a model also including a quadratic term in x. The
resulting residual sums of squares were 23379, 952.25 and 821.46. Write down an ap-
propriate analysis of variance table and test the hypothesis that the quadratic term is
unnecessary.
$$\sum x_i^2 = 31488, \qquad \sum y_i^2 = 897639.$$
Thus we have
$$S_{xx} = 3840, \quad S_{yy} = 23378.917, \quad S_{xy} = 9280$$
and
$$\hat\beta = 2.4167, \qquad \hat\alpha = 153.917.$$
Source SS d.f. MS F
Linear Regression 22427. 1 22427. 245.71
Quadratic Regression 130.8 1 130.8 1.433
Residual 821.46 9 91.273
Total 23379. 11
From tables $F^1_9(0.1) = 3.36$, so there is no evidence against the hypothesis that the quadratic
term is not needed.
Chapter 4
Multiple Regression
4.1 Introduction
In this chapter we are going to study regression problems with several possible explanatory
variables.
Suppose we have some observations on a variable Y and two other variables x and z
which we think are related to Y . We could first fit a simple linear regression model to Y
and x,
Y = α + βx + ε.
Now suppose we plot the residuals against z, and suppose that there seems to be a linear
relation between them. This would suggest a model
Y = α + βx + γz + ε.
This is called a multiple regression model with two explanatory variables. If there were other
possible explanatory variables, we could again find the residuals
$$e_i = y_i - \hat\alpha - \hat\beta x_i - \hat\gamma z_i$$
and plot them against each new variable in turn. In general, with $p-1$ explanatory variables the multiple regression model is
$$Y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_{p-1}x_{p-1,i} + \varepsilon_i, \qquad i = 1, \ldots, n.$$
This model has p unknown parameters β0 , β1 , . . . , βp−1 . We make assumptions about the
errors similar to those for the simple linear regression model, namely that the εi are inde-
pendent and normally distributed with mean zero and constant variance σ 2 .
It is often convenient to write the multiple regression model in matrix form as
$$\begin{pmatrix}Y_1\\ Y_2\\ \vdots\\ Y_n\end{pmatrix} = \begin{pmatrix}1 & x_{11} & x_{21} & \cdots & x_{p-1,1}\\ 1 & x_{12} & x_{22} & \cdots & x_{p-1,2}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{1n} & x_{2n} & \cdots & x_{p-1,n}\end{pmatrix}\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}\end{pmatrix} + \begin{pmatrix}\varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\end{pmatrix}.$$
We often write this matrix formulation as
Y = Xβ + ε (4.1)
where Y is an (n × 1) vector of observations, X is an (n × p) matrix called the design
matrix, β is a (p × 1) vector of unknown parameters and ε is an (n × 1) vector of errors.
This formulation (4.1) is usually called the General Linear Model.
The ith row of X [call it xTi ] is the explanatory data for the ith observation, (i = 1, . . . , n).
The jth column of X is the vector of observations for the explanatory variable xj−1 , for
j = 1, . . . , p, (where x0 is the “constant explanatory variable” which is always equal to 1).
All the models we have considered so far can be written as a General Linear Model.
1. The null model (p = 1).
Yi = µ + εi for i = 1, . . . , n
is equivalent to
Y = 1µ + ε
where 1 is an (n × 1) vector of 1’s.
2. Simple linear regression (p = 2).
Yi = α + βxi + εi for i = 1, . . . , n
can be written as a General Linear Model with
$$X = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \qquad \beta = \begin{pmatrix}\alpha\\ \beta\end{pmatrix}.$$
4. Multiple regression with two explanatory variables (p = 3).
We shall only consider the case when the rank of X that is the number of linearly inde-
pendent columns is equal to p. This means that no column (i.e. no explanatory variable) can
be expressed as a linear combination of the other columns (i.e. other explanatory variables).
This condition is satisfied for many data sets in multiple regression problems.
The error assumptions given above can be written
$$E[\varepsilon_i] = 0, \quad \mathrm{Var}[\varepsilon_i] = \sigma^2 \text{ for } i = 1, \ldots, n, \quad \mathrm{Cov}[\varepsilon_i, \varepsilon_j] = 0 \text{ for } i \neq j.$$
Thus we obtain the following
E[Yi ] = xTi β
which can be written
E[Y ] = Xβ.
Also
$$\mathrm{Var}[Y_i] = \sigma^2 \text{ for } i = 1, \ldots, n \quad\text{and}\quad \mathrm{Cov}[Y_i, Y_j] = 0 \text{ for } i \neq j,$$
which can be written
$$\mathrm{Var}[Y] = \sigma^2I_n,$$
where $\mathrm{Var}[Y]$ is defined to be the matrix whose $(i,j)$th element is $\mathrm{Cov}[Y_i, Y_j]$. Under the normality assumption,
$$Y_i \sim N(x_i^T\beta, \sigma^2) \quad\text{for } i = 1, \ldots, n$$
and the $Y_i$'s are independent.
[Figure 4.1: geometric representation of least squares; the fitted value vector $\hat\mu = X\hat\beta$ is the orthogonal projection OQ of the observation vector onto the column space of X.]
The least squares estimate $\hat\beta$ minimises $S = (y - X\beta)^T(y - X\beta)$. Differentiating with respect to β, setting the derivative equal to zero and rearranging, we see that
$$X^Ty = X^TX\hat\beta.$$
This is a system of p equations in p unknowns, which are called the normal equations.
As rank(X) = p, it is possible to show that rank$(X^TX) = p$, so the (p × p) matrix
$X^TX$ is non-singular. It follows that the unique solution to the normal equations is given by
$$\hat\beta = (X^TX)^{-1}X^Ty.$$
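A sketch of this matrix computation in R, using small illustrative vectors of our own:

# Solve the normal equations t(X) %*% X %*% beta = t(X) %*% y directly
X <- cbind(1, x1 = c(1, 2, 3, 4, 5), x2 = c(2, 1, 4, 3, 6))   # example design matrix
y <- c(3, 4, 8, 9, 13)                                        # example response
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat
coef(lm(y ~ X[, -1]))   # lm() gives the same estimates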
As an example we shall show that we obtain the same estimates as in Chapter 2 for
simple linear regression if we use the matrix formulation.
$$X = \begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \qquad X^Ty = \begin{pmatrix}1 & 1 & \cdots & 1\\ x_1 & x_2 & \cdots & x_n\end{pmatrix}\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} = \begin{pmatrix}\sum y_i\\ \sum x_iy_i\end{pmatrix}, \qquad X^TX = \begin{pmatrix}n & \sum x_i\\ \sum x_i & \sum x_i^2\end{pmatrix}.$$
The determinant of $X^TX$ is given by
$$|X^TX| = n\sum x_i^2 - \Big(\sum x_i\Big)^2 = nS_{x,x},$$
so that
$$(X^TX)^{-1} = \frac{1}{nS_{x,x}}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix},$$
and hence $\hat\beta = (X^TX)^{-1}X^Ty$ gives $\hat\beta = (n\sum x_iy_i - \sum x_i\sum y_i)/(nS_{x,x})$ and $\hat\alpha = \bar y - \hat\beta\bar x$,
which is the same result as we obtained in Chapter 2.
The fitted values in Figure 4.1 are given by
$$\hat\mu = X\hat\beta = \vec{OQ}.$$
As before, the
total sum of squares is equal to sum of squares due to regression plus residual
sum of squares.
We can illustrate this for simple linear regression. The fitted values are
$$\hat\mu_i = x_i^T\hat\beta = (1\ \ x_i)\begin{pmatrix}\hat\alpha\\ \hat\beta\end{pmatrix} = \hat\alpha + \hat\beta x_i,$$
in agreement with our results in Chapter 2.
To summarise, for the General Linear Model
$$y = X\beta + \varepsilon$$
the least squares estimates are given by
$$\hat\beta = (X^TX)^{-1}X^Ty,$$
the fitted values by
$$\hat\mu = X\hat\beta,$$
the residuals by
$$e = y - \hat\mu$$
and the residual sum of squares by
$$S_{\min} = e^Te = y^Ty - \hat\beta^TX^Ty.$$
Theorem 4.1 The least squares estimator $\hat\beta$ is an unbiased estimator of β with variance-covariance matrix
$$\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}.$$
We can check the model assumptions by plotting the standardised residuals against:
1. Fitted values to check homogeneity of variance.
2. Normal order statistics to check normality.
3. Explanatory variables in the model to check for non-linear terms.
4. Possible explanatory variables not in the model to check if they might be included.
5. The serial order to check for correlated errors.
Note that s2 is unbiased only when the model is adequate, i.e. when all the relevant
explanatory variables are included – otherwise it is upwardly biased.
We can check that the theorems give us the same results as we found in Chapter 2 for
simple linear regression. We have seen earlier that
$$(X^TX)^{-1} = \frac{1}{nS_{x,x}}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix}.$$
Now, by Theorem 4.1, $\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$. Thus
$$\mathrm{Var}[\hat\alpha] = \frac{\sigma^2\sum x_i^2}{nS_{x,x}},$$
which, by writing $\sum x_i^2 = \sum x_i^2 - n\bar x^2 + n\bar x^2$, can be written as $\sigma^2\big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{x,x}}\big)$, as before. Also
$$\mathrm{Cov}[\hat\alpha, \hat\beta] = \frac{-\sigma^2\sum x_i}{nS_{x,x}} = \frac{-\sigma^2\bar x}{S_{x,x}}$$
and
$$\mathrm{Var}[\hat\beta] = \frac{\sigma^2}{S_{x,x}}.$$
The quantity $v_j$ is given by
$$v_j = x_j^T(X^TX)^{-1}x_j = \frac{1}{nS_{x,x}}\,(1,\ x_j)\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & n\end{pmatrix}\begin{pmatrix}1\\ x_j\end{pmatrix}.$$
We shall leave it as an exercise to show that this simplifies to
$$v_j = \frac{1}{n} + \frac{(x_j - \bar x)^2}{S_{x,x}}.$$
4.4 Testing the Adequacy of Different Models
4.4.1 F-test for the Overall Significance of Regression
Suppose we wish to test the hypothesis
H0 : β1 = β2 = . . . = βp−1 = 0
(i.e. all coefficients except β0 are zero) versus
H1 : at least one of the coefficients is non-zero.
Under $H_0$, the model reduces to the null model
$$Y = \mathbf{1}\mu + \varepsilon,$$
i.e. in testing $H_0$ we are asking if there is sufficient evidence to reject the null model.
Consider the statistic $F^\star$ defined by
$$F^\star = \frac{(\text{regression SS})/(p-1)}{s^2}, \qquad\text{where } s^2 = S_{\min}/(n-p),$$
and so
$$F^\star = \frac{\text{unbiased estimate of }\sigma^2\ \text{(if $H_0$ is true)}}{\text{unbiased estimate of }\sigma^2\ \text{(always)}}.$$
Thus if $H_0$ is true, $F^\star \approx 1$, and large values of $F^\star$ indicate departures from $H_0$. Under $H_0$,
$F^\star$ has an F distribution with degrees of freedom $p-1$ and $n-p$, and a test at the α level
for $H_0$ is given by rejecting $H_0$ (i.e. "the overall regression is significant") if
$$F^\star > F^{p-1}_{n-p}(\alpha).$$
The ANOVA (Analysis of Variance) table is

Source               d.f.    SS                       MS = SS/d.f.   F*
Overall regression   p − 1   SS(x1, x2, ..., xp−1)    SS/(p − 1)     MS/s²
Residual             n − p   Smin (RSS)               s²
Total                n − 1   S(y,y) (TSS)
Suppose instead we wish to test whether a subset $x_q, \ldots, x_{p-1}$ of the explanatory variables can be deleted from the model. The extra sum of squares due to these variables is
$$SS(x_q, \ldots, x_{p-1} \mid x_1, \ldots, x_{q-1}) = SS(x_1, \ldots, x_{p-1}) - SS(x_1, \ldots, x_{q-1}) = RSS_r - RSS_f,$$
the difference between the regression SS for the full model (f) and that for the reduced model (r), or equivalently between the RSS under the reduced model and the RSS under the full model.
To determine whether the change in sum of squares is significant, we must test the hypothesis
H0 : βq = βq+1 = . . . = βp−1 = 0
versus
H1 : at least one of these is non-zero.
It can be shown that, if $H_0$ is true,
$$F^\star = \frac{SS(x_q, \ldots, x_{p-1} \mid x_1, \ldots, x_{q-1})/(p-q)}{s^2} \sim F^{p-q}_{n-p}.$$
In the ANOVA table we use the notation xq , . . . , xp−1 |x1 , . . . , xq−1 to denote that this is the
effect of the variables xq , . . . , xp−1 given that the variables x1 , . . . , xq−1 are already included
in the model.
Note that the F-test for the inclusion of a single variable $x_{p-1}$ (this is the case q = p − 1)
can also be performed by an equivalent t-test where
$$t = \frac{\hat\beta_{p-1}}{\mathrm{se}(\hat\beta_{p-1})},$$
where $\mathrm{se}(\hat\beta_{p-1})$ is the estimated standard error of $\hat\beta_{p-1}$. We compare the value t with $t_{n-p}$
for a two-sided test of $H_0: \beta_{p-1} = 0$. In fact $F^\star = t^2$.
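In R the subset F-test is simply a comparison of the reduced and full models; a sketch with hypothetical model objects and variable names:

full    <- lm(y ~ x1 + x2 + x3 + x4)   # full model
reduced <- lm(y ~ x1 + x2)             # model with x3 and x4 deleted
anova(reduced, full)   # F = [(RSS_r - RSS_f)/(p - q)] / s^2 on (p - q, n - p) d.f.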
4.5 Prediction
As for simple linear regression, once the model is accepted we can use it to predict
1. Mean response,
2. Individual response.
Suppose we are trying to predict when the values of the explanatory variables are given by the vector $x_0$. The linear predictor is
$$\hat Y = x_0^T\hat\beta.$$
The predicted value $\hat Y$ has mean $x_0^T\beta$ and variance $\sigma^2x_0^T(X^TX)^{-1}x_0$, which is estimated
by $s^2x_0^T(X^TX)^{-1}x_0$.
Under the normality assumption, it can be shown that $\hat Y$ is normal and independent of
$s^2$. The interval
$$x_0^T\hat\beta \pm s\,t_{n-p}\big(\tfrac{\alpha}{2}\big)\,[x_0^T(X^TX)^{-1}x_0]^{1/2}$$
is a 100(1 − α)% confidence interval for the mean response. As before, when we are trying to
predict an individual observation there is greater uncertainty and the corresponding interval
in this case is
$$x_0^T\hat\beta \pm s\,t_{n-p}\big(\tfrac{\alpha}{2}\big)\,[1 + x_0^T(X^TX)^{-1}x_0]^{1/2}.$$
squares. A large model is expensive to monitor and use for prediction since a large number of
explanatory variables must be measured for each observation. If we fit too many variables the
model may be invalid in the desired range of prediction. The decrease in bias is insufficient
to compensate for the increase in variance (for estimation or prediction) causing an increase
in mean square error (MSE), which is defined by
$$\mathrm{MSE}[\hat\theta] = \mathrm{Var}[\hat\theta] + \big(\mathrm{bias}[\hat\theta]\big)^2.$$
One automatic method for selecting a regression model is backwards elimination, which proceeds as follows.
1. Fit the multiple regression model with all explanatory variables. (Note that, if some
variables are close to being linear combinations of others, a high level of computational
accuracy is required to ensure that X T X, a near singular matrix, can be inverted to
give sensible results.)
2. Calculate the F (or t) statistic for the exclusion of each explanatory variable in turn, and find the variable with the lowest value.
3. Compare the lowest value of the statistic with the predetermined significance level (say
α = 0.05) and omit it if necessary. (Note that the lowest value of the F statistic will
correspond with the variable with lowest value of |estimate|/standard error).
4. We now have a new model with one fewer variables than we started with. We return
to step 2 and calculate the statistic of the exclusion of each variable still in the model.
A procedure which works in the opposite way to backwards elimination is called Stepwise
regression or Modified ‘forward regression’.
1. Start with the null model containing no explanatory variables.
2. Introduce the explanatory variable with the highest F or t-value for inclusion.
3. After each inclusion, test the other variables in the current model to see whether they
can be omitted. Use say α = 0.10 for omission and α = 0.05 for inclusion.
A difficulty is that, at each stage, the estimate of σ 2 used may be upwardly biased and
hence not reliable. This can be overcome by using s2 based on all the variables, if this is
computationally feasible.
Another possibility is to leave out step 3 i.e. never omit a variable once it has been
included.
Backwards elimination and Stepwise regression are illustrated in a worked example in
Section 4.9.
1. Consider the residual mean square s². As the number of regressors increases, s² steadily
decreases to a relatively stable minimum and thereafter behaves irregularly. Thus a plot
of s² versus the number of unknown parameters may indicate the appropriate number
of explanatory variables.
2. The coefficient of determination
$$R^2 = 1 - S_{\min}/S_{(y,y)}$$
always increases as variables are added, so it cannot by itself choose between models. A modified version is
$$\bar R^2 = 1 - (n-1)\frac{s^2}{S_{(y,y)}}.$$
With this definition, an extra variable included only increases $\bar R^2$ if its F-value is
greater than 1. However maximising $\bar R^2$ is equivalent to minimising $s^2$, so it does not
add anything.
3. Mallow's $C_k$ statistic, $C_k = RSS_k/s^2 - (n - 2k)$, where $RSS_k$ is the residual sum of
squares of a submodel with k parameters and $s^2$ is the residual mean square from the
full model. If the submodel is adequate, $C_k$ should be close to k.
4. Compare these models using s2 and Mallow’s Ck .
5. Write a final verbal report drawing conclusions and making recommendations
Using a best subsets regression routine in R (for example the regsubsets function in the leaps package) it is possible to find the variables corresponding to the smallest residual sum of squares for 1, 2, ... regressors, together with the
values of s², Ck, R² and R̄². In deciding which model to adopt we do not necessarily take
the one with the minimum value of s² or Ck. Check whether Ck is close to k. Also fit the model
and see if some terms are not significant. The point is that there is really no right answer.
A number of different models will usually fit the data well and give similar predictions.
Deciding which to use is as much an art as a science. If in doubt it is probably better to
choose a model with fewer variables.
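One convenient automatic search in R is step(), which adds or drops terms using AIC rather than fixed F-to-enter and F-to-remove levels; a sketch with a hypothetical data frame and variable names:

dat  <- data.frame(y, x1, x2, x3, x4)              # hypothetical data frame
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
step(full, direction = "backward")                 # backwards elimination
step(lm(y ~ 1, data = dat), scope = ~ x1 + x2 + x3 + x4,
     direction = "both")                           # stepwise search from the null model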
The leverage of the ith observation is $v_i = x_i^T(X^TX)^{-1}x_i$. The leverages sum to p, so we would expect $v_i \approx p/n$. Hence if some $v_i$ are larger than $2p/n$ these observations
have quite high leverage, and if any $v_i$ are larger than $3p/n$ then they have very high leverage.
Example. Suppose we consider a simple linear regression model with $x_i = i$ for $i = 1, 2, \ldots, 10$. Then
$$X = \begin{pmatrix}1 & 1\\ 1 & 2\\ \vdots & \vdots\\ 1 & 10\end{pmatrix}, \qquad X^TX = \begin{pmatrix}10 & 55\\ 55 & 385\end{pmatrix}, \qquad (X^TX)^{-1} = \frac{1}{825}\begin{pmatrix}385 & -55\\ -55 & 10\end{pmatrix},$$
and
$$v_i = (385 - 110x_i + 10x_i^2)/825.$$
So for $i = 1, \ldots, 10$ the values of $v_i$ are 0.346, 0.248, 0.176, 0.127, 0.103, 0.103, 0.127, ..., 0.346. Now
suppose we move the point $x_{10}$ from 10 to 20. Then
$$X^TX = \begin{pmatrix}10 & 65\\ 65 & 685\end{pmatrix}, \qquad (X^TX)^{-1} = \frac{1}{2625}\begin{pmatrix}685 & -65\\ -65 & 10\end{pmatrix}.$$
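The leverages in this example can be verified in R; a small sketch:

x <- 1:10
X <- cbind(1, x)
v <- diag(X %*% solve(t(X) %*% X) %*% t(X))   # leverages v_i
round(v, 3)                                    # compare with the values computed above
# for a fitted lm object the same quantities are returned by hatvalues(fit)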
A measure of the influence of the ith observation is Cook's distance
$$D_i = \frac{(\hat\beta - \hat\beta_{(i)})^TX^TX(\hat\beta - \hat\beta_{(i)})}{p\,s^2},$$
where $\hat\beta_{(i)}$ is the estimate of β omitting the ith observation. This can be thought of as a scaled distance between $\hat\beta$ and $\hat\beta_{(i)}$. This would suggest that we need to fit n + 1 regressions, one
using the full data set and n using the reduced data sets. This is not the case,
since we can rewrite $D_i$ as
$$D_i = \frac{1}{p}\,\frac{v_i}{(1 - v_i)}\,t_i^2,$$
where $t_i = e_i/[s^2(1 - v_i)]^{1/2}$ is the ith standardised residual.
An observation with a large value of $D_i$ compared to the other observations is influential
in determining the parameters of the model. If this observation is omitted, the estimates of
the parameters will change greatly. To decide what is 'large', a useful cut-off point is the
50% point of an F distribution with p and n − p degrees of freedom, $F^p_{n-p}(0.50)$. This value can be found
using R.
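These quantities are available directly in R; a sketch for a hypothetical fitted object fit:

D  <- cooks.distance(fit)
ti <- rstandard(fit)                       # standardised residuals t_i
vi <- hatvalues(fit)                       # leverages v_i
p  <- length(coef(fit))
all.equal(D, ti^2 * vi / (p * (1 - vi)))   # checks the rewritten form of D_i
D[D > qf(0.5, p, df.residual(fit))]        # observations beyond the F cut-off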
The Durbin-Watson statistic is
$$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2},$$
where $e_1, \ldots, e_n$ are the time ordered crude residuals. To test the null hypothesis $H_0$ that the errors are independent versus
$H_1$ that the errors are positively autocorrelated the form of the test is as follows:
If d < dL,α reject H0 at the α level
If d > dU,α don’t reject H0 at the α level
If dL,α < d < dU,α the test is inconclusive.
The values dL,α and dU,α may be found in suitable tables for example in BOC.
Example. Consider the steam data of Practical 2. This is an example of a data set collected
in time order. If we fit the model with the explanatory variables X and FA, the value of the
Durbin-Watson statistic is 1.85. From the tables in BOC for n = 25, k − 1 = 2, dL,0.05 = 1.21
and dU,0.05 = 1.55, so we don't reject the hypothesis of independence at the 5% level.
To test H0 versus an alternative that the errors are negatively autocorrelated the form
of the test is:
If 4 − d < dL,α reject H0 at the α level
If 4 − d > dU,α don’t reject H0 at the α level
If dL,α < 4 − d < dU,α the test is inconclusive.
To test H0 versus a general alternative that the errors are autocorrelated the form of the test is:
If d < dL,α/2 or 4 − d < dL,α/2 reject H0 at the α level
If d > dU,α/2 or 4 − d > dU,α/2 don’t reject H0 at the α level
Otherwise the test is inconclusive.
Note that this test relies heavily on the underlying data being normally distributed. In
practice positive autocorrelation is much more common than negative autocorrelation.
For data which is collected in time order, for example many economic data sets, it is
important to check that the errors are not autocorrelated. Otherwise a large value of R2 and
significant regressors may be found when there is really no relationship.
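The Durbin-Watson test is available in R through add-on packages; a sketch assuming the lmtest package is installed and using hypothetical names for the steam data (the variable names X and FA are the document's):

library(lmtest)                              # assumed installed; provides dwtest()
steam.fit <- lm(Y ~ X + FA, data = steam)    # hypothetical response and data frame names
dwtest(steam.fit)                            # H1: positive autocorrelation
dwtest(steam.fit, alternative = "two.sided") # general alternative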
For the model $Y_i = \beta_1x_i + \beta_2x_i^2 + \varepsilon_i$ (no intercept), the design matrix and parameter vector are
$$X = \begin{pmatrix}x_1 & x_1^2\\ x_2 & x_2^2\\ \vdots & \vdots\\ x_n & x_n^2\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}.$$
It follows that
$$X^TX = \begin{pmatrix}\sum x^2 & \sum x^3\\ \sum x^3 & \sum x^4\end{pmatrix} \qquad\text{and}\qquad (X^TX)^{-1} = \frac{1}{\Delta}\begin{pmatrix}\sum x^4 & -\sum x^3\\ -\sum x^3 & \sum x^2\end{pmatrix},$$
where $\Delta = \sum x^2\sum x^4 - (\sum x^3)^2$. Also
$$X^TY = \begin{pmatrix}\sum xY\\ \sum x^2Y\end{pmatrix},$$
so that
$$\hat\beta = \begin{pmatrix}\hat\beta_1\\ \hat\beta_2\end{pmatrix} = (X^TX)^{-1}X^TY = \frac{1}{\Delta}\begin{pmatrix}\sum x^4\sum xY - \sum x^3\sum x^2Y\\ \sum x^2\sum x^2Y - \sum x^3\sum xY\end{pmatrix}.$$
We know that $\mathrm{Var}[\hat\beta] = \sigma^2(X^TX)^{-1}$, so
$$\mathrm{Var}[\hat\beta_1] = \sigma^2\sum x^4\big/\Delta \qquad\text{and}\qquad \mathrm{Cov}[\hat\beta_1, \hat\beta_2] = -\sigma^2\sum x^3\big/\Delta.$$
$$E[\hat\beta_1] = E\left[\frac{\sum x^4\sum xY - \sum x^3\sum x^2Y}{\Delta}\right] = \Big[\sum x^4\sum x(\beta_1x + \beta_2x^2) - \sum x^3\sum x^2(\beta_1x + \beta_2x^2)\Big]\Big/\Delta$$
$$= \Big[\beta_1\Big(\sum x^4\sum x^2 - \big(\sum x^3\big)^2\Big) + \beta_2\Big(\sum x^4\sum x^3 - \sum x^3\sum x^4\Big)\Big]\Big/\Delta = \beta_1.$$
(d) Compare the ‘best’ regressions on two and three variables using Mallow’s Ck statistic.
Solution. There are 54 observations and the full model has 4 explanatory variables, so
$$s^2 = \frac{0.1098}{49} = 0.00224.$$
(a) For backward elimination we test if we can exclude the variable whose exclusion in-
creases the residual sum of squares least. Thus the first candidate for exclusion
is X4 . The F value is (0.1099 − 0.1098)/s2 = 0.044 and, as this is less than 1,
we can certainly exclude X4 . Once X4 has been excluded then excluding the vari-
able X1 produces the smallest increase in residual sum of squares. The F value is
(0.7431 − 0.1099)/s2 = 282.6, so there is clear evidence against the hypothesis that X1
should be excluded and hence our final model is X1 + X2 + X3 .
(b) For stepwise regression we start with the null model and test if we should include the
variable whose inclusion reduces the residual sum of squares most.
The first candidate for inclusion is X4 . The F value is (3.973−1.878)/s2 = 934.9, so we
must include X4 . Next X3 produces the biggest reduction in residual sum of squares
from the model with just X4 ; the F value is (1.878 − 1.245)/s2 = 282.5, so we must
also include X3 . Then X2 produces the biggest reduction in residual sum of squares
from X3 + X4 ; the F value is (1.245 − 0.465)/s2 = 348.1, so we have to include X2 .
We must now consider whether, with X2 included, X3 or X4 could be omitted. The F
values are (1.392 − 0.465)/s2 = 413.7 and (0.7431 − 0.465)/s2 = 124.1, so we cannot
exclude either. Now can we include X1 ? The F value is (0.465 − 0.1098)/s2 = 158.5,
so we must include X1 . Now we can exclude X4 as for backward elimination but no
other variables, so the final model is X1 + X2 + X3 .
(c) If we use forward fitting but never exclude variables then our final model will be
X1 + X 2 + X3 + X 4 .
Although X4 is the single most informative variable, it is unnecessary when the others
are included in the model. This illustrates why forward fitting is not to be recom-
mended.
(d) The best model on two variables is X2 + X3, with k = 3. The value of Mallow's Ck is
given by
$$C_3 = \frac{0.7431}{s^2} - (54 - 6) = 283.6.$$
Similarly, for three variables the best model is X1 + X2 + X3, with
$$C_4 = \frac{0.1099}{s^2} - (54 - 8) = 3.04.$$
Clearly the second model is to be preferred.
Chapter 5
5.1 Introduction
In this chapter we discuss experiments whose main aim is to study and compare the effects
of treatments (diets, varieties, doses) by measuring response (yield, weight gain) on plots or
units (points, subjects, patients). In general the units are often grouped into blocks (groups,
sets, strata) of ‘similar’ plots or units.
For the analysis we use the general theory of the linear model (Chapter 4). We assume
that we have a response Y which is continuous (e.g. a normal random variable). We also
have possible explanatory variables which are discrete or qualitative called factors. Here we
consider treatment factors whose values are assigned by the experimenter. The values or
categories of a factor are called levels, thus the levels of a treatment factor are labels for the
different treatments.
The one-way model is
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, \ldots, t,\ \ j = 1, \ldots, r_i,$$
where $Y_{ij}$ is the response of the jth unit receiving the ith treatment, µ is an overall mean effect, αi is the effect due
to the ith treatment and εij is random error. We have the usual least-squares assumptions
that the εij ’s have zero mean and constant variance σ 2 and are uncorrelated. In addition,
we usually make the normality assumption that the εij ’s are independent normal random
variables. Together these imply that
Yij ∼ N (µ + αi , σ 2 )
and the Yij ’s are independent. We may note the following:
1. If treatments act multiplicatively on the response, so that Yij = µαi εij and the variance
of Yij is non-constant, then by taking logs of the response, we may obtain a valid one-
way model.
2. Units, even those given the same treatment, are assumed to be uncorrelated.
Writing the one-way model as a General Linear Model, we find
$$X^TX = \begin{pmatrix}n & r_1 & r_2 & \cdots & r_t\\ r_1 & r_1 & 0 & \cdots & 0\\ r_2 & 0 & r_2 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ r_t & 0 & 0 & \cdots & r_t\end{pmatrix}, \qquad X^Ty = \begin{pmatrix}\sum_i\sum_jy_{ij}\\ \sum_jy_{1j}\\ \sum_jy_{2j}\\ \vdots\\ \sum_jy_{tj}\end{pmatrix} = \begin{pmatrix}G\\ T_1\\ T_2\\ \vdots\\ T_t\end{pmatrix},$$
where G is the grand total and Ti is the total for the ith treatment.
The normal equations are therefore (note that we could also have derived the normal
equations by direct differentiation)
$$n\hat\mu + \sum_{i=1}^{t}r_i\hat\alpha_i = \sum_i\sum_jy_{ij} = G \qquad (5.1)$$
$$r_i\hat\mu + r_i\hat\alpha_i = \sum_jy_{ij} = T_i \quad\text{for } i = 1, 2, \ldots, t. \qquad (5.2)$$
The equations are not independent, since summing (5.2) over $i = 1, \ldots, t$ gives (5.1); so there are only t independent
equations for t + 1 unknowns and there are infinitely many solutions. Does this matter?
No, because the fitted values are always the same whatever solution is chosen, and hence the
residuals and residual sum of squares are the same.
In order to solve the equations we have to add another equation or constraint on the
estimates. We consider three main ways of doing this.
1. We take the estimate of µ equal to the overall mean,
$$\hat\mu = \bar y_{\cdot\cdot} = \frac{G}{n} = \frac{\sum\sum y_{ij}}{n}.$$
This implies that
$$\hat\alpha_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot},$$
where $\bar y_{i\cdot} = T_i/r_i$ and $\bar y_{\cdot\cdot} = G/n$.
2. We take the estimate of µ equal to zero, $\hat\mu = 0$. This implies that
$$\hat\alpha_i = T_i/r_i = \bar y_{i\cdot}.$$
3. We take $\hat\alpha_1 = 0$. It follows that $\hat\mu = \bar y_{1\cdot}$ and hence that $\hat\alpha_i = \bar y_{i\cdot} - \bar y_{1\cdot}$ for $i = 2, \ldots, t$.
For each of the three solutions the fitted values (estimated means) are
$$\hat\mu + \hat\alpha_i = \bar y_{i\cdot}$$
for a unit in the ith treatment group. So the fitted values are identical, as are the residuals,
which are observed value minus fitted value:
$$e_{ij} = y_{ij} - \bar y_{i\cdot},$$
and hence so are the residual sum of squares and $s^2$. The residual sum of squares is given by
$$S_{\min} = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2.$$
The total sum of squares
$$S_{y,y} = \sum_i\sum_j(y_{ij} - \bar y_{\cdot\cdot})^2$$
is the residual sum of squares after fitting the null model $Y_{ij} = \mu + \varepsilon_{ij}$, i.e. a model
with no treatment effects where $\alpha_1 = \alpha_2 = \cdots = \alpha_t$. It can be decomposed as
$$S_{y,y} = \sum_i\sum_j\big[(y_{ij} - \bar y_{i\cdot}) + (\bar y_{i\cdot} - \bar y_{\cdot\cdot})\big]^2 = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2 + \sum_ir_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2,$$
since $\sum_{j=1}^{r_i}(y_{ij} - \bar y_{i\cdot}) = 0$. Hence
$$S_{y,y} = S_{\min} + \text{Sum of Squares between treatments},$$
where $S_{\min} = \sum_i\sum_j(y_{ij} - \bar y_{i\cdot})^2$ is the residual sum of squares under the full model
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
and the Sum of Squares between treatments $= \sum_ir_i(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2$ is the difference in residual sums
of squares between the null model and the full model, i.e. the extra SS due to the treatment
effects, or the increase in residual sum of squares when we assume $H_0: \alpha_1 = \cdots = \alpha_t$ to be
true. This is the extra SS due to $H_0$. In order to test $H_0$, note that
$$E[S_{\min}] = E\Big[\sum_i\sum_j(Y_{ij} - \bar Y_{i\cdot})^2\Big] = \sum_iE\Big[\sum_j(Y_{ij} - \bar Y_{i\cdot})^2\Big] = \sum_i(r_i - 1)\sigma^2 = (n - t)\sigma^2,$$
and hence
$$s^2 = \frac{S_{\min}}{n - t}$$
is an unbiased estimate of $\sigma^2$. Note that the denominator is the number of observations
minus the number of independent parameters, or the number of observations minus the rank
of the design matrix.
Under the null model $Y_{ij} = \mu + \varepsilon_{ij}$, all treatments have the same effect. If $H_0$ is true,
$$E[S_{y,y}] = (n - 1)\sigma^2$$
and hence $E[\text{Treatment SS}] = E[S_{y,y}] - E[S_{\min}] = (t - 1)\sigma^2$. Hence if $H_0$ is true
$$\frac{\text{Treatment SS}}{t - 1}$$
is an unbiased estimate of $\sigma^2$. Under the normality assumption it can be shown that
$$\frac{\text{Treatment SS}}{\sigma^2} \sim \chi^2_{t-1}$$
under $H_0$, and
$$\frac{S_{\min}}{\sigma^2} \sim \chi^2_{n-t}$$
(whether $H_0$ is true or not), and that they are independent. Hence
$$F^\star = \frac{\text{Treatment SS}/(t-1)}{s^2} \sim F^{t-1}_{n-t}$$
under $H_0$. So we reject $H_0$ (i.e. conclude that there is sufficient evidence to suggest treatment
differences) if $F^\star > F^{t-1}_{n-t}(\alpha)$.
We can set out the calculations in an Analysis of Variance table. Note that, by writing
$\bar y_{i\cdot} = T_i/r_i$ and $\bar y_{\cdot\cdot} = G/n$ and expanding the square, it is straightforward to show that the
sum of squares between treatments can be written as $\sum_iT_i^2/r_i - G^2/n$.

Source       d.f.    SS                  MS           F
Treatments   t − 1   Σ Ti²/ri − G²/n     SS/(t − 1)   MS/s²
Residual     n − t   by subtraction      s²
Total        n − 1   Σij yij² − G²/n
Example. The following data (the blood data) give the responses of animals receiving four different diets:

DIET
A B C D
62 63 68 56
60 67 66 62
63 71 71 60
59 64 67 61
65 68 63
66 68 64
63
59
244 396 408 488
Solution. The grand total $G = 1536$ and $\sum y_{ij}^2 = 98644$. The ANOVA table for these data
is as follows.
Source d.f. SS MS F
Diets 3 228 76 13.6
Residual 20 112 5.6
Total 23 340
From tables we find $F^3_{20}(0.001) = 8.10$ and therefore there is very strong evidence for differences between the treatments, which in this case are the diets.
The following table gives the fitted values and residuals.
DIET
A B C D
61 +1 66 −3 68 +0 61 −5
61 −1 66 +1 68 −2 61 +1
61 +2 66 +5 68 +3 61 −1
61 −2 66 −2 68 −1 61 +0
66 −1 68 +0 61 +2
66 +0 68 +0 61 +3
61 +2
61 −2
Note that from the ANOVA table s2 = 5.6, so that s = 2.4. The largest absolute crude
residual is 5, just over 2s. In a sample of size 24 this is not at all surprising.
We could plot the residuals (or the standardised residuals) against both the fitted values
and the treatment group number to check for homogeneity of variance.
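A sketch of this one-way analysis in R, entering the diet data by hand (the object names are ours):

y    <- c(62, 60, 63, 59,                      # diet A
          63, 67, 71, 64, 65, 66,              # diet B
          68, 66, 71, 67, 68, 68,              # diet C
          56, 62, 60, 61, 63, 64, 63, 59)      # diet D
diet <- factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))

blood.aov <- aov(y ~ diet)
summary(blood.aov)                           # F = 13.6 on 3 and 20 d.f., as in the table above
plot(fitted(blood.aov), residuals(blood.aov))  # check homogeneity of variance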
As a more formal test we can use Bartlett’s test for Homogeneity of Variance. Bartlett’s
test is used to determine whether the variances within each treatment are in fact equal. It
assumes that the units allocated to different treatments are independent random samples
from normal populations, which is implied by the assumptions made earlier. Let $s_i^2$ be the
sample variance in the ith treatment group, so that
$$s_i^2 = \frac{\sum_j(y_{ij} - \bar y_{i\cdot})^2}{r_i - 1}.$$
The test statistic is
$$B = \frac{(n-t)\ln s^2 - \sum_i(r_i - 1)\ln s_i^2}{1 + \dfrac{1}{3(t-1)}\Big[\sum_i\dfrac{1}{r_i - 1} - \dfrac{1}{n-t}\Big]}.$$
For large n, under the null hypothesis $H_0$ of equal variances, B has approximately a $\chi^2_{t-1}$ distribution.
Hence we reject $H_0$ if $B > \chi^2_{t-1}(\alpha)$.
As an example we consider the blood data.
ȳi· 61 66 68 61
s2i 3.33 8.00 2.80 6.86
ri 4 6 6 8
Here n = 24 and t = 4. We find that B = 1.67. Comparing with the tables of $\chi^2_3$, we find
the P value lies between 0.5 and 0.75, so there is no evidence against $H_0$.
Strictly for the χ2 approximation to be appropriate, the ri should all be greater than or
equal to 5.
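Bartlett's test is built into R; a one-line sketch continuing with the y and diet objects above:

bartlett.test(y ~ diet)   # Bartlett's K-squared on 3 d.f.; compare with B = 1.67 above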
5.4 Testing specific comparisons
We now consider the problem of testing specific comparisons between treatments. These
comparisons must have been planned before the experiment was performed. If not use the
method(s) of multiple comparisons, see section 6.6.
For example, to compare diets A and B we test $H_0: \alpha_1 = \alpha_2$ using
$$t^\star = \frac{\bar y_{1\cdot} - \bar y_{2\cdot}}{s\,(1/r_1 + 1/r_2)^{1/2}} = \frac{61 - 66}{\sqrt{5.6(1/4 + 1/6)}} = -3.27.$$
Now $n - t = 20$ and hence the P value lies between 0.002 and 0.005. There is very strong
evidence against $H_0$.
In this example a 100(1 − α)% confidence interval for $\alpha_1 - \alpha_2$ is given by
$$-5 \pm t_{20}(\alpha/2)\sqrt{5.6(1/4 + 1/6)}.$$
To compare the fourth treatment with the average of the other three we test $H_0: \alpha_1 + \alpha_2 + \alpha_3 - 3\alpha_4 = 0$. Now
$$\mathrm{Var}[\bar Y_{1\cdot} + \bar Y_{2\cdot} + \bar Y_{3\cdot} - 3\bar Y_{4\cdot}] = \mathrm{Var}[\bar Y_{1\cdot}] + \mathrm{Var}[\bar Y_{2\cdot}] + \mathrm{Var}[\bar Y_{3\cdot}] + 9\,\mathrm{Var}[\bar Y_{4\cdot}] = \frac{\sigma^2}{r_1} + \frac{\sigma^2}{r_2} + \frac{\sigma^2}{r_3} + \frac{9\sigma^2}{r_4}.$$
Thus we see that the test statistic is given by
$$t^\star = \frac{\bar y_{1\cdot} + \bar y_{2\cdot} + \bar y_{3\cdot} - 3\bar y_{4\cdot}}{s\,(1/r_1 + 1/r_2 + 1/r_3 + 9/r_4)^{1/2}}.$$
If H0 is true, t? has a t-distribution with n − t degrees of freedom.
As an example suppose we wish to compare treatment D with the rest for the blood data.
We see that
$$t^\star = \frac{61 + 66 + 68 - 3\times 61}{\sqrt{5.6(1/4 + 1/6 + 1/6 + 9/8)}} = 3.88$$
and from tables the P value is less than 0.001. There is thus strong evidence for a difference.
Note that we cannot test the hypothesis that $\alpha_1 = 60$, say. But we could test the hypothesis
that $\mu + \alpha_1 = 60$, since $\hat\mu + \hat\alpha_1$ is the same whatever solution we choose.
A contrast is a linear combination $\sum a_i\alpha_i$ of the treatment effects whose coefficients satisfy $\sum a_i = 0$.
The coefficients can be applied to the treatment totals $T_i$, the means $\bar y_{i\cdot}$ or the estimates $\hat\alpha_i$ of the
treatment effects.
As an example we consider the Caffeine data — see practical 7. There are three treat-
ments (t = 3). For a paired comparison to test treatment 2 (100mg caffeine) against treat-
ment 3 (200mg caffeine)
aT = ( 0 1 −1 )
We estimate the average difference $(\alpha_2 - \alpha_3)$ in effect between treatments 2 and 3 by
$$\hat\alpha_2 - \hat\alpha_3 = \bar y_{2\cdot} - \bar y_{3\cdot}.$$
Note that
$$E\Big[\sum a_i\hat\alpha_i\Big] = \sum a_i\alpha_i,$$
so our estimate is unbiased, and
$$\mathrm{Var}\Big[\sum a_i\hat\alpha_i\Big] = \mathrm{Var}\Big[\sum a_i\bar y_{i\cdot}\Big] = \sum a_i^2\,\mathrm{Var}[\bar y_{i\cdot}] = \frac{\sigma^2}{r}\sum a_i^2.$$
Thus to test the null hypothesis $H_0: \alpha_2 - \alpha_3 = 0$, that there is no difference in effect between
the 100mg and 200mg doses, using the usual t-test we obtain
$$t^\star = \frac{(\hat\alpha_2 - \hat\alpha_3) - 0}{\sqrt{2s^2/r}} \sim t_{n-t} \quad\text{under } H_0,$$
or equivalently
$$F^\star = t^{\star2} = \frac{(\hat\alpha_2 - \hat\alpha_3)^2\,r/2}{s^2} \sim F^1_{n-t} \quad\text{under } H_0.$$
The numerator is called the SS due to the contrast a.
For a grouped comparison, to test the effect of treatment 1 (no caffeine) against the mean
effect of treatments 2 and 3 (with caffeine) we use
$$b^T = (\,2 \ \ -1 \ \ -1\,).$$
Thus
$$b^T\alpha = 2\alpha_1 - \alpha_2 - \alpha_3 = 2\Big(\alpha_1 - \frac{\alpha_2 + \alpha_3}{2}\Big),$$
which can be estimated by
$$b^T\hat\alpha = 2\hat\alpha_1 - \hat\alpha_2 - \hat\alpha_3 = 2\bar y_{1\cdot} - \bar y_{2\cdot} - \bar y_{3\cdot}.$$
In general the sum of squares due to a contrast a is
$$\frac{r\,(\sum a_i\hat\alpha_i)^2}{\sum a_i^2}.$$
For the caffeine data we have considered two contrasts a and b and in the practical we
find that the sums of squares associated with these contrasts adds to the treatment sum of
squares. We can further examine the relationship between these two contrasts. We find that
$\mathrm{Cov}[\hat\alpha_2 - \hat\alpha_3,\ 2\hat\alpha_1 - \hat\alpha_2 - \hat\alpha_3] = 0$. The two contrasts are independent. Thus we have what is
called an orthogonal decomposition of the treatment SS into two independent contrasts, each
testing a specific hypothesis of interest.
Two contrasts a and b are orthogonal if $\sum_{i=1}^{t}a_ib_i = 0$.
For two orthogonal contrasts the sums of squares are independent. This is a special case
of a more general theorem which we shall not prove.
As an example we will consider some data from an investigation of the effects of dust
on tuberculosis. Two laboratories each used 16 rats. The set of rats in each laboratory
was divided at random into two groups A and B. The animals in group A were kept in an
atmosphere containing a known percentage of dust while those in group B were kept in a
dust free atmosphere. After three months, the animals were killed and their lung weights
measured. The results were as follows
Laboratory 1 A 5.44 5.36 5.60 6.46 6.75 6.03 4.15 4.44
B 5.12 3.80 4.96 6.43 5.03 5.08 3.22 4.42
Laboratory 2 A 5.79 5.57 6.52 4.78 5.91 7.02 6.06 6.38
B 4.20 4.06 5.81 3.63 2.80 5.10 3.64 4.53
There are 4 treatments (t = 4) and 8 rats were given each treatment (r = 8).
The two treatment factors are
1. Laboratory (1 or 2)
2. Atmosphere (A, dusty, or B, dust free)
each with two levels. Hence there are four possible combinations of factors, giving 4 treatments. Thus the treatments have a 2 × 2 factorial structure.
1 : Lab 1 - Atmosphere A
2 : Lab 1 - Atmosphere B
3 : Lab 2 - Atmosphere A
4 : Lab 2 - Atmosphere B
We can look for a set of t − 1 = 3 mutually orthogonal contrasts. One possibility is
aT1 = ( 1 −1 0 0 ), the effect of atmosphere in lab 1,
aT2 = ( 0 0 1 −1 ), the effect of atmosphere in lab 2,
aT3 = ( 1 1 −1 −1 ), the overall effect of laboratory, called the main effect of the
factor laboratory.
These are three orthogonal contrasts so the sums of squares form an orthogonal decom-
position of the treatment sum of squares.
A further set of orthogonal contrasts consists of
bT1 = ( 1 1 −1 −1 ), the main effect of laboratory,
bT2 = ( 1 −1 1 −1 ), the main effect of atmosphere,
bT3 = ( 1 −1 −1 1 ), the interaction between the factors atmosphere and laboratory.
This is the difference in effects of atmosphere between the two labs or the difference in the
effect of laboratory between the two atmospheres.
Interaction is an effect which is not present when the two factors are considered separately;
it is a joint effect of the two factors. To test for interaction, the null hypothesis is
H0 : α1 − α2 − α3 + α4 = 0
i.e. α1 − α2 = α3 − α4 .
If H0 is rejected then we conclude that atmosphere and laboratory interact, i.e. they
cannot be considered separately.
With the special 2 × 2 factorial structure we can use R to find the contrast
sums of squares. (See also Practical 7.)
We see that only the main effect of atmosphere is significant. There is no evidence of an
interaction and also no evidence for a difference between the two laboratories.
1. The Least Significant Difference (LSD),
$$\mathrm{LSD} = t_\nu(\alpha/2)\,s\sqrt{2/r},$$
where $\nu = n - t$ and $s\sqrt{2/r}$ is the estimated standard error of the difference between two
treatment means.
This gives a lower bound for a significant difference between pairs, thus if a preplanned
pair of means differs by more than LSD then there is evidence to support the hypothesis
that the two treatment effects are different. Note that there are t(t − 1)/2 pairs of
treatments, and even if there are no differences in treatment effects, approximately 5%
of differences between pairs will exceed LSD. We should use LSD sparingly!
2. Tukey's T-Method
This method gives simultaneous 100(1 − α)% confidence intervals for all contrasts:
$$\sum_{i=1}^{t}a_i\bar y_{i\cdot} \pm s\,q_{[\alpha]}(t, \nu)\sum_{i=1}^{t}\frac{|a_i|}{2\sqrt{r_i}},$$
where $q_{[\alpha]}(t, \nu)$ is the α percentage point of the distribution of the 'studentised range'
$(\max_i\bar Y_{i\cdot} - \min_i\bar Y_{i\cdot})/\sqrt{s^2/r}$, tabulated in BOC pages 1002-3, where $\nu = n - t$ and t is the
number of treatments.
For paired comparisons with an equi-replicate design the two methods give us the following intervals:
$$\text{Tukey:}\quad \bar y_{i\cdot} - \bar y_{j\cdot} \pm \frac{s}{\sqrt r}\,q_{[\alpha]}(t, n-t)$$
$$\text{LSD:}\quad \bar y_{i\cdot} - \bar y_{j\cdot} \pm s\sqrt{\frac{2}{r}}\;t_{n-t}(\alpha/2).$$
A method of illustrating the comparisons is to write the means in order of magnitude on
a scaled line and underline pairs not significantly different from each other.
R gives the Tukey confidence intervals for all pairs of treatments. See Practical 6 for
sample output.
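A sketch of the Tukey intervals in R, continuing with the blood.aov object (our name) from the one-way analysis above:

TukeyHSD(blood.aov, conf.level = 0.95)  # simultaneous 95% intervals for all pairs of diets
plot(TukeyHSD(blood.aov))               # graphical display of the paired comparisons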
The randomised block model is
$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \qquad i = 1, \ldots, t,\ j = 1, \ldots, b,$$
where $\alpha_i$ is the effect of the ith treatment and $\beta_j$ the effect of the jth block. We assume each
treatment affects the mean response only (i.e. not the variance) and that the effect of a
particular treatment is the same within each block, so there is no treatment-block interaction.
To estimate the unknown parameters µ, $\alpha_i$ ($i = 1, \ldots, t$) and $\beta_j$ ($j = 1, \ldots, b$) by least
squares, we must minimise
$$S = \sum_{i=1}^{t}\sum_{j=1}^{b}(Y_{ij} - \mu - \alpha_i - \beta_j)^2.$$
Rather than writing the model as a general linear model, we shall differentiate S directly.
Setting $\partial S/\partial\mu = 0$ we see that
$$\sum_i\sum_j(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $G = \sum_i\sum_jY_{ij}$,
$$bt\hat\mu + b\sum_i\hat\alpha_i + t\sum_j\hat\beta_j = G. \qquad (5.3)$$
Similarly, setting $\partial S/\partial\alpha_i = 0$ we see that
$$\sum_j(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $T_i = \sum_jY_{ij}$,
$$b\hat\mu + b\hat\alpha_i + \sum_j\hat\beta_j = T_i, \qquad (5.4)$$
and setting $\partial S/\partial\beta_j = 0$ we see that
$$\sum_i(Y_{ij} - \mu - \alpha_i - \beta_j) = 0,$$
so that, writing $B_j = \sum_iY_{ij}$,
$$t\hat\mu + \sum_i\hat\alpha_i + t\hat\beta_j = B_j. \qquad (5.5)$$
Note that (5.3) is equal to both (5.4) summed over i and (5.5) summed over j. There are
thus t + b + 1 equations, but only t + b − 1 independent equations. So there are an infinite
number of solutions. We need two extra equations to have a unique solution.
We define
$$\bar y_{\cdot\cdot} = \frac{G}{tb}, \qquad \bar y_{i\cdot} = \frac{T_i}{b}, \qquad \bar y_{\cdot j} = \frac{B_j}{t}.$$
One possible solution is to assume that $\sum_{i=1}^{t}\hat\alpha_i = 0$ and $\sum_{j=1}^{b}\hat\beta_j = 0$, and hence
$$\hat\mu = \bar y_{\cdot\cdot},$$
α̂i = ȳi· − ȳ··
β̂j = ȳ·j − ȳ··
The fitted values (estimated means) are
$$\hat\mu + \hat\alpha_i + \hat\beta_j = \bar y_{i\cdot} + \bar y_{\cdot j} - \bar y_{\cdot\cdot}.$$
Another possible solution is to take $\hat\alpha_1 = \hat\beta_1 = 0$. Since the fitted values from any solution are always the same, it follows that
$$\hat\mu = \bar y_{1\cdot} + \bar y_{\cdot 1} - \bar y_{\cdot\cdot}$$
and hence
$$\hat\alpha_i = \bar y_{i\cdot} - \bar y_{1\cdot}, \quad i = 2, \ldots, t, \qquad \hat\beta_j = \bar y_{\cdot j} - \bar y_{\cdot 1}, \quad j = 2, \ldots, b.$$
For all solutions the residual sum of squares is
$$S_{\min} = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \hat\mu - \hat\alpha_i - \hat\beta_j)^2 = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar y_{i\cdot} - \bar y_{\cdot j} + \bar y_{\cdot\cdot})^2,$$
and it can be shown that
$$S_{y,y} = S_{\min} + b\sum_{i=1}^{t}(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 + t\sum_{j=1}^{b}(\bar y_{\cdot j} - \bar y_{\cdot\cdot})^2,$$
or in words the total sum of squares is equal to the residual sum of squares plus the treatment
sum of squares plus the block sum of squares. This is the basic Analysis of Variance identity
for this model and may be set out in an analysis of variance table. For calculation purposes
we note that
$$S_{y,y} = \sum_i\sum_jy_{ij}^2 - \frac{G^2}{bt}, \qquad SS(T) = \frac{\sum_iT_i^2}{b} - \frac{G^2}{bt}, \qquad SS(B) = \frac{\sum_jB_j^2}{t} - \frac{G^2}{bt},$$
where SS(T) is the sum of squares due to treatments and SS(B) is the sum of squares due
to blocks. The residual sum of squares can be found by subtraction. Under the normality
assumption we can test the effects of treatments and blocks independently.
In order to test for no treatment differences,
$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_t,$$
we use the statistic
$$F = \frac{SS(T)/(t-1)}{S_{\min}/(t-1)(b-1)} \sim F^{t-1}_{(b-1)(t-1)}$$
under the null hypothesis $H_0$. Similarly, to test for block effects,
$$H_0': \beta_1 = \beta_2 = \cdots = \beta_b,$$
we use the statistic
$$F = \frac{SS(B)/(b-1)}{S_{\min}/(t-1)(b-1)} \sim F^{b-1}_{(b-1)(t-1)}$$
under the null hypothesis $H_0'$.
Orthogonal contrasts can be used as before for preplanned comparisons of both treat-
ments and blocks when either of these two tests show evidence for differences. Otherwise
use the methods of multiple comparisons (eg Tukey).
As an example consider a randomised block experiment using eight litters of five rats
each. This was conducted to test the effects of five diets on their weight gains in the four
weeks after weaning. Genetic variability in weight gains was thus eliminated by using litters
as blocks. The following weight gains were obtained:
Litter Diet (Treatment) Total
(Block) A B C D E
1 57.0 64.8 70.7 68.3 76.0 336.8
2 55.0 66.6 59.4 67.1 74.5 322.6
3 62.1 69.5 64.5 69.1 76.5 341.7
4 74.5 61.1 74.0 72.7 86.6 368.9
5 86.7 91.8 78.5 90.6 94.7 442.3
6 42.0 51.8 55.8 44.3 43.2 237.1
7 71.9 69.2 63.0 53.8 61.1 319.0
8 51.5 48.6 48.1 40.9 54.4 243.5
Total 500.7 523.4 514.0 506.8 567.0 2611.9
The sums of squares are given by
$$S_{y,y} = 57.0^2 + 64.8^2 + \cdots + 54.4^2 - (2611.9)^2/40 = 7584.1,$$
$$SS(B) = (336.8^2 + \cdots + 243.5^2)/5 - (2611.9)^2/40 = 6099.5,$$
$$SS(T) = (500.7^2 + \cdots + 567.0^2)/8 - (2611.9)^2/40 = 346.9,$$
so that the ANOVA table is
Source               d.f.   SS       MS      F       P
Litters (Blocks)      7     6099.5   871.4   21.46   P < .001
Diets (Treatments)    4      346.9    86.7    2.14   .05 < P < .10
Residual             28     1137.7    40.6
Total                39     7584.1
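A sketch of this randomised block analysis in R, entering the weight gains by rows of the table above (the object names are ours):

rats <- data.frame(
  gain = c(57.0, 64.8, 70.7, 68.3, 76.0,
           55.0, 66.6, 59.4, 67.1, 74.5,
           62.1, 69.5, 64.5, 69.1, 76.5,
           74.5, 61.1, 74.0, 72.7, 86.6,
           86.7, 91.8, 78.5, 90.6, 94.7,
           42.0, 51.8, 55.8, 44.3, 43.2,
           71.9, 69.2, 63.0, 53.8, 61.1,
           51.5, 48.6, 48.1, 40.9, 54.4),
  litter = factor(rep(1:8, each = 5)),
  diet   = factor(rep(c("A", "B", "C", "D", "E"), times = 8)))

rats.aov <- aov(gain ~ litter + diet, data = rats)
summary(rats.aov)   # blocks (litters) and treatments (diets) as in the ANOVA table above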
The two-way model with interaction is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \qquad i = 1, \ldots, a,\ j = 1, \ldots, b,\ k = 1, \ldots, r,$$
and we use the following notation:

              Overall    ith level of A   jth level of B   ith level of A + jth level of B
No. of obs.   n = abr    br               ar               r
Total         G          Ai               Bj               Tij
Mean          ȳ...       ȳi..             ȳ.j.             ȳij.

To find the least squares estimates we must minimise
$$S = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij})^2.$$
Setting $\partial S/\partial\mu = 0$ we see that
$$\sum_i\sum_j\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$abr\hat\mu + br\sum_i\hat\alpha_i + ar\sum_j\hat\beta_j + r\sum_i\sum_j\widehat{(\alpha\beta)}_{ij} = G. \qquad (5.6)$$
Similarly, setting $\partial S/\partial\alpha_i = 0$ we see that
$$\sum_j\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$br\hat\mu + br\hat\alpha_i + r\sum_j\hat\beta_j + r\sum_j\widehat{(\alpha\beta)}_{ij} = A_i, \qquad (5.7)$$
and setting $\partial S/\partial\beta_j = 0$ we see that
$$\sum_i\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$ar\hat\mu + r\sum_i\hat\alpha_i + ar\hat\beta_j + r\sum_i\widehat{(\alpha\beta)}_{ij} = B_j. \qquad (5.8)$$
Finally, setting $\partial S/\partial(\alpha\beta)_{ij} = 0$ we see that
$$\sum_k(y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}) = 0,$$
so that
$$r\hat\mu + r\hat\alpha_i + r\hat\beta_j + r\widehat{(\alpha\beta)}_{ij} = T_{ij}. \qquad (5.9)$$
Since (5.6) $= \sum_i\sum_j$(5.9), (5.7) $= \sum_j$(5.9) and (5.8) $= \sum_i$(5.9), although there are
$1 + a + b + ab$ parameters there are only ab independent equations. We can solve the equations
uniquely by putting $\hat\mu = \hat\alpha_i = \hat\beta_j = 0$, which from (5.9) implies that $\widehat{(\alpha\beta)}_{ij} = T_{ij}/r = \bar y_{ij\cdot}$ for
$i = 1, \ldots, a$, $j = 1, \ldots, b$. This is just one solution, but every solution gives the same fitted
values, residuals and residual sum of squares.
The fitted values are $\hat\mu + \hat\alpha_i + \hat\beta_j + \widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot}$ for any unit assigned the ijth treatment,
i.e. $\bar y_{ij\cdot}$ is just the 'cell' mean. The residuals are $y_{ijk} - \bar y_{ij\cdot}$. Hence the residual sum of squares
is
$$S_{\min} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \bar y_{ij\cdot})^2.$$
This represents the sum of squares within the treatment groups. There are ab groups, each
with r observations, so $S_{\min}$ has $ab(r-1)$ degrees of freedom.
Another possible solution is obtained by requiring $\sum_i\hat\alpha_i = \sum_j\hat\beta_j = \sum_i\widehat{(\alpha\beta)}_{ij} = \sum_j\widehat{(\alpha\beta)}_{ij} = 0$, giving
$\hat\mu = \bar y_{\cdot\cdot\cdot}$, $\hat\alpha_i = \bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot}$, $\hat\beta_j = \bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot}$ and $\widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot}$; again
$\hat\mu + \hat\alpha_i + \hat\beta_j + \widehat{(\alpha\beta)}_{ij} = \bar y_{ij\cdot}$.
Hence, noting that the cross-product terms vanish, the ANOVA identity is
$$S_{y,y} = S_{\min} + r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar y_{ij\cdot} - \bar y_{i\cdot\cdot} - \bar y_{\cdot j\cdot} + \bar y_{\cdot\cdot\cdot})^2 + br\sum_{i=1}^{a}(\bar y_{i\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2 + ar\sum_{j=1}^{b}(\bar y_{\cdot j\cdot} - \bar y_{\cdot\cdot\cdot})^2,$$
or in words, total sum of squares is equal to the residual sum of squares plus the interaction
sum of squares plus the sum of squares due to the main effect of factor A plus the sum of
squares due to the main effect of factor B.
In deciding whether we can simplify the model, we always begin by testing the interaction.
We test the null hypothesis of no interaction, $H_0: (\alpha\beta)_{ij} = 0$ for all i, j, using
$$F = \frac{SS(A*B)/(a-1)(b-1)}{s^2} \sim F^{(a-1)(b-1)}_{ab(r-1)} \quad\text{under } H_0,$$
where $s^2 = S_{\min}/ab(r-1)$.
If $H_0$ is rejected, finish the analysis presenting the table of means (after adequacy checks).
If $H_0$ is not rejected (i.e. the data show no evidence of interaction) test
$$H_0': \text{the } \alpha_i \text{ are equal (no main effect of factor A)}$$
and
$$H_0'': \text{the } \beta_j \text{ are equal (no main effect of factor B)}$$
using the statistics
$$F_1 = \frac{SS(A)/(a-1)}{s^2} \sim F^{a-1}_{ab(r-1)} \quad\text{under } H_0'$$
and
$$F_2 = \frac{SS(B)/(b-1)}{s^2} \sim F^{b-1}_{ab(r-1)} \quad\text{under } H_0''.$$
The total sum of squares may also be written as
$$S_{y,y} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}(y_{ijk} - \bar y_{ij\cdot})^2 + r\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar y_{ij\cdot} - \bar y_{\cdot\cdot\cdot})^2,$$
which is the sum of the residual sum of squares and the treatment SS, on $ab(r-1)$ and $ab - 1$ degrees of freedom
respectively.
The results can be presented in an ANOVA table. For calculation purposes we proceed
as follows. First calculate the total sum of squares as $\sum_i\sum_j\sum_ky_{ijk}^2 - G^2/n$. Next calculate
the treatment sum of squares as if it were a one-way model, that is, ignoring the two-way
structure:
$$SS(T) = \frac{\sum_i\sum_jT_{ij}^2}{r} - \frac{G^2}{n}.$$
Next calculate the sum of squares due to factor A as
$$SS(A) = \frac{\sum_iA_i^2}{br} - \frac{G^2}{n}$$
and the sum of squares due to factor B as
$$SS(B) = \frac{\sum_jB_j^2}{ar} - \frac{G^2}{n}.$$
The interaction sum of squares is given by $SS(A*B) = SS(T) - SS(A) - SS(B)$. Finally
the residual sum of squares is given by the total sum of squares minus the treatment sum of
squares.
The ANOVA table is

Source               df              SS         MS                        F
Between treatments   ab − 1          SS(T)      SS(T)/(ab − 1)            MS(T)/s²
  Factor A           a − 1           SS(A)      SS(A)/(a − 1)             MS(A)/s²
  Factor B           b − 1           SS(B)      SS(B)/(b − 1)             MS(B)/s²
  Interaction        (a − 1)(b − 1)  SS(A*B)    SS(A*B)/[(a − 1)(b − 1)]  MS(A*B)/s²
Residual             ab(r − 1)       RSS        s²
Total                abr − 1         TSS
As an example we shall analyse the poisons data. Following the scheme for calculation
set out above, we find that the ANOVA table is given by

Source               df   SS       MS       F
Between treatments   11   2.2044   0.2004    9.01
  Antidotes           3   0.9212   0.3071   13.806
  Poisons             2   1.0330   0.5165   23.222
  Interaction         6   0.2502   0.0417    1.875
Residual             36   0.8007   0.0222
Total                47   3.0051

The interaction is not significant ($F^6_{36}(0.1) = 1.945$), so we go on to test the main effects of
Antidotes and Poisons. There is strong evidence (P < 0.001) for differences in both Poisons
and Antidotes. Further analysis could be by orthogonal contrasts or multiple comparisons.
The analysis for this set of data can be carried out in R.
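A sketch of the commands, assuming the data frame is called poisons with a response time and factors poison and antidote (the names are ours):

poisons.aov <- aov(time ~ poison * antidote, data = poisons)
summary(poisons.aov)                         # main effects and the poison:antidote interaction
model.tables(poisons.aov, type = "means")    # table of cell and marginal means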
(a) Show that
$$\sum_{i=1}^{n}(x_i - \bar x)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_j)^2/2n.$$
(b) In the equi-replicate one-way model to compare t treatments using r replicates per
treatment, show that the between-treatments mean square ($M_t$) and the residual mean
square ($s^2$) can be written in the forms
$$M_t = \frac{r}{2t(t-1)}\sum_{i=1}^{t}\sum_{j=1}^{t}(\bar Y_{i\cdot} - \bar Y_{j\cdot})^2, \qquad s^2 = \frac{1}{2tr(r-1)}\sum_{i=1}^{t}\sum_{j=1}^{r}\sum_{k=1}^{r}(Y_{ij} - Y_{ik})^2.$$
Solution.
(a) We can write
63