
A Falsifying Rule for Probability Statements

Author: Donald A. Gillies
Source: The British Journal for the Philosophy of Science, Vol. 22, No. 3 (Aug., 1971), pp. 231-261
Published by: Oxford University Press on behalf of The British Society for the Philosophy of Science
Stable URL: http://www.jstor.org/stable/686745



A Falsifying Rule for Probability Statements¹

by DONALD A. GILLIES

1. Introduction
2. Formulation of the Falsifying Rule
3. Criticism of the Neyman-Pearson theory
4. A reply to some objections of Neyman's

1 INTRODUCTION

MY AIM in this paper is to discuss a problem raised by Popper in chapter
VIII of the Logic of Scientific Discovery. The problem is easily
explained. According to Popper a scientific theory must be capable of
being falsified by evidence. Now statistical theories form part of
reputable science; yet they are not, strictly speaking, falsifiable by the
frequency evidence which is used in practice to test them. As Popper
himself puts it ([8], p. 146):

The relations between probability and experience are also still in need of
clarification. In investigating this problem we shall discover what will at first
seem an almost insuperable objection to my methodological views. For although
probability statements play such a vitally important role in empirical science,
they turn out to be in principle impervious to strict falsification. Yet this very
stumbling block will become the touchstone upon which to test my theory, in
order to find out what it is worth.

It is very easy to see why probability statements are in principle
impervious to strict falsification. Let us take the simplest example.
Suppose we are tossing a coin and take as our statistical hypothesis H that
the tosses are independent and prob(heads) = p, where 0 < p < 1.
A typical way of testing H would be to toss the coin a large number n of
times, and observe the number m of heads. But now the probability of
getting m heads (prob(m) say) is given by the binomial formula:

    prob(m) = nCm p^m (1 − p)^(n−m)
Received 1 December 1970

1 Previous versions of this paper were read in seminars at the London School of Economics,
and I am most grateful to those who offered comments on these occasions, particularly
to Colin Howson, Imre Lakatos, Sir Karl Popper, Alan Stuart and John Watkins.
That is not to say of course that these critics would agree with the views here expressed.


It follows that no observed value of m is strictly ruled out by H. There is a
finite probability of getting any number m of heads. Admittedly some
values of m have a low probability given H, but none is impossible.
Thus H is not, in a logical sense, falsifiable by the evidence.
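The point is easily checked numerically. The following is a minimal sketch of my own, not from the paper (the sample size n = 20 is purely illustrative): under H every value of m receives a strictly positive probability, so no observation logically contradicts H.

```python
from math import comb

# Hypothesis H: n independent tosses with prob(heads) = p.
n, p = 20, 0.5
probs = [comb(n, m) * p**m * (1 - p)**(n - m) for m in range(n + 1)]

assert all(pr > 0 for pr in probs)  # no value of m is strictly ruled out by H
print(min(probs))                   # even m = 0 (or m = 20) has probability 2**-20 > 0
```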
Popper's solution to this problem is that, although probability hypotheses
are strictly speaking unfalsifiable, they can nonetheless be used as falsifiable
hypotheses, and indeed are so used in reputable science. As Popper puts
it ([8], p. 204):

the physicist knows well enough when to regard a probability assumption as falsified.

Of course Popper's point is not only an empirical claim about the behaviour
of scientists but a normative proposal. He not only holds that good scientists
use probability statements as falsifiable statements, but that all scientists
ought to use them in this way.
This proposed solution raises a problem which I propose to call 'the
problem of producing an F.R.P.S. (or Falsifying Rule for Probability
Statements)'. In other words, assuming that Popper is right and statisticians
and physicists do use probability statements as falsifiable statements,
the problem is to give explicitly the methodological rules which
implicitly guide their handling of probability. Of course once again such a
rule will not be merely a description of good scientific practice, but rather
a normative proposal that probability should be dealt with in accordance
with its dictates.
In the next section we will formulate such a rule, and examine how far
it agrees with statistical practice. In fact the suggested rule largely agrees
with the standard statistical tests (the χ²-test, the t-test, and the F-test),
but it contradicts the Neyman-Pearson theory of testing. As this theory
is still generally accepted among statisticians, the situation will consequently
look black for our proposed rule. However, rather than abandon
the rule, we will proceed in section 3 to criticise the Neyman-Pearson
theory. Finally in section 4 we will attempt to answer some objections of
Neyman's. Evidently these were not directed against the view developed
here, but they had as their object theories of testing of the same general
type as the one advocated.¹
1 It will be clear from this introduction that we take the notion of a falsifiable theory as
basic. However the falsifiability criterion has recently been criticised by Lakatos in his
paper [4]. Lakatos proposes analysing the growth of science in terms of 'unfalsifiable
research programmes' rather than 'falsifiable theories'. I believe that much of the
discussion of this paper can be reformulated within Lakatos' framework. The problem
becomes that of formulating a rule telling us when a statistical sample becomes an
anomaly for the underlying programme. However we will not discuss these general
philosophical questions in what follows.


2 FORMULATION OF THE FALSIFYING RULE

Rather than give our suggested version of an F.R.P.S. straight away, we
will consider some preliminary versions which might naturally
suggest themselves. By criticising and improving these we will be led
to the rule which will finally be advocated. The first version which might
well present itself is this. We should regard a probability hypothesis H
as falsified if the observed result given H has a very low probability.
Put a little more formally, we get a falsifying rule which runs:

R.1. Suppose that, given a statistical hypothesis H, it follows that a
certain event ω has a probability p where p < p₀, some
suitably small value. Then if ω is observed to occur, we regard
H as falsified from a practical point of view.¹
Suppose we are tossing a coin and have the hypothesis H that the tosses
are independent with prob(heads) = ½. Suppose we have 1,000,000 tosses
and observe exactly the same number of heads and tails. The probability of
this event is of course

    1,000,000C500,000 (½)^1,000,000 = 0.0008 (to 1 sig. fig.)


Now 0.0008 is a very small probability. Therefore, applying R.1 with
say p₀ = 10⁻³, we would be forced to conclude that the observation
falsified the hypothesis. In practice however one would draw the opposite
conclusion, namely that the evidence was almost the best possible in
support of the hypothesis. Worse still, the probability of any other
proportion of heads and tails is less than 0.0008, and so applying R.1 we would
have to regard H as falsified whatever happened.
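The arithmetic behind the 0.0008 figure can be checked as follows (a sketch of my own; the log-gamma route simply avoids the enormous integers involved):

```python
from math import exp, lgamma, log

def log_binom_pmf(m, n, p=0.5):
    # log of C(n, m) * p^m * (1-p)^(n-m), computed via log-gamma
    return (lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
            + m * log(p) + (n - m) * log(1 - p))

# Probability of exactly 500,000 heads in 1,000,000 fair tosses
print(exp(log_binom_pmf(500_000, 1_000_000)))  # ~0.0008, as stated in the text
```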
It is not difficult to see the error involved here. What is important is not
the actual probability of a particular outcome, but the relation between its
probability and the probabilities of other possible outcomes. Thus the
actual value of 0.0008 is low, but it is nonetheless very much greater than
the probability of getting 1,000,000 heads. This suggests that we should
introduce a measure of the probability of a given event relative to the
other possible events; and only regard a hypothesis H as falsified if, relative
to H, the observed event has a low value on this measure. In order to do
this we must first formulate the problem in a manner which is rather more
precise mathematically.
We will suppose that from our statistical hypothesis H we deduce that
a certain 1-dimensional random variable ξ has a distribution D. We will
further require that D is either a discrete or a continuous distribution. In
general we shall denote the range of ξ, i.e. the possible values it can take,

1 Something like this rule seems to have been suggested by D'Alembert. See Todhunter
[10], pp. 262-5.

by R(ξ). In the discrete case R(ξ) will consist of a finite or denumerable
set (xᵢ) (i = 1, 2, ...) with probabilities p(ξ = xᵢ) = p(xᵢ). In the continuous
case R(ξ) will be some subset of the real line, and we will suppose
ξ has a probability density function f(x). For reasons which will become
clear later, we will require that the maximum value of f(x) is finite. Our
problem is to decide which possible values of ξ (i.e. which members of
R(ξ)) should, if observed, lead us to reject the hypothesis that ξ has the
distribution D and hence the original statistical hypothesis H.

This formulation obviously corresponds very closely to the usual case
of statistical testing. There we have a random sample ξ₁, ..., ξₙ drawn from
a population for which we have postulated a hypothetical probability
distribution. We calculate a statistic η(ξ₁, ..., ξₙ) which is a function of
ξ₁, ..., ξₙ. Since the ξᵢ are r.v.'s, η is a 1-dimensional random variable, and
its distribution D can be calculated from the population distribution. In
practice D is always a discrete or a continuous distribution. The problem
is then whether the observed value of η is compatible with its hypothetical
distribution D. Sequential testing procedures can also be fitted into this
framework. Finally we can remark that the restriction to 1-dimensional
random variables is not essential. However it greatly simplifies the problem
and really involves no great loss of generality. Suppose we have a number
of r.v.'s ξ, η, ... with joint distribution D′. We can, as already indicated,
consider a 1-dimensional statistic ζ(ξ, η, ...) whose distribution D is
determined from D′.

We now introduce the concept of the relative likelihood l(x) of a possible
result x ∈ R(ξ). This is defined as follows. Let p_max be the maximum
value of the p(xᵢ) in the discrete case, and f_max be the maximum value of
f(x) in the continuous case. We then set

    l(x) = p(x)/p_max in the discrete case
         = f(x)/f_max in the continuous case.
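As an illustration (my own sketch, with an arbitrary sample size), the relative likelihood of each outcome of a fair-coin binomial is its probability scaled by the modal probability:

```python
from math import comb

def binom_pmf(m, n, p=0.5):
    return comb(n, m) * p**m * (1 - p)**(n - m)

n = 1_000                        # illustrative number of tosses
p_max = binom_pmf(n // 2, n)     # the mode of a symmetric binomial with even n

def rel_likelihood(m):
    return binom_pmf(m, n) / p_max

print(rel_likelihood(500))       # 1.0: the most probable result
print(rel_likelihood(600))       # very small: low relative likelihood under H
```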

l(x) gives a measure of the probability of an observed event in relation to
the other possible events. We can now give our second version of an
F.R.P.S., which depends on this measure.

R.2. If we observe a value x of ξ with l(x) < l₀, where l₀ is some
suitably small value, then we must regard H as falsified.
This rule is a considerable improvement on the previous one. To begin
with, it avoids the simple coin-tossing counter-example. The result 500,000
heads has an l-value of 1, the greatest possible, whereas the l-value of say
1,000,000 heads is very low. An observation of the latter, but not of the
former, would lead to a rejection, as we intuitively require. Nonetheless
R.2 in its turn falls to a rather more sophisticated counter-example.


Suppose the r.v. ξ can take any of the integral values 0, 1, 2, ..., 9,900.
Suppose its distribution D is given by

    p(ξ = 0) = 0.01
    p(ξ = i) = 10⁻⁴ (i = 1, 2, ..., 9,900)

Then for i = 1, 2, ..., 9,900 we have l(i) = 10⁻².
Assuming that this is small enough to give a falsification, we have in
accordance with R.2 that H should be rejected if we get a value i with
1 ≤ i ≤ 9,900, and accepted only if we obtain ξ = 0. But then if H
is true there is a probability of 0.99, or 99 per cent, that it will be regarded
as experimentally falsified and of only 1 per cent that it will be regarded as
experimentally corroborated. This is clearly unsatisfactory.
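The figures can be verified directly (my own check; the cut-off l₀ = 0.05 is an arbitrary illustrative choice):

```python
# The counter-example distribution as defined in the text
p = {0: 0.01}
p.update({i: 1e-4 for i in range(1, 9_901)})
assert abs(sum(p.values()) - 1.0) < 1e-9      # a genuine probability distribution

p_max = max(p.values())                       # = p(0) = 0.01
l = {i: p_i / p_max for i, p_i in p.items()}  # l(i) = 0.01 for every i >= 1

l0 = 0.05                                     # illustrative cut-off
prob_reject = sum(p[i] for i in p if l[i] < l0)
print(prob_reject)                            # 0.99: H rejected with probability 99%
```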
Once again it is not difficult to see the root of the trouble here. We must
require not only that the observed result x should have a small relative
likelihood but that it should be untypical in this. If nearly all the possible results
had low relative likelihoods, as in the counter-example just given, then we
would actually expect to get a result with low relative likelihood and would
not regard it as falsifying the hypothesis. We must therefore in each case
consider the class of possible results x with l(x) < l₀, and require that this
class has a low probability given H. In this way we are led to the familiar
statistical concept of a critical region. Consider the range R(ξ) of ξ. We
can look on the problem as that of finding a subset C of R(ξ) s.t. if the
observed value of ξ (x say) is in C, we regard H as falsified. C is then called
a 'critical region' or a 'falsification class'. Let us call the probability that
we get a result x ∈ C, k(C) (or simply k). We must then require that k
is small, say k < k₀. In addition we have to formulate the considerations
of relative likelihood already introduced. Let us define the relative
likelihood of any arbitrary subset B of R(ξ) by the formula

    l(B) =def max_{x ∈ B} l(x).

The relative likelihood of the critical region, l(C), we shall usually denote
simply by l. Our previous discussion now leads to the requirement that
l < l₀ for some suitably small value l₀. Finally let us denote the set
R(ξ) − C by A. A is the 'acceptance region' of the test. It seems reasonable
to require that the relative likelihood of any of the results x in A should
be greater than the relative likelihood of any result in C. Putting all this
together we reach our third version of an F.R.P.S.

R.3. Suppose that the range R(ξ) of ξ can be partitioned into two
disjoint subsets C and A with C ∪ A = R(ξ) and s.t.
(i) prob(ξ ∈ C) = k < k₀
(ii) l(C) = l < l₀, where k₀ and l₀ are suitably small constants &


(iii) l(x) > l for all x ∈ A.

We shall then regard H as falsified if the observed value of ξ, x say, lies
in C.

This rule is very near our final version, but it is not satisfactory because
the requirement (iii) turns out to be too weak. To see this, consider the
following counter-example, which is a modification of our previous one.
Suppose a r.v. ξ can take the integral values 0, 1, 2, ..., 9,940. Suppose
further ξ has the distribution D defined by

    p(ξ = 0) = 0.01
    p(ξ = i) = 10⁻⁴ (i = 1, 2, ..., 9,540)
    p(ξ = i) = 9 × 10⁻⁵ (i = 9,541, ..., 9,940).

Suppose we now set

    A = (0, 1, 2, ..., 9,540)
    C = (9,541, ..., 9,940)

Then

(i) k(C) = 0.036 < 0.05 (the 5 per cent level)
(ii) l(C) = 9 × 10⁻³ (presumably a sufficiently low value) &
(iii) l(x) > l for x ∈ A.
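Again the figures can be checked mechanically (my own verification of conditions (i)-(iii) for the distribution just defined):

```python
# The modified counter-example distribution
p = {0: 0.01}
p.update({i: 1e-4 for i in range(1, 9_541)})
p.update({i: 9e-5 for i in range(9_541, 9_941)})
assert abs(sum(p.values()) - 1.0) < 1e-9

p_max = max(p.values())
A = range(0, 9_541)
C = range(9_541, 9_941)

k = sum(p[i] for i in C)
l_C = max(p[i] / p_max for i in C)
print(k)                                   # 0.036 < 0.05   -- condition (i)
print(l_C)                                 # 0.009          -- condition (ii)
print(all(p[i] / p_max > l_C for i in A))  # True           -- condition (iii)
```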
Thus the partition (A, C), as defined above, satisfies the requirements of
R.3. It does not however strike me as satisfactory from an intuitive point
of view. My reason is that the results of the rejection class have relative
likelihoods which are not significantly less than those of most of the results in the
acceptance class A. It does not seem sensible to reject the hypothesis if
we get a result with relative likelihood 9 × 10⁻³ when nearly all the results
have a relative likelihood (10⁻²) which is only marginally greater. What
makes this case unsatisfactory is that the value p_max is in no way typical of
the majority of probabilities of results in the acceptance class. In fact it is
very untypical. Consequently I suggest that condition (iii) in R.3 be
strengthened to state that p_max is in some sense representative of the
probabilities p(x) for x ∈ A. If we add this qualification to R.3 we obtain the
F.R.P.S. which will in fact be advocated in what follows. We will now
try to state it in a reasonably precise fashion.
It will prove convenient to introduce a new concept in terms of which
the F.R.P.S. can be simply stated. Our procedure is to partition R(ξ) into
two disjoint sets C and A which satisfy certain conditions. Now it will
obviously not always be possible to do this for a random variable with any
arbitrary distribution. Where it is possible we shall speak of a random
variable with a falsifiable distribution. More precisely this can be defined
as follows. We shall say that a 1-dimensional random variable ξ has a
falsifiable distribution if it is possible to partition R(ξ) into disjoint sets
A and C with A ∪ C = R(ξ) where


(i) prob(ξ ∈ C) = k < k₀
(ii) l(C) = l < l₀, where k₀, l₀ are suitably small constants &
(iii) the value p_max (or f_max in the continuous case) is in some sense
representative of the probabilities (resp. probability densities) of points
x ∈ A.

If a r.v. ξ has a falsifiable distribution D, we shall call a set C a
'critical region' or 'falsification class' associated with D, and a set A an
'acceptance region' associated with D. In terms of these concepts we can
now formulate our falsifying rule as follows.

F.R.P.S. If from a statistical hypothesis H we can deduce that a r.v. ξ
has a falsifiable distribution D, and if C is a critical region associated with
D, then if a value x of ξ with x ∈ C is observed we regard H as falsified.
We will now make a number of comments about this F.R.P.S.

First of all, when we test H by means of ξ we can be said to be predicting
ξ ∈ A, and we regard our prediction as falsified if ξ ∉ A. In the deterministic
case 'predicting' and 'explaining' are in a sense symmetrical. This
symmetry can be restored in the statistical case if we introduce the notion
of 'deducibility in accordance with the falsifying rule', or 'f-deducibility'
for short. This can be made precise as follows. Suppose that from a
statistical hypothesis H we deduce (logically) that a certain r.v. ξ has a
falsifiable distribution D with associated acceptance region A. We shall
then say that ξ ∈ A is f-deducible from H. We will say that a certain
frequency phenomenon is explained by a statistical hypothesis H if the
frequency phenomenon can be described by a statement s of the form
'ξ ∈ A' where s can be f-deduced from H. If we extend the concept of
deducibility to include f-deducibility, we thereby preserve the so-called
'deductive model of explanation'. The symmetry between 'explaining'
and 'predicting' is again clear. In both cases we f-deduce that ξ ∈ A.

If ξ ∈ A has already been established experimentally, we say that it has
been explained. If it has not been so established, we have a prediction.
If ξ is then in fact observed to be in A, the prediction is verified and the
hypothesis corroborated. If however we find ξ ∉ A, the prediction and
consequently the hypothesis is falsified. These points are straightforward,
but nonetheless important. Most statisticians who have considered this
problem have concentrated their attention on testing statistical hypotheses
rather than on using such hypotheses to explain frequency phenomena.
As a result there are certain theories of testing which allow statistical
hypotheses to be tested, but do not allow us to use such a hypothesis to
explain an observed phenomenon. Such theories cannot in my view be
regarded as satisfactory.
It is often said that statistical laws tell us nothing about the particular
case, but only about what happens in a large number of similar cases. This


is not a very satisfactory formulation, since a large number of similar
trials can be considered as a single combined trial, and statistical laws do
tell us what happens on trials of this form. The notion of a falsifiable
distribution helps to make clear what truth there is in this remark. Suppose
we have a random variable ξ. We can only make predictions about the
observed value of ξ if ξ has a falsifiable distribution, and in that case we
can f-deduce and hence predict that ξ ∈ A, where A is an acceptance
region. Let ν be a random variable denoting the number of heads observed
in a coin-tossing experiment. Suppose we toss the coin just once; then ν
does not have a falsifiable distribution. In fact ν has the uniform distribution

    prob(ν = 0) = prob(ν = 1) = ½,

which is almost the paradigm case of a non-falsifiable distribution.
Suppose we toss the coin a large number n of times; then ν will have a
falsifiable distribution and we will be able to make predictions about how
many heads will be observed.
Let us now see how the F.R.P.S. applies to a simple but very important
case. Suppose ξ has a continuous, unimodal, 'bell-shaped' distribution of
approximately normal form, as shown in Fig. 1.
Obviously the F.R.P.S. will instruct us to choose an acceptance interval
of the form [a, b] and a corresponding critical region (−∞, a) ∪ (b, +∞).
In other words, to divide the range of ξ into a 'head' and two 'tails', and
to reject H only if we obtain an observed value of ξ in the tails. We could
term this procedure 'cutting off the tails of the distribution'. We have

    k = ∫_{−∞}^{a} f(x) dx + ∫_{b}^{+∞} f(x) dx

and l = max(f(a)/f_max, f(b)/f_max).
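In code the computation looks like this (a sketch of my own; the standard normal density and the cut points a = −2, b = 2 are purely illustrative):

```python
from math import erf, exp, pi, sqrt

def f(x):                            # an illustrative unimodal density (standard normal)
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):                          # its cumulative distribution function
    return 0.5 * (1 + erf(x / sqrt(2)))

a, b = -2.0, 2.0
f_max = f(0.0)

k = Phi(a) + (1 - Phi(b))            # probability mass in the two tails
l = max(f(a) / f_max, f(b) / f_max)  # relative likelihood at the cut points
print(k, l)                          # ~0.046 and ~0.135
```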

Let us now examine the significance of the condition that f_max should,
in some sense, 'be representative of the values of f(x) in [a, b]'. Another
way of putting this is as follows. We require that f(x) should have a low
value in the 'tails' of the distribution, but once we enter the 'head' or
'acceptance region' [a, b], we would like f(x) to rise to its maximum value
f_max as quickly as possible. So we require that f should increase swiftly
once inside the region [a, b]; but how swiftly? I believe that, at the cost
admittedly of some arbitrariness, we can give a precise answer to this
question. This criterion then enables us to say definitely whether a
distribution of the continuous, unimodal form is falsifiable or not, and
if it is falsifiable, to divide it into a 'head' and 'tails'. This procedure is of
course very useful in comparing the recommendations of our F.R.P.S.
with the methods actually used in statistical practice. However we will
not here give the details of this further attempt at preciseness. Rather we
will assume that the qualitative considerations given to date enable us,


roughly at least, to divide continuous unimodal distributions into those
which are falsifiable and those which are not, and to define the boundaries
of the related acceptance regions.
We must now say a few words about the interpretation of the constants
k and l. In the usual approach the constant k (the size of the critical region)
is wholly arbitrary. We have to begin by selecting a suitable value, and if
we then want to change to another value we can correspondingly alter our
falsification class without difficulty. On our own approach, it is not such
an easy matter to alter k. Suppose we want to test a hypothesis H by
observing one value of a random variable ξ. Suppose further that ξ has a
continuous, unimodal distribution. For the test to be possible at all we
must first have that ξ has a falsifiable distribution. Suppose this is so.

Fig. 1. [A continuous unimodal density of falsifiable form: the acceptance region is the interval [a, b]; the ordinates f(a) and f(b) mark where the tails are cut off, and f_max is the maximum of the density.]

[...]

In the deterministic case we predict with certainty that x ∈ I, where x is an
observation and I is an interval. In the statistical case we make a similar
prediction, but only with some k- and l-values. The deterministic case can
be taken as a limiting case of this with k = l = 0, and in general the values
of k and l give a guide as to how much the statistical differs from the
deterministic case. Of course we have still got to fix the crucial values
k₀ and l₀, i.e. the degree of divergence from the deterministic case which we
consider allowable. However the preceding considerations about the
geometry of distributions give us some guide on this too.
The argument is this. It will be granted that the normal distribution
plays a fundamental role in statistics, for example in the theory of errors.
Now the normal distribution is evidently of falsifiable form. If we divide
its range into an acceptance region and a critical region, using the rough
geometrical considerations explained above, we obtain k = 3 per cent,
l = 10 per cent. As we will naturally expect the normal distribution to
have better properties than can be hoped for in general, this suggests
giving k₀ and l₀ rather higher values than the above ones. We are thus led
to the suggestion k₀ = 5 per cent, l₀ = 15 per cent. This provides some
kind of justification, albeit a weak one, for the 5 per cent value customarily
used by statisticians.
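The pairing k = 3 per cent, l = 10 per cent can be checked numerically (my own calculation, not the paper's): if the normal distribution's tails are cut where the density falls to 10 per cent of its maximum, the two-tailed size comes out at about 3 per cent.

```python
from math import erf, log, sqrt

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

l0 = 0.10
b = sqrt(-2 * log(l0))  # exp(-b^2/2) = 0.10, so the density at b is 10% of f_max
k = 2 * (1 - Phi(b))    # two-tailed critical probability
print(b, k)             # b ~ 2.146, k ~ 0.032 (about 3 per cent)
```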
We are now in a position to examine whether or not our proposed rule
agrees with statistical practice. I think it is clear that the rule accords well
with the standard statistical tests. Consider the χ²-, t-, and F-distributions.
For sufficiently high degrees of freedom, these are all of the continuous,
unimodal form already discussed. Using our rough geometrical considerations,
we can divide their ranges into a head and tails, and it turns out
that the corresponding k- and l-values are of the right order. In this way
the χ²-, t-, and F-tests are justified from our point of view.
However, at the same time a certain difference between the present
approach and standard practice makes itself manifest. Sometimes a
one-tailed version of, say, the t-test is used. Now this is certainly illegitimate
from our point of view. Suppose the tail used as the critical region C in
the one-tailed test is the right-hand one. Then we can certainly find
points x in the acceptance region s.t. l(x) < l(C). Points in the left-hand
tail of the curve can be chosen for this. This contradicts even the weaker
preliminary version R.3 of our falsifying rule, and thus evidently the final,
stronger version.
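This can be exhibited numerically (my own sketch; the choice of 10 degrees of freedom and the conventional one-tailed 5 per cent point t ≈ 1.812 are illustrative assumptions):

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    # density of Student's t-distribution, via log-gamma
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

df = 10
c = 1.812                   # one-tailed 5 per cent point for 10 d.f.
f_max = t_pdf(0.0, df)

l_C = t_pdf(c, df) / f_max  # relative likelihood on the boundary of C
x = -3.0                    # a point in the *accepted* left-hand tail
print(t_pdf(x, df) / f_max) # smaller than ...
print(l_C)                  # ... l(C): condition (iii) of R.3 is violated
```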
One-tailed tests are used because they are suggested by the Neyman-
Pearson theory. So a difference between the present approach and the
Neyman-Pearson theory has come to light. We will now examine the
relations between the two accounts in rather more detail.
Our F.R.P.S. forced us in the continuous unimodal case just discussed


to take for our falsification class the 'tails' of the distribution. Only if we
chose the tails could we obtain a low l-value as required. Consider however
a falsifying rule of the following form (which we shall call R.4).

R.4. Suppose the range R(ξ) of a random variable ξ with postulated
distribution D can be partitioned into two disjoint sets A and C
with A ∪ C = R(ξ), and suppose prob(ξ ∈ C) = k < k₀ where k₀ is
some suitable constant. Then if we observe ξ ∈ C, the hypothesis
that ξ has distribution D must be taken as falsified.

With such a rule we would not be restricted to the 'tails' of the distribution,
but could for example in the continuous case choose for C any interval

    [c, d] satisfying ∫_{c}^{d} f(x) dx = k < k₀.


Some 'counter-intuitive' choices of this type are illustrated in Fig. 2.

Fig. 2. [Showing 'counter-intuitive' choices of critical region: narrow intervals taken in the head of the distribution rather than in its tails.]

Suppose again that we are tossing a coin for which prob(heads) = p, and that
we toss it a large number of times, so that the observed proportion of heads is approximately
normally distributed about p. Now if we chose for our critical region a narrow interval
(p − δ, p + δ) about the centre, with δ fixed so that its probability
k < k₀, then a result in
extremely good agreement with the hypothesis would lead
us to reject it. Or again we could choose an interval (r,

r + δ′) say, where δ′ is chosen so that ∫_{r}^{r+δ′} f(x) dx = k < k₀. But
such a procedure is completely arbitrary.


We can now relate our approach to the Neyman-Pearson theory.
Roughly speaking, Neyman and Pearson begin by adopting a falsifying rule
of the form R.4. They then note the paradoxical choices of rejection class
illustrated in Fig. 2. Then, in order to escape the difficulty, they add a new principle
to the effect that we cannot test a given hypothesis in isolation, but only
against a set of alternative hypotheses. Developing from this point, they
show how the counter-intuitive falsification classes can be ruled out.

Our own procedure is somewhat different. We eliminate these counter-intuitive
falsification classes by requiring a low l- as well as a low k-value.
We say in effect that the falsification class must have a low relative
likelihood as well as a low probability. We thus solve the problem without
appeal to a principle of alternative hypotheses, and it consequently
becomes possible, on our position, to test a statistical hypothesis 'in isolation',
i.e. without considering well-defined alternative hypotheses.
A second difference from the Neyman-Pearson approach lies in our
notion of a 'falsifiable distribution'. On our account it is only random
variables ξ with certain distributions D whose ranges R(ξ) can be divided
into an acceptance region A and a critical region C. Suppose however
that we adopted a falsifying rule of the form R.4. Then any random
variable ξ with any distribution D could have its range R(ξ) partitioned
into an A and a C. There would be no distinction between 'falsifiable' and
'non-falsifiable' distributions.

The difference is most vividly shown in the case of the rectangular
distribution. Suppose ξ has a frequency function f(x) defined by

    f(x) = 1 for −½ < x < ½
         = 0 otherwise.

Adopting R.4 we could choose for our critical region C any interval (a, b)
where −½ < a < b < ½ and b − a = k, where k is some suitable constant
< k₀. On the Neyman-Pearson theory a number of critical regions C
of the form (a, b) are possible, depending on which 'class of alternative
hypotheses' we adopt. On our own view however no critical region of the
form (a, b) is allowable. For any such region we have l = 1, i.e. its maximum
possible value. Hence we certainly shall not have l < l₀ whatever 'crucial'
value l₀ is chosen. I would claim that the non-existence of a falsification
class here is in accordance with intuition. Suppose we have say −½ < a <
b < a′ < b′ < ½, and b′ − a′ = b − a = k. If we now adopt (a, b) as our

critical region, we will reject the hypothesis if the observed value x ∈ (a, b)
and not if x ∈ (a′, b′). Yet as far as the hypothesis under test goes, the
two regions (a, b) and (a′, b′) are exactly symmetrical. It seems wholly
arbitrary to adopt one and not the other as a critical region. The only
solution is to say that this particular distribution is not one of the falsifiable
kind.
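A two-line check makes the point (my own sketch): for the rectangular distribution the density is flat, so every candidate critical region has relative likelihood l = 1, its maximum possible value.

```python
def f(x):                  # rectangular density on (-1/2, 1/2)
    return 1.0 if -0.5 < x < 0.5 else 0.0

f_max = 1.0
a, b = 0.1, 0.15           # an arbitrary candidate critical region (a, b)
l_C = max(f(x) / f_max for x in [a, (a + b) / 2, b])
print(l_C)                 # 1.0: no l < l_0 is achievable, so no falsification class exists
```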

Thirdly we have the point already mentioned that, on the Neyman-Pearson
theory, we can sometimes have one-tailed tests, where, from our
point of view, only two-tailed tests are valid. These points show that there
is a clear difference between the view presented here and the Neyman-Pearson
theory. Now the Neyman-Pearson theory is the generally accepted
account of testing statistical hypotheses. Thus if the present view is to gain
any credence, we must give reasons why the Neyman-Pearson theory
should be abandoned. This task will be attempted in the next section.

3 CRITICISM OF THE NEYMAN-PEARSON THEORY

The feature of the Neyman-Pearson theory which I propose to attack
could be called 'the principle of alternative hypotheses'. Something like
this principle is also involved in other approaches to Statistics, for
example in Decision theory, and in approaches based on the so-called
'likelihood principle'. Our criticisms will thus apply also to these accounts,
though we will not stress this point. The principle in question states that
when we are testing a given statistical hypothesis H, we can (and should)
devise a set of alternatives to H, and then represent the problem as that
of testing H against these alternatives. Neyman and Pearson state the
principle as follows in their 1933 paper ([7], p. 187):

In any given problem it is assumed possible to define the class of admissible
hypotheses C(H), containing H₁, H₂, ..., as alternatives to H₀.

Moreover this principle is clearly necessary for the Neyman-Pearson
theory, because only by consideration of such alternatives can they rule
out absurd choices of the critical region.
We are going to attack this principle of alternative hypotheses, but this
does not mean that we are opposed lock, stock and barrel to the idea of
devising alternative hypotheses when testing a given hypothesis. On the
contrary, the history of science is full of examples of the fruitfulness of
alternative hypotheses. We will content ourselves with mentioning what
is perhaps the most famous example. In the early years of the sixteenth
century, the Ptolemaic theory was a well-worked-out account of the
universe which moreover agreed with observation within the experimental


error of the day, except perhaps for a few anomalies. Yet Copernicus
devised an alternative hypothesis which agreed with observation as well
as the Ptolemaic theory. This new hypothesis led to the production of
new tests of the Ptolemaic account, and generally played a large part in
stimulating the enormous scientific advances of the next hundred years.
Granted then that alternative hypotheses can be of such value, why should
I want to attack the principle of alternative hypotheses as it appears within
the Neyman-Pearson theory?
There are really two reasons. First of all, although it is often desirable
to devise alternatives when testing a given hypothesis, it is by no means
necessary to do so. There are many situations where we want to test a
given hypothesis 'in isolation', i.e. without formulating precise
alternatives.¹ Indeed it is often the failure of such a test which elicits an alternative
hypothesis. Suppose a hypothesis H is suggested, but as yet we have no
precisely defined alternatives. Then on our account we can test out H.
If it passes the tests, well and good. It can be provisionally accepted, and
used for some purpose. If on the other hand it fails the test, we will then
try to devise a new hypothesis H' which avoids the refutation. In such
cases the falsification of H provides the stimulus for devising an alternative
hypothesis. Now if we stick to the Neyman-Pearson approach, the
alternative hypothesis H' should have been devised before the very first
test of H, and that test should have been designed with the alternative in
mind. The practising statistician can justly complain that this is too much
to demand. He could point out that H might have been corroborated by
the tests, in which case the trouble and mental effort of devising an alternative
would have been unnecessary. Further, he could argue that even if H
is falsified, the nature of the falsification will give a clue as to what alternative
might work better. It would be silly to start devising alternatives
without this clue.
Now admittedly it is a very good thing if a scientist can devise a viable
alternative H' to a hypothesis H, even when H has not yet been refuted.
As we have just explained, Copernicus devised an alternative astronomical
theory even though the existing one (the Ptolemaic) was reasonably well
1 This view was held by Fisher, who writes in ([3], p. 42):
"On the whole the ideas (a) ... and (b) that the purpose of the test is to discriminate
or 'decide' between two or more hypotheses, have greatly obscured their understanding,
when taken not as contingent possibilities but as elements essential to their
logic. The appreciation of such more complex cases will be much aided by a clear
view of the nature of a test of significance applied to a single hypothesis by a unique
body of observations."
In a sense what follows can be considered as an attempt to support this opinion of
Fisher's against that of Neyman and Pearson. Fisher's opinion has been supported
recently by Barnard; cf. his contribution in L. J. Savage and others [9].


corroborated. It is unreasonable however to demand that Copernicus'
example be followed in every case. That would indeed be a counsel of
perfection. In most cases it is only the refutation of a given hypothesis H
which provides the stimulus for devising a new hypothesis H'. Consider
then the schema. A hypothesis H is suggested. It is tested 'in isolation',
and falsified. This falsification stimulates the production of an alternative
hypothesis H' which works better. We shall show in a moment by means of
examples that this schema frequently occurs in statistical investigations.
However, it completely contradicts the model embodied in the Neyman-Pearson
theory.
So far, when we have spoken of 'an alternative hypothesis', we have meant
some hypothesis genuinely different from the one under test. But in
practice Neyman and Pearson do not use 'alternative hypothesis' in such a
sense, and this constitutes our second objection to their principle of
alternative hypotheses. In practice the alternative hypotheses considered
by Neyman and Pearson are nothing but the same hypothesis with
different parameter values. Suppose, for example, that the hypothesis
under test is that ξ is normal (μ₀, σ₀); then the alternatives will be that ξ is
normal with different μ, σ (or, in some cases, just with different μ). Thus
the alternatives generally considered when the Neyman-Pearson theory is
applied are merely trivial variants of the original hypothesis. But this is an
intolerably narrow framework. We could (and should) consider a much
wider variety of different alternatives. For example we might consider
alternatives which assigned a distribution to ξ of a different functional
form. Again we might reject the assumption that the sample x₁, ..., xₙ is
produced by n independent repetitions of a random variable ξ and try
instead a hypothesis involving dependence. We might even in some cases
replace a statistical hypothesis by a complicated deterministic one. By
restricting alternatives to such a narrow range, the Neyman-Pearson
theory places blinkers on the statistician, and discourages the imaginative
invention of a genuinely different hypothesis of one of the types just
mentioned. It must be remembered too that if a genuinely different
hypothesis is proposed and corroborated, the fact that the original falsifying
test was (say) UMP¹ in a class of trivial variants ceases to have much
significance.
To this argument a reply along the following lines will no doubt be made:
"These 'academic' objections of yours are all very well. We fully admit
that it would be nice to have a theory of testing which embodied all the
alternatives you speak of. But such a theory would be difficult, if not
impossible, to construct. In such a situation the practical man must be
1 I.e. 'uniformly most powerful'. We shall use this standard abbreviation throughout.


content with the best we can do which is to consider only certain simple
alternatives. Moreover the Neyman-Pearson model embodying these
simple alternatives finds frequent and useful application in statistical
practice." Against this I claim that the Neyman-Pearson model does not
fit most statistical situations at all well, and I shall try to show this by
means of examples.
Our first example of a statistical investigation which does not fit the
Neyman-Pearson theory is taken, oddly enough, from Neyman himself.
It occurs in his ([6], pp. 33-7). The problem dealt with arose in the field
of biology. An experimental field was divided into small squares and counts
of the larvae in each of these squares were made. The problem was to find
the probability distribution of the number n of larvae in a square. The first
hypothesis suggested was that this random variable had a Poisson distribution

    pₙ = exp(−λ) λⁿ/n!

for some value of the parameter λ (i.e. a composite hypothesis). This was
then tested by the χ²-method. The possible results were divided into 10
classes corresponding to 0, 1, ..., 8, 9 or more, observed larvae. The number
mₛ, s = 0, 1, ..., 9, observed in each class was noted, and the expected
number m′ₛ was calculated given the hypothesis. For the purposes of this
calculation the unknown parameter was estimated by the χ²-minimum
method. Finally the value of the χ²-statistic, namely

    Σ_{s=0}^{9} (mₛ − m′ₛ)²/m′ₛ,

was calculated. Under these circumstances it can be shown mathematically
that the χ²-statistic is approximately distributed in a χ²-distribution with
r − k − 1 degrees of freedom, where we employ r classes and estimate k
parameters from the sample. Here r = 10, k = 1. Thus we have 8 degrees
of freedom. The value of the χ²-statistic obtained was 46.8, and so we get
a clear falsification. The details of the results are given in Neyman [6],
p. 33, Table III.
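The shape of the computation is roughly as follows. This is my own sketch, not Neyman's: the observed counts below are placeholders (his actual data are in Table III and not reproduced here), and the simple moment estimate of λ stands in for the χ²-minimum method.

```python
from math import exp, factorial

obs = [10, 25, 30, 20, 15, 10, 8, 6, 4, 12]  # PLACEHOLDER counts, classes 0..8 and '9 or more'
N = sum(obs)

# Crude moment estimate of lambda (Neyman used the chi^2-minimum method;
# treating the open-ended last class as exactly 9 is a further simplification)
lam = sum(i * obs[i] for i in range(10)) / N

def poisson(i):
    return exp(-lam) * lam**i / factorial(i)

probs = [poisson(i) for i in range(9)]
probs.append(1 - sum(probs))                 # probability of '9 or more'
expected = [q * N for q in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
print(chi2)  # referred to the chi^2 distribution with r - k - 1 = 10 - 1 - 1 = 8 d.f.
```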
Let us pause to consider whether this test fits the general Neyman-Pearson
theory. It is clear that it does not. We have a composite hypothesis
that the distribution is Poisson for some value of the parameter λ. This
hypothesis includes all possible parameter values, and thus we cannot
generate alternative hypotheses in the usual Neyman-Pearson fashion by
varying one or more parameters. Of course this does not prevent us from
obtaining alternative hypotheses in some other way. But does Neyman
do so? Does he set up alternatives and try to find say an UMP test? Not at
all. He accepts the usual χ²-test without more ado. Let us next consider
again the possible 'counter-intuitive' choices of critical region which we
mentioned (Section 2, p. 241). These consisted in taking for our critical
region not the tails of the distribution, but a narrow region in the head.
If this were done for the χ²-distribution in this case, the result 46.8 would be a


corroboration, not a falsification. Naturally Neyman does not choose such a
counter-intuitive critical region, but from the point of view of his own
general theory he would be perfectly entitled to do so. On the Neyman-Pearson
theory these 'counter-intuitive' critical regions are only ruled out
because they have low power relative to some set of alternatives. Here we
have no set of alternatives, and thus the counter-intuitive regions become
perfectly possible. On our own account, of course, such regions are ruled
out because they have high l-values, and this does not involve considering
alternative hypotheses.
Let us now continue with our account of Neyman's investigation. The
falsification of the Poisson hypothesis stimulated Neyman to produce a
new hypothesis. This he arrived at by a brilliant heuristic argument which
could well serve as a model of scientific reasoning. Neyman argues
(Neyman [6], pp. 34-5) that we would expect a Poisson distribution if each
larva is put on the field independently of the others. However this is not
what happens at all. A moth lays eggs at some point in the field which we
can suppose to be randomly selected, i.e. the egg-laying points will follow
a Poisson distribution. The moth lays a large number of eggs at once, and
from these eggs hatch the larvae which then crawl slowly away from their
birthplace in search of food. Naturally the number of eggs laid will vary
from point to point, and also different numbers of larvae will die in each
case. If the larvae crawl slowly, we will not expect them to be distributed
randomly, i.e. according to the Poisson distribution. Rather we would
expect them to be distributed in clumps of varying size about centres
which are randomly distributed. Neyman proceeded to construct a
mathematical distribution based on this mechanism. Its details are a little
complicated and we will follow him in referring to it as a 'Type A distribution',
remarking only that it depends on two parameters. Neyman then
proceeded to test the Type A distribution using the χ²-method just as
before. The only difference was that 2 parameters were now estimated
from the sample, so that the χ²-statistic had (approximately) the χ²-distribution
with 7 degrees of freedom. The result of the test this time was
a definite corroboration, and consequently a triumphant vindication of
Neyman's heuristic reasoning.
At this point it might be objected: "You say that Neyman introduced a
Type A distribution. Thus he did after all produce an alternative hypothesis
as required by his general theory of testing." I admit that Neyman did
produce an alternative hypothesis, but only after the original Poisson
hypothesis had been tested and falsified. Thus the alternative hypothesis
was irrelevant to the first test of the Poisson hypothesis. This Neyman
himself says (Neyman [6], p. 34):


In all cases, the first theoretical distribution tried was that of Poisson. It will be
seen that the general character of the observed distribution is entirely different
from that of Poisson. There seems no doubt that a very serious divergence
exists between the phenomenon of distribution of larvae and the machinery
assumed in the mathematical model. When this circumstance was brought to my
attention by Dr. Beall, we set out to discover the reasons for the divergence (my
italics).

In other words it was only after the first test that Neyman attempted to
devise an alternative hypothesis. Indeed, as so often in science, it was the
falsification of a hypothesis which stimulated theoretical activity. Now of
course the Type A hypothesis could have been devised before the first
test. But as we pointed out above, it is unreasonable to demand scientific
ingenuity of this level every time a hypothesis is tested. Moreover had the
Poisson distribution proved satisfactory, the mental effort needed to produce
the Type A hypothesis would have been unnecessary.
A second point to notice concerns Neyman's handling of the second
test, i.e. of the test of the Type A hypothesis. Now here a genuine alternative
exists, namely the Poisson hypothesis. If Neyman had been true to
his principles, he should have tried to devise say an UMP test of the Type
A hypothesis against the Poisson alternative. Of course he did not follow
this course, which would have involved him in difficult (perhaps impossibly
difficult) mathematics, but instead used the standard χ²-test. From
our point of view he was eminently justified. The χ²-test procedure falsifies
one hypothesis and corroborates the other; it is thus genuinely crucial
between the two hypotheses, and hence severe and to be commended.
But is Neyman justified from the point of view of his own theory? Without
a complicated mathematical investigation of the power properties of the
χ²-test, it is impossible to say.
For these reasons it cannot, I think, be denied that the piece of
statistical reasoning as actually carried out by Neyman was not fitted into
the Neyman-Pearson theory of testing. It might however be claimed that
it could, as it were retrospectively, be fitted into the theory. That is to say
we might, long after the event, propose some alternative hypotheses and
show that the tests used were in some sense optimal against these alterna-
tives. Indeed attempts have been made to fit the χ²-test into the framework
of the Neyman-Pearson theory. We must now examine whether these
provide a solution to the present difficulty.
A typical such attempt is made by Lehmann in [5], pp. 303-6. Let us
suppose we are testing the hypothesis H₀ that a certain real random
variable ξ has a distribution F(x). We will suppose the distribution
completely specified, so that the hypothesis is 'simple'. The test used is the

ordinary χ²-test with k − 1 degrees of freedom. We therefore begin by
partitioning the real line into k disjoint classes C₁, ..., C_k, and calculating
the probabilities p₀ᵢ, given H₀, of the result lying in Cᵢ, i = 1, ..., k. The
χ²-statistic is then calculated in the usual way. Lehmann proposes that we
consider H₀ in the form H₀′:

    prob(ξ ∈ Cᵢ) = p₀ᵢ, i = 1, ..., k

and set up a set of alternatives H₁ say, defined by

    prob(ξ ∈ Cᵢ) = p₁ᵢ, i = 1, ..., k

where the p₁ᵢ are arbitrary non-negative constants subject to the conditions
p₁ᵢ ≠ p₀ᵢ, i = 1, ..., k, and Σ_{i=1}^{k} p₁ᵢ = 1. He next shows that the χ²-test has
certain optimal properties relative to these alternatives. In fact these
optimal properties are hardly very convincing. However we will not stress
this point. Another difficulty is that in the present Neyman example we
are considering a composite hypothesis in which the distribution of ξ
contains certain arbitrary parameters. It is not clear that the present method
will apply to this case. However we propose to show that it is inadequate
even in the simple case, and thus a fortiori in the composite one.

Our first objection is that, for the purposes of testing, H₀ is replaced by
a hypothesis H₀′ which is not equivalent to it. In fact the assertion
prob(ξ ∈ Cᵢ) = p₀ᵢ, i = 1, ..., k is compatible with many other distributions
of ξ besides F(x). Thus H₀′ is much weaker than H₀. This illegitimate
replacement is not an accidental feature of the procedure. Suppose we
retained H₀ in the form: ξ has the distribution F(x). Then it would be
natural to consider the alternatives: ξ has the distribution G(x), where G
is an arbitrary distribution different from F. However, relative to such a set
of alternatives no power properties could be established. Thus H₀ has to
be replaced by H₀′, which depends on a number of parameters. We can thus
apparently generate satisfactory alternatives in the usual way by varying
parameter values. However these alternatives are really only plausible
against a quite different hypothesis from the one actually being tested.
A similar device is used by Neyman in constructing his 'smooth' test of
goodness of fit.

The next objection can be stated by introducing the notion of a 'serious'
alternative. Let us say that H' is a serious alternative to H if we might
actually adopt H' in the event of H being falsified. Now in my view it is
of no value considering alternatives which are not serious in this sense;
but the H₁ as defined above are evidently not serious. Suppose for example
F(x) is a continuous distribution. Then if we regarded H₀ as falsified we
would probably try to replace it by a hypothesis which assigned a
different continuous distribution to ξ (as in the present Neyman case).


We could certainly not adopt a hypothesis of the form H₁ in place of H₀.
Anyone who doubts this should try to produce an example from statistical
practice where a hypothesis of the form: ξ has continuous distribution F,
has been falsified and actually replaced by a hypothesis of the form H₁.
I feel certain that it will be impossible to do so.
The artificiality and inadequacy of the alternatives H₁ is further illustrated
by observing that they depend on inessential details of the actual
testing procedure. Instead of the k classes C₁, ..., C_k we could always
employ k − 1 or k + 1 classes, or we could partition the real line into k
classes in a different way. In all cases we would obtain a different set of
alternatives H₁, illustrating the complete arbitrariness of the original set.
Indeed the situation is usually made still worse by restricting the H₁ to those
of the form

    prob(ξ ∈ Cᵢ) = p₀ᵢ + aᵢ/√n, i = 1, ..., k

where the aᵢ are fixed constants and n is the sample size. Why, one wonders,
is aᵢ/√n chosen rather than any other function of n? The answer
seems to be not that these alternatives are more realistic in any sense but
that they give mathematically more interesting results! What is bad here,
however, is not only the arbitrariness of the function of n but the fact that
the alternatives should depend on n at all. This amounts to saying that if
we draw a sample of 100, we should have one set of alternatives, whereas
if we draw a sample of 150 we should have another set. Any genuinely
realistic alternative however would be proposed quite independently of
the size of the sample used in the test.
Returning then to Neyman's piece of statistical inference, we can sum
up as follows. The logical structure of the example is clear. A composite
hypothesis H is proposed and tested 'in isolation', i.e. without formulating
precise alternatives. It is falsified by the test, and a new hypothesis H' is
produced by an ingenious heuristic argument. H' is in its turn tested, but
this time it is corroborated. Neyman has provided us with a model example
of scientific reasoning. It is strange that he does not see that it contradicts
his general theory of testing which is expounded only 20 pages later in the
same book.
Our second example is of exactly the same type as the first. We only give
it to show that cases of this type are not rare but on the contrary common
in statistical reasoning. It is taken from Cramér [2], p. 441. Cramér is
considering the distribution of the breadths of beans of Phaseolus vulgaris.
He has data for 12,000 of these beans. His first hypothesis is that the
breadths are normally distributed for some mean μ and standard deviation
σ. He applies the χ²-test (estimating the two parameters) to this hypothesis,
and a clear falsification results. Once again the standard Neyman-


Pearson method of generating alternative hypotheses by varying parameters

is not available, since all values of / and u are allowed as possibilities.


Further no other alternative hypotheses are suggested at this stage.
Cramér next argues thus. The normal distribution was suggested by
the following considerations. It is reasonable to suppose that the deviations
of the breadths of the beans from some mean value are caused by the
operation of a large number of chance factors. Suppose the ith of these
causes a small deviation ξi and there are n factors all told. Then the total
deviation ξ is given by

    ξ = ξ1 + ξ2 + ... + ξn

Now the central limit theorem yields that under certain very general
conditions ξ will tend to a normal distribution as n → ∞. But next suppose
that n is not large enough for the normal approximation to apply. Can we
get a better approximation to the distribution of sums like ξ? This mathematical
question had been investigated by Cramér in his 1937 Cambridge
Monograph, Random Variables and Probability Distributions, and it was
natural for him to apply the results to the case in hand. In fact we do get a
better approximation by adding to the normal frequency function
successive terms of the Edgeworth series. Consequently Cramér modified
his original hypothesis by adding the first term of the Edgeworth series.
He applied the χ²-test this time estimating 3 parameters. The result was
again a falsification. Cramér then added the first and second terms of the
Edgeworth series, applied the χ²-test estimating 4 parameters, and obtained
on this occasion a corroboration.
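The logic of Cramér's procedure can be made concrete with a short sketch
in Python. Everything here beyond the bare χ²-recipe is my own assumption:
the file name is hypothetical, and the parameters are fitted by moments
rather than by the χ²-minimum method which Cramér actually uses.

    import numpy as np
    from scipy import stats

    def edgeworth_cdf(x, mu, sigma, g1):
        # Normal CDF plus the first Edgeworth correction term (skewness g1).
        # z is clipped so the open-ended classes give 0, not 0*inf = NaN.
        z = np.clip((np.asarray(x, dtype=float) - mu) / sigma, -100.0, 100.0)
        return stats.norm.cdf(z) - stats.norm.pdf(z) * (g1 / 6.0) * (z**2 - 1.0)

    def chi2_gof(data, cdf, n_estimated, n_bins=20):
        # Chi-squared goodness of fit with open-ended extreme classes and
        # r - k - 1 degrees of freedom, as in the text of section 3.
        edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        observed = np.histogram(data, bins=edges)[0]
        expected = len(data) * np.diff(cdf(edges))
        chi2 = ((observed - expected) ** 2 / expected).sum()
        dof = n_bins - 1 - n_estimated
        return chi2, dof, stats.chi2.sf(chi2, dof)

    data = np.loadtxt('breadths.txt')      # hypothetical data file
    mu, sigma = data.mean(), data.std()    # moment estimates (a simplification)
    print(chi2_gof(data, lambda x: stats.norm.cdf((x - mu) / sigma), 2))
    print(chi2_gof(data, lambda x: edgeworth_cdf(x, mu, sigma, stats.skew(data)), 3))

A falsification at the first line followed by a corroboration at the second
would reproduce the pattern of Cramér's argument at its second stage.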

These then are my counter examples drawn from statistical practice.


Let us now pause for a moment and examine what objections could be
raised to my arguments against the Neyman-Pearson theory. An objector
might argue thus: "You have shown that there are certain cases of statistical
inference where the Neyman-Pearson model does not work very well, but
of course there are many cases where it is entirely satisfactory. We agree
that the Neyman-Pearson theory in general considers only simple alterna-
tives obtained by varying parameter values. It would certainly be nice to
have a theory which took into account more general alternatives-for
example distributions of different functional form. However such a theory
would involve too many mathematical difficulties to be possible at the
moment-though no doubt it will be possible in time. We are practical
men and must content ourselves with the best that can be done at the
moment. The trouble with your criticisms is that they are too negative.
You complain about the deficiencies of the Neyman-Pearson theory
without suggesting anything better to put in its place."
My answer is this. Firstly I deny that there are 'many cases where the
Neyman-Pearson model works well'. I would claim on the contrary that
it is hardly ever realistic, though I will not attempt to justify this assertion.
Secondly, to the criticism that I am attempting to demolish the Neyman-Pearson
theory without putting anything in its place, my reply is that, if the
proposed falsifying rule is accepted, the problem which the Neyman-Pearson
theory sets out to solve is no longer so serious. It can
moreover be dealt with by certain simple qualitative considerations without
introducing a complicated mathematical theory. I will now elaborate
this.

Let us suppose we have a statistical hypothesis H stating that a random
variable ξ has a certain distribution for some values of a set of k parameters
(μ1, ..., μk). Suppose further we have data consisting of a set (x1, ..., xn)
of values of ξ. How can we test H? In order to do so we must find a
statistic η, i.e. a function η(x1, ..., xn) of the sample which satisfies the
following three conditions:

(1) It must be possible to calculate mathematically the distribution D
of η given H, or at least to find a distribution D which is a good approximation
to η's true distribution.

(2) D must be independent of the parameters μ1, ..., μk.

(3) D must be a falsifiable distribution.

Our test then consists of regarding H as falsified if η ∈ C, where C is the
critical region associated with D.
The standard statistical tests of course satisfy these conditions. In the
χ²-test we calculate the value of the χ²-statistic, estimating the values of
the k parameters by the χ²-minimum method. Our mathematical theory
shows that the χ²-statistic if calculated in this way has approximately the
χ²-distribution with r−k−1 degrees of freedom, where r is the number of
classes employed. Further the χ²-distribution is falsifiable for a suitably
large number of degrees of freedom. To take another example, suppose
we are using the t-test on the hypothesis that ξ has a normal distribution
with zero mean and some standard deviation σ. We can then calculate
that the t-statistic, viz. (n−1)^½(x̄/s) where x̄ = (x1 + ... + xn)/n and
s² = (1/n) Σ (xi − x̄)², has the t-distribution with n−1 degrees of freedom.
The t-distribution is of course independent of σ. Further for suitably
large n it is falsifiable.
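Condition (2) is easy to check by simulation. The following sketch (Python;
the particular n, σ and number of repetitions are arbitrary choices of mine)
confirms that (n−1)^½(x̄/s) follows the t-distribution with n−1 degrees of
freedom whatever σ may be:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, reps, sigma = 10, 100_000, 2.5          # sigma should drop out entirely
    x = rng.normal(0.0, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1)                          # ddof=0, i.e. s**2 = (1/n) sum (x_i - xbar)**2
    t = np.sqrt(n - 1) * xbar / s

    # Kolmogorov-Smirnov comparison with the t-distribution, n-1 degrees of freedom
    print(stats.kstest(t, stats.t(df=n - 1).cdf))

A large p-value here reflects the fact that the distribution of the statistic
is indeed free of σ.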
My next point is that it is extremely difficult to find test-statistics
satisfying conditions (1), (2) and (3). In fact only a handful of such
statistics have been discovered for standard hypotheses, and we must
consequently admire all the more the skill of those men who were able
to find these statistics (i.e. K. Pearson, Student (W. S. Gosset), and R. A.
Fisher). Neyman and Pearson represent the situation as one in which we
have a very large number of possible tests and it is most important to
select out of this number those which are best in some sense. This is not
the case at all. Only a very few tests are available and it would require
considerable ingenuity to devise more. Consequently choosing a test will
probably not be difficult and, if it is difficult, this is more likely to be
because there is no test available than because there are a large number
from which we must select one. We do not suffer from an embarras de
richesses as far as tests are concerned.
It will no doubt be asked at this point: "Why, if there really are only a
few tests available, did Neyman and Pearson think there were so many?"
The answer of course is: "Because Neyman and Pearson required that
the rejection class C should have a low k-value (low level of significance),
but not that it should have a low i-value (low relative likelihood)." As we
have pointed out several times, if we demand only a low k-value for C
then a large number of choices of C become possible, many of them counter-
intuitive. To eliminate these counter-intuitive choices Neyman and
Pearson introduced their principle of alternative hypotheses, and required
that we choose only those C which have high power relative to some set of
alternatives. We however require that C should have not only a low
k-value but also a low i-value (and that certain further conditions should
be satisfied). This eliminates the counter-intuitive choices of C without
the need for considering alternative hypotheses. Further it reduces
drastically the number of possible tests. Tests can no longer be produced
more or less mechanically as in the Neyman-Pearson theory. Ingenuity is
necessary to devise a test. Consequently the problem of choosing between
different possible tests is no longer such a serious one. We do not however
wish to maintain that the problem disappears altogether. There may indeed
be situations where a number of different statistical tests are genuinely
available, and we want to choose one or two out of this number. What con-
siderations should guide our choice? My answer is that certain simple
qualitative considerations suffice for this purpose. There is no need to have
a precise mathematical theory to determine our choice. Moreover these
qualitative considerations will not in general involve a consideration of
alternative hypotheses. I will give an example of the kind of consideration
I have in mind in the next section.
Against this it could be objected that quantitative considerations are
preferable to qualitative ones, and hence the Neyman-Pearson theory to
the view just expounded. However the Neyman-Pearson theory is just
as dependent on qualitative considerations, as we shall now show. This
result follows in fact from a point made by Cox, but used by him for a
different purpose. I quote his remarks from L. J. Savage and others [9],
p. 84:
Suppose that our simple hypothesis says that the density of the observations is
f0(x), and that the test consists in calculating the function t(x) and regarding
large values of t(x) as evidence against a null hypothesis. Suppose we consider
the following family of hypotheses:

    fθ(x) = f0(x) e^(θt(x)) / ∫ f0(x) e^(θt(x)) dx

That is a family of hypotheses depending on the parameter θ; when θ = 0 it
reduces to the null hypothesis. Clearly the uniformly most powerful test of
θ = 0 is based on large values of t.

This shows that for null hypotheses of the type considered by Cox any
test whatever is uniformly most powerful relative to some set of alterna-
tive hypotheses. Thus the property of being uniformly most powerful can
only be significant if the set of alternatives introduced is in some sense
realistic as opposed to arbitrary and artificial. But how do we decide that a
set of alternatives is realistic? Only qualitative considerations will help us
here.
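To spell the point out, the reasoning behind Cox's remark is a short
application of the Neyman-Pearson lemma (a sketch, in standard notation):

    \[
      \frac{f_\theta(x)}{f_0(x)} \;=\; \frac{e^{\theta t(x)}}{C(\theta)},
      \qquad C(\theta) = \int f_0(x)\, e^{\theta t(x)}\, dx .
    \]

For each fixed θ > 0 this likelihood ratio increases with t(x) alone, so the
Neyman-Pearson lemma gives a region of the form t(x) > c as the most
powerful test of θ = 0 against that alternative; and since the region does
not depend on which θ > 0 was chosen, the test based on large values of t
is most powerful uniformly over the whole family.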

More generally on the view developed here we have to use qualitative


considerations to choose between the various possible tests. This may
seem a difficult task. However in the Neyman-Pearson theory we have to
choose between the various possible sets of alternative hypotheses. This is
just as difficult a task, and we are forced just as much to employ qualitative
considerations. Indeed it could be claimed that the task is harder in the
case of the Neyman-Pearson theory. As we remarked above (pp. 249-50)
highly artificial alternatives are often employed in the Neyman-Pearson
theory. Thus we apparently must use qualitative considerations to decide
between artificial and very artificial sets of alternatives. This is hardly an
inviting task.

4 A REPLY TO SOME OBJECTIONS OF NEYMAN'S


The time has now come to consider the objections of Neyman's which
were mentioned in the introduction.1 They are expounded in Neyman [6],
pp. 43-54. Broadly speaking Neyman is attacking the view that it is
possible to devise a good test of a hypothesis without taking into account
alternative hypotheses. As he puts it himself [6], p. 44:

It is known that some statisticians are of the opinion that good tests can be
devised by taking into consideration only the hypothesis tested. But my opinion
is that this is impossible ....
1 I am grateful to Colin Howson for first drawing my attention to these objections.

The view here under attack is certainly one to which I would subscribe.
It seems to be perfectly possible in some circumstances to devise a good
test of a statistical hypothesis without taking into account alternative
hypotheses. Indeed I would cite Neyman's χ²-test of the Poisson hypothesis
as an example of this. It therefore becomes necessary to try and refute
Neyman's arguments.
Neyman proceeds by proving two mathematical results, and then claiming
that these results raise impossible difficulties for the position he is
attacking. Like him we will begin by stating and proving these results,
giving in fact rather simpler proofs based on a method of Cramér's.
Throughout we will be concerned with the statistical hypothesis H that
x1, ..., xn are independent and normally distributed with zero mean and
standard deviation σ. To test this we would customarily consider the
t-statistic defined by t = (n−1)^½ x̄/s where

    x̄ = (x1 + ... + xn)/n
    s² = (1/n) Σ (xi − x̄)²

However Neyman considers the trivially different z-statistic z = x̄/s, and
we will follow him in this throughout the present section. The z-statistic
given H has the distribution with frequency function

    f(z) = [1/B(½(n−1), ½)] (1 + z²)^(−n/2)

We shall call this the z-dn with n−1 d. of fr. Neyman's two results
follow simply from the following lemma. In our statement of this lemma
and the subsequent theorems we will take z, x1, ..., xn to be as just defined.
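It is worth noting, before proceeding, that the z-dn is simply a rescaled
t-distribution: z = t/(n−1)^½. A quick simulation check (Python; the
parameter values are arbitrary choices of mine):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 12, 100_000
    x = rng.normal(0.0, 3.0, size=(reps, n))
    z = x.mean(axis=1) / x.std(axis=1)    # z = xbar/s with the (1/n) variance convention

    # The z-dn with n-1 d. of fr. is the t-distribution rescaled by (n-1)**-0.5
    print(stats.kstest(z, lambda v: stats.t.cdf(v * np.sqrt(n - 1), df=n - 1)))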

LEMMA. If

    √n x̄' = λ1x1 + ... + λnxn,  where λ1² + ... + λn² = 1

and

    n s'² = x1² + ... + xn² − n x̄'²

then x̄'/s' has the z-dn with n−1 d. of fr.

Proof (cf. Cramér [2], pp. 379-82). Consider y1, ..., yn where

    yi = ci1x1 + ci2x2 + ... + cinxn

and (cij) is an orthogonal transformation, i.e. a rotation in n-space. The
joint dn of y1, ..., yn is normal and we have

    E(yi) = 0
    E(yi yk) = σ² Σ cij ckj = σ² for i = k, and 0 for i ≠ k,

by definition of an orthogonal transformation. Thus the new variables yi
are uncorrelated. But they are normal. ∴ They are independent, i.e.
y1, ..., yn are (like x1, ..., xn) independent and normally distributed with
mean zero and standard deviation σ. But now set z1 = √n x̄' so that

    z1 = λ1x1 + ... + λnxn    (1)

Since λ1² + ... + λn² = 1, (1) is the first line of an orthogonal transformation.
Extend this to a complete orthogonal transformation, and let the corresponding
variables be z2, ..., zn. Then by the result just proved z1/(z2² + ... + zn²)^½
has the z-dn with n−1 d. of fr. This is a standard result which is in fact
usually used to introduce the z-dn (cf. Cramér [2], pp. 237-41). But
since the transformation is orthogonal

    x1² + ... + xn² = z1² + ... + zn²
    ∴ n s'² = x1² + ... + xn² − n x̄'² = z2² + ... + zn²
    ∴ x̄'/s' = z1/(z2² + ... + zn²)^½

and the result is proved. Neyman's two results can now be obtained
as the following two theorems.
THEOREM 1. In the situation under consideration we can find a statistic
ζ of the r.v.'s x1, ..., xn s.t.

(1) ζ, like z, has the z-dn with n−1 d. of fr.
(2) |zζ| ≤ 1.
Proof. Set

    √n x̄' = (x1 − x2)/√2
    n s'² = x1² + ... + xn² − n x̄'²
    ζ = x̄'/s'

Then by the lemma ζ has the z-dn with n−1 d. of fr. as required. The second
property (cf. Neyman [6], p. 50) follows from some simple algebraic
inequalities. We have for any real numbers a, b

    (a ± b)² ≥ 0
    ∴ 2(a² + b²) ≥ a² + b² ± 2ab = (a ± b)²

But now

    2n x̄'² = (x1 − x2)² = ((x1 − x̄) − (x2 − x̄))²
           ≤ 2{(x1 − x̄)² + (x2 − x̄)²}
           ≤ 2 Σ (xi − x̄)² = 2n s²

    ∴ x̄'² ≤ s²    (2)

However

    s'² + x̄'² = s² + x̄² = (x1² + ... + xn²)/n

    ∴ (2) yields x̄² ≤ s'²    (3)

Multiplying (2) and (3) and dividing by s² s'² gives

    (x̄²/s²)(x̄'²/s'²) ≤ 1

    ∴ |zζ| ≤ 1 as required.
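The inequality |zζ| ≤ 1 can also be confirmed numerically. A minimal
sketch (Python; sample size and seed are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 8
    for _ in range(5):
        x = rng.normal(0.0, 1.0, n)
        z = x.mean() / x.std()                    # z = xbar/s, s**2 = (1/n) sum (x_i - xbar)**2
        xbar_p = (x[0] - x[1]) / np.sqrt(2 * n)   # sqrt(n) xbar' = (x1 - x2)/sqrt(2)
        s_p = np.sqrt((x**2).mean() - xbar_p**2)  # n s'**2 = sum x_i**2 - n xbar'**2
        zeta = xbar_p / s_p
        print(round(z * zeta, 6), abs(z * zeta) <= 1.0)

Every line of output shows a product lying in [−1, 1], as Theorem 1
requires.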
THEOREM 2. Suppose we have a sample x1, ..., xn at least one member of
which is different from zero. Then we can define a statistic ζ0 s.t.

(1) ζ0 has the z-dn with n−1 d. of fr.
(2) ζ0 takes the value +∞ for the observed sample.

Proof. Set λi = xi/(x1² + ... + xn²)^½, i = 1, ..., n, in the
above lemma, and set ζ0 = x̄'/s'. By the lemma ζ0 has the z-dn with
n−1 d. of fr. However the observed value of ζ0 is infinite, since the observed
value of √n x̄' = (x1² + ... + xn²)^½ is non-zero by hypothesis, and that of
n s'² = x1² + ... + xn² − n x̄'² is zero.
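Again a few lines of Python (the sample values are my own arbitrary
choice) make the construction vivid: the denominator s' vanishes
identically, so ζ0 = x̄'/s' is infinite for whatever sample is fed in:

    import numpy as np

    x = np.array([0.3, -1.2, 0.7, 2.1])          # any sample, not all zeros
    lam = x / np.sqrt((x**2).sum())              # lambda_i = x_i/(sum x_j**2)**0.5, so sum lam**2 = 1
    xbar_p = (lam * x).sum() / np.sqrt(len(x))   # sqrt(n) xbar' = (sum x_i**2)**0.5 > 0
    s_p2 = (x**2).mean() - xbar_p**2             # s'**2 = (sum x_i**2)/n - xbar'**2 = 0
    print(xbar_p, s_p2)                          # s'**2 is zero up to rounding: zeta_0 = +infinity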

We must now explain the supposed paradoxes which Neyman derives from these
results. Let us take Theorem 1 first. ζ has the z-dn with n−1 d. of fr.
Therefore apparently we can use ζ to give a test of H, rejecting H if the
observed value of |ζ| is improbably large. There is of course the standard
z-test which consists in rejecting H if |z| is improbably large. But now by
Theorem 1 |zζ| ≤ 1. Thus if |z| is large enough for a falsification, |ζ| will
be small and the ζ-test will result in a corroboration, and of course vice
versa. As the two criteria can thus lead to contradictory conclusions, it
looks as if it is necessary to choose between them. As Neyman puts it
([6], p. 45):

Whenever one of these criteria has the value suggesting 'disproving' the
hypothesis tested, the values of the other criterion are the 'proving'
ones. This last circumstance will make it necessary to choose one of the criteria.

But now let us examine how we might choose between the two tests. If we
consider only H, then apparently there is nothing to choose between them,
as they both have the same distribution given H. There thus seems to be
need for an appeal to possible alternative hypotheses. Indeed on
the Neyman-Pearson theory, the solution is as follows. Introduce the alternative
hypotheses Hμ that x1, ..., xn are independent and normally distributed
with some mean μ ≠ 0 and arbitrary standard deviation σ. Then relative
to Hμ the z-test is UMPU¹ but the ζ-test does not have this
Neyman-Pearson property. Thus we would naturally select the z-test.
But how could we make our choice without considering alternative
hypotheses? Or, to put it another way, how could we eliminate the
intuitively unsatisfactory ζ-test?

¹ Uniformly most powerful unbiased.
The difficulties raised by Theorem 2 are even easier to see. ζ0 has the
z-dn with n−1 d. of fr. Thus we can apparently base a test on this
statistic. But as ζ0 has the value +∞ for the given sample, the hypothesis
will fail the ζ0-test. Thus, whatever the sample, we can apparently obtain a
test which will result in H being falsified, and the whole testing procedure
is invalidated. Of course Neyman's solution would be to eliminate ζ0 as
not being, for example, an UMPU test.
These then are the difficulties which Neyman raises for those who
reject the principle of alternative hypotheses as embodied in the Neyman-Pearson
theory. We will now try to resolve these apparent paradoxes,
starting with those generated by Theorem 1. Neyman points out the
peculiar relationship between the z-test and the ζ-test: namely that if H is
falsified by the z-test it is corroborated by the ζ-test and vice versa.
He concludes that this makes it necessary to choose between the two tests.
As he puts it in a sentence already quoted (Neyman [6], p. 45):

This last circumstance will make it necessary to choose one of the criteria.

Against this I maintain that the ζ-test and the z-test are both entirely
valid tests. It would be possible with justice to apply either or both of
them. The relationship between the two tests appears strange at first
sight, but there is in fact nothing paradoxical about it. To establish this I
propose to consider a hypothesis drawn from an unproblematic area of
deterministic physics; to describe two tests T1 and T2 which everyone
would accept as valid; and to show that T1 and T2 are related in the same
way as Neyman's z-test and ζ-test.

The example I have in mind is none other than Galileo's law that
falling bodies have in vacuo a constant acceleration of 981 cm/sec². We
might be able to calculate from this law and certain assumed laws concerning
the fracture of glass that if a steel ball of a certain size is dropped from a
height h it will acquire a velocity sufficient to shatter and pass through a
glass plate of thickness less than a, but that a glass plate of thickness
greater than b will stop the ball without shattering. Now define tests T1 and
T2 as follows:

T1: Drop a steel ball of the given size from height h on to a glass plate
of thickness a1 where a1 < a. If the plate shatters and the ball
continues its downward course Galileo's law is confirmed. If the plate
stops the ball Galileo's law is falsified.

T2: Drop a steel ball of the given size from height h on to a glass plate
of thickness b1 where b1 > b. If the plate stops the ball Galileo's
law is confirmed. If the ball shatters the plate and passes through,
Galileo's law is falsified.

T1 and T2 are admittedly somewhat involved and impractical tests, but


there is nothing wrong or paradoxical about them. We could with entire
validity use either or both of them as tests of Galileo's law. But now
observe that if T1 falsifies Galileo's law, T2 is bound to corroborate it, and
vice versa. I conclude that there is nothing necessarily paradoxical about
two tests being related in this way. It does not show that one of the tests
is bad, or that we have to choose between them.
We see then that Neyman's claim that we must choose between the
z-test and the ζ-test is not valid. Indeed it seems to me that we are quite
entitled to use either or both of the two tests. This would be certainly
true if we drew two samples of size n, applied the z-test using one sample
and the ζ-test using the other. What is perhaps more doubtful is whether
we can apply both tests to the same sample. In fact this raises the general
problem of whether, given a single sample, we can use a variety of tests
based on different statistics. If we do apply more than one test, we are,
or at least might be, in a certain sense increasing the k-value employed.
As we wish to keep the k-value below a certain level, this rather calls into
question the whole procedure. However in many situations it does seem
reasonable to use a number of test statistics with only one sample. As far
as I can see no hard and fast rules can be laid down here, and it is a matter
which must be left to common sense.
So then there is no positive necessity for choosing between the z-test
and the ζ-test. However we would probably wish to choose between them
in practice, and in fact to select the z-test rather than the ζ-test. My next
point is that a reason can be given for preferring the z-test to the ζ-test
which does not involve alternative hypotheses. This refutes Neyman's
claim that to make the choice we need to consider possible alternatives.
What I have in mind is an application of the principle that we should
prefer tests based on statistics which either measure quantities of practical
interest or are closely related to such quantities. Now suppose, as would
typically be the case in practice, that xi measured the difference in yield
of two grains A and B planted in the two halves of an experimental plot.
Under these circumstances x̄ measures for the sample the average increase
(or decrease) in yield of A relative to B. Now this average increase or
decrease is obviously of great practical importance because applied to the
whole crop it will determine the gain or loss which will result if we use
A rather than B. On the other hand x̄', where √n x̄' = (x1 − x2)/√2, does not give a

measure for the sample of any quantity of practical importance. Applying
the above principle we would prefer the z-test, which is based on a normalised
version of x̄, to the ζ-test, which is correspondingly related to x̄'. This
is an example of the 'qualitative considerations' mentioned at the end of
section 3.
Let us now pass to the problems created by Theorem 2. Neyman has
shown that if we examine the actual sample and design a test in the light
of the values obtained, then we can always manage to refute the underlying
statistical hypothesis. However it is a well-known fact of statistical practice
that if we design our tests in the light of the values obtained, we get
misleading results. This is shown by the following quotation from Cochran
and Cox ([1], pp. 73-4):
In order that F- and t-tests be valid, the tests to be made in an experiment should
be chosen before the results have been inspected. The reason for this is not hard to
see. If tests are selected after inspection of the data, there is a natural tendency to
select comparisons that appear to give large differences. Now large apparent differ-
ences may arise either because there are large real effects, or because of a fortuitous
combination of the experimental errors. Consequently, in so far as differences are
selected just because they seem to be large, it is likely that an undue proportion
of the cases selected will be those where the errors have combined to make the
differences large. The extreme case most commonly cited is that of the experi-
menter who always tests by an ordinary t-test, the difference between the
highest and lowest treatment means. If the number of treatments is large, this
difference will be substantial even when the treatments produce no real differences
in effect.

The obvious way out of these difficulties is the one suggested by Cochran
and Cox. We should lay down the rule that our tests of a given hypothesis
should be designed before the sample values are inspected. This rule has
great appeal for common sense, is actually used in statistical practice, and
avoids Neyman's second objection without an appeal to the principle of
alternative hypotheses.
I conclude that, although our approach is no doubt liable to many
objections, it is at least not refuted by those which Neyman raises.

King's College, Cambridge

REFERENCES

[1] COCHRAN, W. G. and COX, G. M. (1957) Experimental Designs. John Wiley and Sons,
Inc., New York.
[2] CRAMÉR, H. (1946) Mathematical Methods of Statistics. Princeton University Press.
[3] FISHER, R. A. (1959) Statistical Methods and Scientific Inference. Oliver & Boyd.
[4] LAKATOS, I. (1970) Falsification and the Methodology of Scientific Research Programmes,
in Criticism and the Growth of Knowledge. Eds. I. Lakatos and A. Musgrave,
pp. 91-195. Cambridge University Press.
[5] LEHMANN, E. L. (1959) Testing Statistical Hypotheses. John Wiley and Sons, Inc.,
New York.
[6] NEYMAN, J. (1952) Lectures and Conferences on Mathematical Statistics and Probability,
2nd Edition. Washington.
[7] NEYMAN, J. and PEARSON, E. S. (1967) The testing of statistical hypotheses in relation
to probabilities a priori, in Joint Statistical Papers of J. Neyman & E. S. Pearson,
pp. 186-202. Cambridge University Press.
[8] POPPER, K. R. (1959) The Logic of Scientific Discovery, 1934. English Edition:
Hutchinson.
[9] SAVAGE, L. J. et al. (1961) The Foundations of Statistical Inference. Methuen.
[10] TODHUNTER, I. (1949) A History of the Mathematical Theory of Probability. Chelsea.
