Corpus linguistics
A guide to the methodology
Anatol Stefanowitsch
Textbooks in Language Sciences 7

Language Science Press
Stefanowitsch, Anatol. 2020. Corpus linguistics: A guide to the methodology
(Textbooks in Language Sciences 7). Berlin: Language Science Press.
This title can be downloaded at:
http://langsci-press.org/catalog/book/148
© 2020, Anatol Stefanowitsch
Published under the Creative Commons Attribution-ShareAlike 4.0 Licence (CC
BY-SA 4.0): http://creativecommons.org/licenses/by-sa/4.0/
ISBN: 978-3-96110-224-2 (Digital)
978-3-96110-225-9 (Hardcover)
978-3-96110-226-6 (Softcover)
ISSN: 2364-6209
DOI:10.5281/zenodo.3735822
Source code available from www.github.com/langsci/148
Collaborative reading: paperhive.org/documents/remote?type=langsci&id=148
Contents

Acknowledgments

3.2.2.2 Length
3.2.2.3 Discourse status
3.2.2.4 Word senses
3.2.2.5 Animacy
3.2.2.6 Interim summary
3.3 Hypotheses in context: The research cycle

7 Collocation
7.1 Collocates
7.1.1 Collocation as a quantitative phenomenon
7.1.2 Methodological issues in collocation research
7.1.3 Effect sizes for collocations
7.1.3.1 Chi-square
7.1.3.2 Mutual Information
7.1.3.3 The log-likelihood ratio test
7.1.3.4 Minimum Sensitivity
7.1.3.5 Fisher’s exact test
7.1.3.6 A comparison of association measures
7.2 Case studies
7.2.1 Collocation for its own sake
7.2.1.1 Case study: Degree adverbs
7.2.2 Lexical relations
7.2.2.1 Case study: Near synonyms
7.2.2.2 Case study: Antonymy
7.2.3 Semantic prosody
7.2.3.1 Case study: True feelings
7.2.3.2 Case study: The verb cause
7.2.4 Cultural analysis
7.2.4.1 Case study: Small boys, little girls

8 Grammar
8.1 Grammar in corpora

9 Morphology
9.1 Quantifying morphological phenomena
9.1.1 Counting morphemes: Types, tokens and hapax legomena
9.1.1.1 Token frequency
9.1.1.2 Type frequency
9.1.1.3 Hapax legomena
9.1.2 Statistical evaluation
9.2 Case studies
9.2.1 Morphemes and stems
9.2.1.1 Case study: Phonological constraints on -ify
9.2.1.2 Case study: Semantic differences between -ic and -ical

10 Text
10.1 Keyword analysis
10.2 Case studies
10.2.1 Language variety
10.2.1.1 Case study: Keywords in scientific writing
10.2.1.2 Case study: [a + __ + of] in Scientific English
10.2.2 Comparing speech communities
10.2.2.1 Case study: British vs. American culture
10.2.2.2 Case study: “African” keywords
10.2.3 Co-occurrence of lexical items and demographic categories
10.2.3.1 Case study: A deductive approach to sex differences
10.2.3.2 Case study: An inductive approach to sex differences
10.2.4 Ideology
10.2.4.1 Case study: Political ideologies
10.2.4.2 Case study: The importance of men and women
10.2.5 Time periods
10.2.5.1 Case study: Verbs in the going-to future
10.2.5.2 Case study: Culture across time

11 Metaphor
11.1 Studying metaphor in corpora
11.2 Case studies
11.2.1 Source domains
11.2.1.1 Case study: Lexical relations and metaphorical mapping
11.2.1.2 Case study: Word forms in metaphorical mappings
11.2.1.3 Case study: The impact of metaphorical expressions

12 Epilogue

References

Index
Name index
Subject index
Preface
This book has been a long time in the making (the first version of the first chapter
has the date stamp 2005-08-04, 11:30), and as the field of corpus linguistics and
my own perspective on this field developed over this time span, many thoughts
accumulated that I intended to put into the preface when the time came to
publish. Now that this time has finally come, I feel that there is not much left to
say that I have not already said in the book itself.
However, given that there is, by now, a large number of corpus-linguistic text-
books available, ranging from the very decent to the excellent, a few words seem
in order to explain why I feel that it makes sense to publish another one. The
main reason is that I have found, in my many years of teaching corpus linguis-
tics, that most available textbooks are either too general or too specific. On the
one hand, there are textbooks that provide excellent discussions of the history
of corpus linguistics or the history of corpus design, or that discuss the episte-
mological status of corpus data in a field that has been dominated far too long
by generative linguistic ideas about what does and does not constitute linguistic
evidence. On the other hand, there are textbooks that focus on one or more specific
corpus-based techniques, discussing very specific phenomena (often the research
interests of the textbook authors themselves) using a narrow range of techniques
(often involving specific software solutions).
What I would have wanted and needed when I took my first steps into cor-
pus linguistic research as a student is an introductory textbook that focuses on
methodological issues – on how to approach the study of language based on
usage data and what problems to expect and circumvent. A book that discusses
the history and epistemology of corpus linguistics only to the extent necessary to
grasp these methodological issues and that presents case studies of a broad range
of linguistic phenomena from a coherent methodological perspective. This book
is my attempt to write such a textbook.
The first part of the book begins with an almost obligatory chapter on the
need for corpus data (a left-over from a time when corpus linguistics was still
somewhat of a fringe discipline). I then present what I take to be the method-
ological foundations that distinguish corpus linguistics from other, superficially
similar methodological frameworks, and discuss the steps necessary to build con-
crete research projects on these foundations – formulating the research question,
operationalizing the relevant constructs and deriving quantitative predictions,
extracting and annotating data, evaluating the results statistically and drawing
conclusions. The second part of the book presents a range of case studies from the
domains of lexicology, grammar, text linguistics and metaphor, including varia-
tionist and diachronic perspectives. These case studies are drawn from the vast
body of corpus linguistic research literature published over the last thirty years,
but they are all methodologically deconstructed and explicitly reconstructed in
terms of the methodological framework developed in the first part of the book.
While I refrain from introducing specific research tools (e.g. in the form of
specific concordancing or statistics software), I have tried to base these case
studies on publicly available corpora to allow readers to replicate them using
whatever tools they have at their disposal. I also provide supplementary online
material, including information about the corpora and corpus queries used as
well as, in many cases, the full data sets on which the case studies are based. At
the time of publication, the supplementary online material is available as a zip
file via http://stefanowitsch.net/clm/clm_v01.zip and https://osf.io/89mgv and as
a repository on GitHub via https://github.com/astefanowitsch/clm_v01. I hope
that it will remain available at least at one of these locations for the foreseeable
future. Feel free to host the material in additional locations.
I hope that the specific perspective taken in this book, along with the case
studies and the possibility to study the full data sets, will help both beginning and
seasoned researchers gain an understanding of the underlying logic of corpus
linguistic research. If not – the book is free as in beer, so at least you will not
have wasted any money on it. It is also free as in speech – the Creative Commons
license under which it is published allows you to modify and build on the content,
remixing it into the textbook you would have wanted and needed.
Acknowledgments
This book was written, with interruptions, over a period of 15 years, during which
I used many versions of it in my seminars and communicated with dozens of col-
leagues about methodological issues I came up against in writing it. My thanks go
to my students at the Universität Bremen, the Universität Hamburg and the Freie
Universität Berlin and to Harald Baayen, Claudia Baldermann, Michael Barlow,
Timothy Colleman, Alice Deignan, Holger Diessel, Stefan Evert, Kerstin Fischer,
Susanne Flach, Adele Goldberg, Juliana Goschler, Stefan Gries, Beate Hampe, Ste-
fan Hartmann, Thomas Herbst, Jürgen Hermes, Martin Hilpert, Veronika Koller,
Jeong-Hwa Lee, Yoo Yung Lee, Britta Mondorf, Carlos Nash, Erich Neuwirth,
Klaus-Uwe Panther, Günter Radden, Alexander Rauhut, Ada Rohde, Doris Schö-
nefeld, Peter Uhrig, Daniel Wiechmann, Stefanie Wulff, Hilary Young, Arne Ze-
schel, and to everyone I have forgotten here. A few of these people deserve a
more detailed mention for the substantial roles they played in the emergence
of this book. Klaus-Uwe Panther and Günter Radden encouraged my first steps
into corpus linguistics when supervising my master’s thesis, and Michael Barlow
continued this encouragement during my time as a Ph.D. student at Rice Uni-
versity. My fellow student Ada Rohde and I spent many hours sifting through
concordances for our dissertations and discussing questions of retrieval and an-
notation over a beer or two. My collaboration with Stefan Gries during the first
decade of the new millennium was an exciting and most instructive time in my
research career. Susanne Flach read and commented on every chapter of the final
version of the book, which I started preparing in 2014, and she also introduced
me to the excellent Corpus Workbench environment, without which the book
would have taken another 15 years. Yoo Yung Lee proof-read the manuscript and
checked many of the queries. Two anonymous reviewers, one of whom revealed
himself to be Stefan Hartmann, provided stern but constructive comments on
the manuscript. Stefan Müller’s founding of the open-access Language Science
Press provided me with an incentive to finally get serious about finishing this
book, and Felix Kopecky and Sebastian Nordhoff always remained friendly and
supportive when LaTeX drove me to despair. Finally, Juliana Goschler has been,
and continues to be, a wonderful colleague and the light of my life.
1 The need for corpus data
Broadly speaking, science is the study of some aspect of the (physical, natural
or social) world by means of systematic observation and experimentation, and
linguistics is the scientific study of those aspects of the world that we summarize
under the label language. Again very broadly, these encompass, first, language
systems (sets of linguistic elements and rules for combining them) as well as men-
tal representations of these systems, and second, expressions of these systems
(spoken and written utterances) as well as mental and motorsensory processes
involved in the production and perception of these expressions. Some linguists
study only the linguistic system, others study only linguistic expressions. Some
linguists study linguistic systems as formal entities, others study them as mental
representations. Some linguists study linguistic expressions in their social and/
or cultural contexts, others study them in the context of production and compre-
hension processes. Everyone should agree that whatever aspect of language we
study and from whatever perspective we do so, if we are doing so scientifically,
observation and experimentation should have a role to play.
Let us define a corpus somewhat crudely as a large collection of authentic text
(i.e., samples of language produced in genuine communicative situations), and
corpus linguistics as any form of linguistic inquiry based on data derived from
such a corpus. We will refine these definitions in the next chapter to a point where
they can serve as the foundation for a methodological framework, but they will
suffice for now.
Defined in this way, corpora clearly constitute recorded observations of lan-
guage behavior, so their place in linguistic research seems so obvious that anyone
unfamiliar with the last sixty years of mainstream linguistic theorizing will won-
der why their use would have to be justified at all. I cannot think of any other
scientific discipline whose textbook authors would feel compelled to begin their
exposition by defending the use of observational data, and yet corpus linguistics
textbooks often do exactly that.
The reasons for this defensive stance can be found in the history of the field,
which until relatively recently has been dominated by researchers interested
mainly in language as a formal system and/or a mental representation of such
a system. Among these researchers, the role of corpus data, and the observa-
tion of linguistic behavior more generally is highly controversial. While there
are formalists who have discovered (or are beginning to discover) the potential
of corpus data for their research, much of the formalist literature has been, and
continues to be, at best dismissive of corpus data, at worst openly hostile. Corpus
data are attacked as being inherently flawed in ways and to an extent that leaves
them with no conceivable use at all in linguistic inquiry.
In this literature, the method proposed instead is that of intuiting linguistic
data. Put simply, intuiting data means inventing sentences exemplifying the phe-
nomenon under investigation and then judging their grammaticality (roughly,
whether the sentence is a possible sentence of the language in question). To put
it mildly, inventing one’s own data is a rather subjective procedure, so, again,
anyone unfamiliar with the last sixty years of linguistic theorizing might won-
der why such a procedure was proposed in the first place and why anyone would
consider it superior to the use of corpus data.
Readers familiar with this discussion or readers already convinced of the need
for corpus data may skip this chapter, as it will not be referenced extensively in
the remainder of this book. For all others, a discussion of both issues – the alleged
uselessness of corpus data and the alleged superiority of intuited data – seems
indispensable, if only to put them to rest in order to concentrate, throughout
the rest of this book, on the vast potential of corpus linguistics and the exciting
avenues of research that it opens up.
Section 1.1 will discuss four major points of criticism leveled at corpus data.
As arguments against corpus data, they are easily defused, but they do point
to aspects of corpora and corpus linguistic methods that must be kept in mind
when designing linguistic research projects. Section 1.2 will discuss intuited data
in more detail and show that it does not solve any of the problems associated
(rightly or wrongly) with corpus data. Instead, as Section 1.3 will show, intuited
data actually creates a number of additional problems. Still, intuitions we have
about our native language (or other languages we speak well) can nevertheless
be useful in linguistic research – as long as we do not confuse them with “data”.
1.1 Arguments against corpus data

The major points of criticism leveled at corpus data can be summarized as follows:

1. corpora are usage data and thus of no use in studying linguistic knowledge;

2. corpora and the data derived from them are necessarily incomplete;

3. corpora contain only linguistic forms, but no information about the meaning of these forms; and

4. corpora do not contain negative evidence, i.e., they can tell us what is possible in a given language, but not what is not possible.
I will discuss the first three points in the remainder of this section. A fruitful
discussion of the fourth point requires a basic understanding of statistics, which
will be provided in Chapters 5 and 6, so I will postpone it and come back to it in
Chapter 8.
The first point – that corpus data are usage data and therefore of no use in studying linguistic knowledge – is stated most famously by Chomsky:

The speaker has represented in his brain a grammar that gives an ideal
account of the structure of the sentences of his language, but, when actually
faced with the task of speaking or “understanding”, many other factors act
upon his underlying linguistic competence to produce actual performance.
He may be confused or have several things in mind, change his plans in
midstream, etc. Since this is obviously the condition of most actual linguistic
performance, a direct record – an actual corpus – is almost useless, as it stands,
for linguistic analysis of any but the most superficial kind (Chomsky 1964: 36,
emphasis mine).
This argument may seem plausible at first glance, but it is based on at least
one of two assumptions that do not hold up to closer scrutiny: first, that there
is an impenetrable bi-directional barrier between competence and performance,
and second, that the influence of confounding factors on linguistic performance
cannot be identified in the data.
The assumption of a barrier between competence and performance is a central
axiom in generative linguistics, which famously assumes that language acquisi-
tion depends on input only minimally, with an innate “universal grammar” doing
3
1 The need for corpus data
most of the work. This assumption has been called into question by a wealth of
recent research on language acquisition (see Tomasello 2003 for an overview).
But even if we accept the claim that linguistic competence is not derived from
linguistic usage, it would seem implausible to accept the converse claim that lin-
guistic usage does not reflect linguistic competence (if it did not, this would raise
the question what we need linguistic competence for at all).
This is where the second assumption comes into play. If we believe that lin-
guistic competence is at least broadly reflected in linguistic performance, as I
assume any but the most hardcore generativist theoreticians do, then it should
be possible to model linguistic knowledge based on observations of language
use – unless there are unidentifiable confounding factors distorting performance,
making it impossible to determine which aspects of performance are reflections
of competence and which are not. Obviously, confounding factors exist – the
confusion and the plan-changes that Chomsky mentions, but also others like
tiredness, drunkenness and all the other external influences that potentially in-
terfere with speech production. However, there is no reason to believe that these
factors and their distorting influence cannot be identified and taken into account
when drawing conclusions from linguistic corpora.¹
Corpus linguistics is in the same situation as any other empirical science with
respect to the task of deducing underlying principles from specific manifestations
influenced by other factors. For example, Chomsky has repeatedly likened lin-
guistics to physics, but physicists searching for gravitational waves do not reject
the idea of observational data on the basis of the argument that there are “many
other factors acting upon fluctuations in gravity” and that therefore “a direct
record of such fluctuations is almost useless”. Instead, they attempt to identify
these factors and subtract them from their measurements.
In any case, the gap between linguistic usage and linguistic knowledge would
be an argument against corpus data only if there were a way of accessing linguis-
tic knowledge directly and without the interference of other factors. Sometimes,
intuited data is claimed to fit this description, but as I will discuss in Section 1.2.1,
not even Chomsky himself subscribes to this position.
¹ In fact, there is an entire strand of experimental and corpus-based research that not only takes disfluencies, hesitations, repairs and similar phenomena into account, but actually treats them as objects of study in their own right. The body of literature produced by this research is so large that it makes little sense to even begin citing it in detail here, but cf. Kjellmer (2003), Corley & Stewart (2008) and Gilquin & De Cock (2011) for corpus-based approaches.
The second point of criticism concerns the incompleteness of corpora: no corpus, it is argued, can ever contain all grammatical sentences of a language, so corpus-based descriptions will necessarily be incomplete.

Let us set aside for now the problems associated with the idea of grammatical-
ity and simply replace the word grammatical with conventionally occurring (an
equation that Chomsky explicitly rejects). Even the resulting, somewhat weaker
statement is quite clearly true, and will remain true no matter how large a corpus
we are dealing with. Corpora are incomplete in at least two ways.
First, corpora – no matter how large – are obviously finite, and thus they can
never contain examples of every linguistic phenomenon. As an example, consider
the construction [it doesn’t matter the N] (as in the lines It doesn’t matter the
colour of the car / But what goes on beneath the bonnet from the Billy Bragg song
A Lover Sings).² There is ample evidence that this is a construction of British
English. First, Bragg, a speaker of British English, uses it in a song; second, most
native speakers of English will readily provide examples if asked; third, as the
examples in (1) show, a simple web query for ⟨ "it doesn't matter the" ⟩
will retrieve hits that have clearly been produced by native speakers of British
English and other varieties (note that I enclose corpus queries in angled brackets
in order to distinguish them from the linguistic expressions that they are meant
to retrieve from the corpus; a small sketch of how such a query might be run
programmatically follows the examples):
(1) a. It doesn’t matter the reasons people go and see a film as long as they go
and see it. (thenorthernecho.co.uk)
b. Remember, it doesn’t matter the size of your garden, or if you live in a
flat, there are still lots of small changes you can make that will benefit
wildlife. (avonwildlifetrust.org.uk)
² Note that this really is a grammatical construction in its own right, i.e., it is not a case of right-dislocation (as in It doesn’t matter, the color or It is not important, the color). In cases of right-dislocation, the pronoun and the dislocated noun phrase are co-referential and there is an intonation break before the NP (in standard English orthographies, there is a comma before the NP). In the construction in question, the pronoun and the NP are not co-referential (it functions as a dummy subject) and there is no intonation break (cf. Michaelis & Lambrecht 1996 for a detailed (non-corpus-based) analysis of the very similar [it BE amazing the N]).
c. It doesn’t matter the context. In the end, trust is about the person ex-
tending it. (clocurto.us)
d. It doesn’t matter the color of the uniform, we all work for the greater
good. (fw.ky.gov)
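To make the notion of a corpus query a little more concrete, the following is a
minimal sketch of how such a literal query might be run over a collection of
plain-text files (the file names and the exact regular expression are invented
for illustration; a web search engine or concordancer would add context display,
deduplication and much more):

    import re

    # A sketch only: search (hypothetical) plain-text files for the literal
    # query <"it doesn't matter the">, roughly the way a concordancer would.
    # Both the typewriter and the typographic apostrophe are allowed, since
    # texts vary in this respect.
    query = re.compile(r"\bit doesn['\u2019]t matter the\b", re.IGNORECASE)

    for path in ["text1.txt", "text2.txt"]:  # invented file names
        with open(path, encoding="utf-8") as f:
            for no, line in enumerate(f, start=1):
                if query.search(line):
                    print(f"{path}, line {no}: {line.strip()}")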
Yet, despite all this evidence, the construction is not attested in the 100-million-word British National Corpus (BNC): it is simply too rare to show up reliably in a corpus of this size.

Second, corpora are incomplete in a qualitative sense: they can only ever contain samples of those language varieties that their compilers chose to include. As an example, consider the transitive use of the verb croak ‘die’ in the sense ‘kill’, exemplified in (2):

(2) a. Because he was a skunk and a stool pigeon ... I croaked him just as he
was goin’ to call the bulls with a police whistle ... (Veiller, Within the
Law)
b. [Use] your bean. If I had croaked the guy and frisked his wallet, would
I have left my signature all over it? (Stout, Some Buried Cesar)
c. I recall pointing to the loaded double-barreled shotgun on my wall and
replying, with a smile, that I would croak at least two of them before
they got away. (Thompson, Hell’s Angels)
Very roughly, we might characterize this variety as tough guy talk, or perhaps
tough guy talk as portrayed in crime fiction (I have never come across an example
outside of this (sub-)genre). Neither of these varieties is prominent among the
text categories represented in the BNC, and therefore the transitive use of croak
‘die’ does not occur in this corpus.³

³ A kind of pseudo-transitive use with a dummy object does occur, however: He croaked it, meaning ‘he died’, and of course the major use of croak (‘to speak with a creaky voice’) occurs transitively.
The incompleteness of linguistic corpora must therefore be accepted and kept
in mind when designing and using such a corpus (something I will discuss in
detail in the next chapter). However, it is not an argument against the use of
corpora, since any collection of data is necessarily incomplete. One important
aspect of scientific work is to build general models from incomplete data and re-
fine them as more data becomes available. The incompleteness of observational
data is not seen as an argument against its use in other disciplines, and the ar-
gument gained currency in linguistics only because it was largely accepted that
intuited data are more complete. I will argue in Section 1.2.2, however, that this
is not the case.
The third point of criticism concerns the fact that corpora contain only the forms of linguistic expressions, not their meanings. Lakoff puts it as follows:

Corpus linguistics can only provide you with utterances (or written letter
sequences or character sequences or sign assemblages). To do cognitive lin-
guistics with corpus data, you need to interpret the data – to give it meaning.
The meaning doesn’t occur in the corpus data. Thus, introspection is always
used in any cognitive analysis of language [...] (Lakoff 2004).
Lakoff (and others putting forward this argument) are certainly right: if the
corpus itself was all we had, corpus linguistics would be reduced to the detection
of formal patterns (such as recurring combinations) in otherwise meaningless
strings of symbols.
There are cases where this is the best we can do, namely, when dealing with
documents in an unknown or unidentifiable language. An example is the Phais-
tos disc, a clay disk discovered in 1908 in Crete. The disc contains a series of
symbols that appear to be pictographs (but may, of course, have purely phono-
logical value), arranged in an inward spiral. These pictographs may or may not
present a writing system, and no one knows what language, if any, they may
represent (in fact, it is not even clear whether the disc is genuine or a fake). How-
ever, this has not stopped a number of scholars from linguistics and related fields
from identifying a number of intriguing patterns in the series of pictographs and
some general parallels to known writing systems (see Robinson (2002: ch. 11) for
a fairly in-depth popular account). Some of the results of this research are sug-
gestive and may one day enable us to identify the underlying language and even
decipher the message, but until someone does so, there is no way of knowing if
the theories are even on the right track.
It hardly seems desirable to put ourselves in the position of a Phaistos disc
scholar artificially, by excluding from our research designs our knowledge of
English (or whatever other language our corpus contains); it is quite obvious that
we should, as Lakoff (2004) says, interpret the data in the course of our analysis.
But does this mean that we are using introspection in the same way as someone
inventing sentences and judging their grammaticality?
I think not. We need to distinguish two different kinds of introspection: (i)
intuiting, i.e. the practice of introspectively accessing one’s linguistic experience in
order to create sentences and assign grammaticality judgments to them; and (ii)
interpreting, i.e. the practice of assigning an interpretation (in semantic and prag-
matic terms) to an utterance. These are two very different activities, and there is
good reason to believe that speakers are better at the second activity than at the
first: interpreting linguistic utterances is a natural activity – speakers must inter-
pret everything they hear or read in order to understand it; inventing sentences
and judging their grammaticality is not a natural activity – speakers never do it
outside of papers on grammatical theory. Thus, one can take the position that
interpretation has a place in linguistic research but intuition does not. Neverthe-
less, interpretation is a subjective activity and there are strict procedures that
must be followed when including its results in a research design. This issue will
be discussed in more detail in Chapter 4.
As with the two points of criticism discussed in the preceding subsections, the
problem of interpretation would be an argument against the use of corpus data
only if there were a method that avoids interpretation completely or that at least
allows for interpretation to be made objective.
1.2 Intuition
Intuited data would not be the only alternative to corpus data, but it is the one
proposed and used by critics of the latter, so let us look more closely at this
method. Jackendoff describes it as follows:
[A]mong the kinds of experiments that can be done on language, one kind is
very simple, reliable, and cheap: simply present native speakers of a language
with a sentence or phrase, and ask them to judge whether or not it is grammat-
ical in their language or whether it can have some particular meaning. [...]
The idea is that although we can’t observe the mental grammar of English
itself, we can observe the judgments of grammaticality and meaning that
are produced by using it (Jackendoff 1994: 47, emphasis mine).
Ideally, we might want to check these experiments out by asking large num-
bers of people under controlled circumstances, and so forth. But in fact the
method is so reliable that, for a very good first approximation, linguists tend
to trust their own judgments and those of their colleagues (Jackendoff 1994:
48).
It is certainly true that linguists trust their own judgments, but that does not
mean, of course, that this trust is justified. There is little evidence that individual
grammaticality judgments are reliable: in the linguistic literature, grammaticality
judgments of the same sentences by different authors often differ considerably
and the few studies that have investigated the reliability of grammaticality judg-
ments have consistently shown that such judgments display too much variation
within and across speakers to use them as linguistic data (cf., e.g., Schütze (1996)
(reissued under a Creative-Commons license by Language Science Press in 2016),
esp. Ch. 3 on factors influencing grammaticality judgments, and Cowart (1997)).
However, the fact that something can be done quickly and effortlessly does
not make it a good scientific method. If one is serious about using grammaticality
judgments – and there are research questions that are not easily addressed with-
out them –, then these judgments must be made as reliable as possible; among
other things, this involves the two aspects mentioned by Jackendoff in passing:
first, asking large numbers of speakers (or at least more than one) and, second,
controlling the circumstances under which they are asked (cf. Schütze 1996 and
Cowart 1997 for detailed suggestions as to how this is to be done and Bender
2005 for an interesting alternative; cf. also Section 4.2.3 in Chapter 4). In order
to distinguish such empirically collected introspective data from data intuited
by the researcher, I will refer to the former as elicitation data and continue to
reserve for the latter the term intuition or intuited “data”.
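To make the notion of reliability concrete, here is a small sketch (with invented
ratings) of one way in which elicitation data from several speakers might be
screened for consistency – raw pairwise agreement on a five-point grammaticality
scale; a real study would use a chance-corrected coefficient such as Cohen’s kappa:

    from itertools import combinations

    # Invented ratings: four speakers judge the same five sentences on a
    # scale from 1 (impossible) to 5 (perfectly natural).
    ratings = {
        "speaker1": [5, 4, 1, 2, 5],
        "speaker2": [5, 5, 1, 1, 4],
        "speaker3": [4, 4, 2, 1, 5],
        "speaker4": [5, 4, 1, 2, 5],
    }

    # Raw pairwise agreement: the proportion of sentences on which two
    # speakers assign exactly the same rating.
    for (a, ra), (b, rb) in combinations(ratings.items(), 2):
        agreement = sum(x == y for x, y in zip(ra, rb)) / len(ra)
        print(f"{a} vs. {b}: {agreement:.0%}")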
In sum, there are serious problems with the reliability of linguistic intuition
in general, a point I will briefly return to in Section 1.3. In the case of isolated
judgments by the researchers themselves, these problems are compounded by two
additional ones: first, the researchers are language experts, whose judgments will
hardly be representative of the average native speaker – as Ronald Langacker
has quipped (in an example sentence meant to illustrate syntactic complexity):
“Linguists are no different from any other people who spend nineteen hours a
day pondering the complexity of grammar [...]” (Langacker 1973: 109). Second,
they will usually know what it is that they want to prove, and this will distort
their judgments. Thus, expert judgments should be used with extreme caution (cf.
Labov 1996) if at all (Schütze 1996), instead of serving as the default methodology
in linguistics.
Let us return to the second assumption in the passage quoted above – that
grammaticality judgments are transparently related to the mental grammar of
the speaker producing them. In particular, let us discuss whether intuited “data”
fare better than corpus data in terms of the three major points of criticism dis-
cussed in the preceding section.
As far as the second point is concerned, recall that corpora are incomplete in a quantitative sense (since any corpus is finite in size) and in a qualitative sense (since even the most carefully
constructed corpus is skewed with respect to the language varieties it contains).
This incompleteness is not an argument against using corpora as such, but it
might be an argument in favor of intuited judgments if there was reason to be-
lieve that they are more complete.
To my knowledge, this issue has never been empirically addressed, and it
would be difficult to do so, since there is no complete data set against which intu-
ited judgments could be compared. However, it seems implausible to assume that
such judgments are more complete than corpus data. First, just like a corpus, the
linguistic experience of a speaker is finite and any mental generalizations based
on this experience will be partial in the same way that generalizations based on
corpus data must be partial (although it must be admitted that the linguistic ex-
perience a native speaker gathers over a lifetime exceeds even a large corpus
like the BNC in terms of quantity). Second, just like a corpus, a speaker’s lin-
guistic experience is limited to certain language varieties: most English speakers
have never been to confession or planned an illegal activity, for example, which
means they will lack knowledge of certain linguistic structures typical of these
situations.
To exemplify this point, consider that many speakers of English are unaware
of the fact that there is a use of the verb bring that has the valency pattern (or
subcategorization frame) [bring NP_liquid [PP to the boil]] (in British English)
or [bring NP_liquid [PP to a boil]] (in American English). This use is essentially
limited to a single genre – recipes: of the 145 matches in the BNC, 142 occur in
recipes and the remaining three in narrative descriptions of someone following a
recipe. Thus, a native speaker of English who never reads cookbooks or cooking-
related journals and websites and never watches cooking shows on television
can go through their whole life without encountering the verb bring used in this
way. When describing the grammatical behavior of the verb bring based on their
intuition, this use would not occur to them, and if they were asked to judge the
grammaticality of a sentence like Half-fill a large pan with water and bring to the
boil [BNC A7D], they would judge it ungrammatical. Thus, this valency pattern
would be absent from their description in the same way that transitive croak ‘die’
or [it doesn’t matter the N] would be absent from a grammatical description based
on the BNC (where, as we saw in Section 1.1.2, these patterns do not occur).
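Readers who want to check distributional claims like this one can approximate
the relevant query themselves. The following sketch is not the query underlying
the counts reported above, and the corpus file name is invented (the BNC is
distributed as annotated XML, which requires proper tools or preprocessing),
but it illustrates the general logic:

    import re

    # Match all inflected forms of BRING, up to four intervening words
    # (e.g. "bring the water to the boil"), followed by "to the boil"
    # (British English) or "to a boil" (American English).
    pattern = re.compile(
        r"\b(?:bring|brings|bringing|brought)\b"
        r"(?:\s+\S+){0,4}?"
        r"\s+to\s+(?:the|a)\s+boil\b",
        re.IGNORECASE,
    )

    with open("bnc_plaintext.txt", encoding="utf-8") as f:  # invented name
        hits = pattern.findall(f.read())
    print(len(hits), "matches")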
If this example seems too speculative, consider Culicover’s analysis of the
phrase no matter (Culicover 1999: 106f.). Culicover is an excellent linguist by any
standard, but he bases his intricate argument concerning the unpredictable na-
ture of the phrase no matter on the claim that the construction [it doesn’t matter
the N] is ungrammatical. If he had consulted the BNC, he might be excused for coming to this conclusion, since, as we saw in Section 1.1.2, the construction does indeed not occur there.⁴
Intuited judgments fare no better with respect to the problem of interpretation: an invented sentence, just like a corpus example, must be interpreted before its grammaticality can be judged. Consider the authentic example in (3a) and the two structural analyses available for its final clause:

(3) a. When she’d first moved in she hadn’t cared about anything, certainly
not her surroundings – they had been the least of her problems – and
if the villagers hadn’t so kindly donated her furnishings she’d probably
still be existing in empty rooms. (BNC H9V)

b. [donated [NP her] [NP furnishings]] (‘gave furnishings to her’)

c. [donated [NP her furnishings]] (‘gave away furnishings belonging to her’)
⁴ Culicover is a speaker of American English, so if he were writing his book today, he might check the 450-million-word Corpus of Contemporary American English (COCA), first released in 2008, instead of the BNC. If he did, he would find a dozen or more instances of the construction, depending on which version he were to use – for example It doesn’t matter the number of zeros they attach to it, from a 1997 transcript of ABC Nightline – so he would not have to rely on his incomplete native-speaker intuition.
The clause [T]he villagers [...] donated her furnishings in (3a) can be judged
for its grammaticality only after disambiguating between the meanings associated
with the structures in (3b) and (3c).
The structure in (3b) is a ditransitive, which is widely agreed to be impossi-
ble with donate (but see Stefanowitsch 2007a), so the sentence would be judged
ungrammatical under this reading by the vast majority of English speakers. The
structure in (3c), in contrast, is a simple transitive, which is one of the two most
frequent valency patterns for donate, so the sentence would be judged grammat-
ical by all English speakers. The same would obviously be true if the sentence
was constructed rather than taken from a corpus.
But the semantic considerations that increase or decrease our willingness to
judge an utterance as grammatical are frequently more subtle than the difference
between the readings in (3b) and (3c).
Consider the example in (3d), which contains a clear example of donate with
the supposedly ungrammatical ditransitive valency pattern. Since this is an au-
thentic example, we cannot simply declare it ungrammatical; instead, we must
look for properties that distinguish this example from more typical uses of do-
nate and try to arrive at an explanation for such exceptional, but possible uses.
In Stefanowitsch (2007a), looking at a number of such exceptional uses, I suggest
that they may be made possible by the highly untypical sense in which the verb
donate is used here. In (3d) and other ditransitive uses, donate refers to a direct
transfer of something relatively valueless from one individual to another in a
situation of personal contact. This is very different from the typical use, where
a sum of money is transferred from an individual to an organization without
personal contact. If this were an intuited example, I might judge it grammatical
(at least marginally so) for similar reasons, while another researcher, unaware
of my subtle reconceptualization, would judge it ungrammatical, leading to no
insights whatsoever into the semantics of the verb donate or the valency patterns
it occurs in.
1.3 Intuition data vs. corpus data
When evaluating sources of linguistic data, two criteria are of central importance:

1. data reliability (roughly, how sure can we be that other people will arrive
at the same set of data using the same procedures);

2. data validity (roughly, how sure can we be that the data actually reflect
the phenomenon under investigation).⁵
As to the first criterion, note that the problem is not that intuition “data” are
necessarily wrong. Very often, intuitive judgments turn out to agree very well
with more objective kinds of evidence, and this should not come as a surprise.
After all, as native speakers of a language, or even as advanced foreign-language
speakers, we have considerable experience with using that language actively
(speaking and writing) and passively (listening and reading). It would thus be
surprising if we were categorically unable to make statements about the proba-
bility of occurrence of a particular expression.
Instead, the problem is that we have no way of determining introspectively
whether a particular piece of intuited “data” is correct or not. To decide this,
we need objective evidence, obtained either by serious experiments (including
elicitation experiments) or by corpus-linguistic methods. But if that is the case,
the question is why we need intuition “data” in the first place. In other words,
intuition “data” are simply not reliable.
The second criterion provides an even more important argument, perhaps the
most important argument, against the practice of intuiting. Note that even if we
manage to solve the problem of reliability (as systematic elicitation from a rep-
resentative sample of speakers does to some extent), the epistemological status
of intuitive data remains completely unclear.
⁵ Readers who are well-versed in methodological issues are asked to excuse this somewhat abbreviated use of the term validity; there are, of course, a range of uses in the philosophy of science and methodological theory for the term validity (we will encounter a different use from the one here in Chapters 2.3 and 4).
1.4 Corpus data in other sub-disciplines of linguistics
There are, admittedly, descriptions of individual dialects that are based on introspective data – the description of the
grammar of African-American English in Green (2002) is an impressive example.
But in the study of actual variation, systematically collected survey data (e.g.
Labov et al. 2006) and corpus data in conjunction with multivariate statistics
(e.g. Tagliamonte 2006) were considered the natural choice of data long before
their potential was recognized in other areas of linguistics.
The same is true of conversation and discourse analysis. One could theoreti-
cally argue that our knowledge of our native language encompasses knowledge
about the structure of discourse and that this knowledge should be accessible to
introspection in the same way as our knowledge of grammar. However, again,
no conversation or discourse analyst has ever actually taken this line of argu-
mentation, relying instead on authentic usage data.⁶

⁶ Perhaps Speech Act Theory could be seen as an attempt at discourse analysis on the basis of intuition data: its claims are often based on short snippets of invented conversations. The difference between intuition data and authentic usage data is nicely demonstrated by the contrast between the relatively broad but superficial view of linguistic interaction found in philosophical pragmatics and the rich and detailed view of linguistic interaction found in Conversation Analysis (e.g. Sacks et al. 1974, Sacks 1992) and other discourse-analytic traditions.
Even lexicographers, who could theoretically base their descriptions of the
meaning and grammatical behavior of words entirely on the introspectively ac-
cessed knowledge of their native language have not generally done so. Beginning
with the Oxford English Dictionary (OED), dictionary entries have been based
at least in part on citations – authentic usage examples of the word in question
(see Chapter 2).
If the incompleteness of linguistic corpora or the fact that corpus data have
to be interpreted were serious arguments against their use, these sub-disciplines
of linguistics should not exist, or at least, they should not have yielded any use-
ful insights into the nature of language change, language acquisition, language
variation, the structure of linguistic interactions or the lexicon. Yet all of these
disciplines have, in fact, yielded insightful descriptive and explanatory models
of their respective research objects.
The question remains, then, why grammatical theory is the only sub-discipline
of linguistics whose practitioners have rejected the common practice of building
models of underlying principles on careful analyses of observable phenomena. If
I were willing to speculate, I would consider the possibility that the rejection of
corpora and corpus-linguistic methods in (some schools of) grammatical theoriz-
ing is based mostly on a desire to avoid having to deal with actual data, which
are messy, incomplete and often frustrating, and that the arguments against the
use of such data are, essentially, post-hoc rationalizations. But whatever the case
may be, we will, at this point, simply stop worrying about the wholesale rejection
of corpus linguistics by some researchers until the time that they come up with
a convincing argument for this rejection, and turn to a question more pertinent
to this book: what exactly constitutes corpus linguistics?
2 What is corpus linguistics?
Although corpus-based studies of language structure can look back on a tradition
of at least a hundred years, there is no general agreement as to what exactly con-
stitutes corpus linguistics. This is due in part to the fact that the hundred-year
tradition is not an unbroken one. As we saw in the preceding chapter, corpora
fell out of favor just as linguistics grew into an academic discipline in its own
right and as a result, corpus-based studies of language were relegated to the mar-
gins of the field. While the work on corpora and corpus-linguistic methods never
ceased, it has returned to a more central place in linguistic methodology only
relatively recently. It should therefore come as no surprise that it has not, so far,
consolidated into a homogeneous methodological framework. More generally,
linguistics itself, with a tradition that reaches back to antiquity, has remained
a notoriously heterogeneous discipline with little agreement among researchers
even with respect to fundamental questions such as what aspects of language
constitute their object of study (recall the brief remarks at the beginning of the
preceding chapter). It is not surprising, then, that they do not agree how their
object of study should be approached methodologically and how it might be mod-
eled theoretically. Given this lack of agreement, it is highly unlikely that a unified
methodology will emerge in the field any time soon.
On the one hand, this heterogeneity is a good thing. The dogmatism that comes
with monolithic theoretical and methodological frameworks can be stifling to the
curiosity that drives scientific progress, especially in the humanities and social
sciences which are, by and large, less mature descriptively and theoretically than
the natural sciences. On the other hand, after more than a century of scientific
inquiry in the modern sense, there should no longer be any serious disagree-
ment as to its fundamental procedures, and there is no reason not to apply these
procedures within the language sciences. Thus, I will attempt in this chapter to
sketch out a broad, and, I believe, ultimately uncontroversial characterization of
corpus linguistics as an instance of the scientific method. I will develop this pro-
posal by successively considering and dismissing alternative characterizations
of corpus linguistics. My aim in doing so is not to delegitimize these alternative
characterizations, but to point out ways in which they are incomplete unless they
are embedded in a principled set of ideas as to what it means to study language
scientifically.
A good starting point is a definition like the one offered by McEnery & Wilson (2001: 1), who characterize corpus linguistics as “the study of language based on examples of real life language use”.

This definition is uncontroversial in that any research method that does not
fall under it would not be regarded as corpus linguistics. However, it is also very
broad, covering many methodological approaches that would not be described
as corpus linguistics even by their own practitioners (such as discourse analysis
or citation-based lexicography). Some otherwise similar definitions of corpus lin-
guistics attempt to be more specific in that they define corpus linguistics as “the
compilation and analysis of corpora.” (Cheng 2012: 6, cf. also Meyer 2002: xi),
suggesting that there is a particular form of recording “real-life language use”
called a corpus.
The first chapter of this book started with a similar definition, characterizing
corpus linguistics as “as any form of linguistic inquiry based on data derived
from [...] a corpus”, where corpus was defined as “a large collection of authentic
text”. In order to distinguish corpus linguistics proper from other observational
methods in linguistics, we must first refine this definition of a linguistic corpus;
this will be our concern in Section 2.1. We must then take a closer look at what
it means to study language on the basis of a corpus; this will be our concern in
Section 2.2.
2.1 The linguistic corpus
For the purposes of this book, let us define a linguistic corpus as a collection of texts that is large, representative of a particular language or language variety, and authentic.

In addition, the texts in such a collection are often (but not always) annotated
in order to enhance their potential for linguistic analysis. In particular, they may
contain information about paralinguistic aspects of the original data (intonation,
font style, etc.), linguistic properties of the utterances (parts of speech, syntactic
structure), and demographic information about the speakers/writers.
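As a purely invented illustration of what such annotation amounts to once a
corpus has been read into a program, consider the following sketch; the part-of-speech
tags are from the CLAWS5 tagset used in the BNC, but the utterance,
the metadata and the data structure are made up for the example:

    # One annotated utterance: each token carries linguistic annotation
    # (part-of-speech tag and lemma), and the utterance as a whole carries
    # demographic metadata about its speaker. All values are invented.
    utterance = {
        "speaker": {"age": 34, "sex": "f", "variety": "British English"},
        "tokens": [
            {"word": "She",    "pos": "PNP", "lemma": "she"},   # personal pronoun
            {"word": "boiled", "pos": "VVD", "lemma": "boil"},  # past-tense lexical verb
            {"word": "water",  "pos": "NN1", "lemma": "water"}, # singular common noun
        ],
    }

    # Annotation permits queries that plain text does not support, e.g.
    # retrieving all lexical verbs regardless of their word form:
    verbs = [t["lemma"] for t in utterance["tokens"] if t["pos"].startswith("VV")]
    print(verbs)  # ['boil']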
To distinguish this type of collection from other collections of texts, we will
refer to it as a linguistic corpus, and the term corpus will always refer to a linguistic
corpus in this book unless specified otherwise.
Let us now discuss each of these criteria in turn, beginning with authenticity.
2.1.1 Authenticity
The word authenticity has a range of meanings that could be applied to language
– it can mean that a speaker or writer speaks true to their character (He has found
his authentic voice), or to the character of the group they belong to (She is the
authentic voice of her generation), that a particular piece of language is correctly
attributed (This is not an authentic Lincoln quote), or that speech is direct and
truthful (the authentic language of ordinary people).
In the context of corpus linguistics (and often of linguistics in general), authen-
ticity refers much more broadly to what McEnery & Wilson (2001) call “real life
language use”. As Sinclair puts it, an authentic corpus is one whose material is
gathered from the genuine communications of people going about their normal
business, with the linguist causing no more than a “minimum disruption” in
acquiring the data – that is, one containing natural, spontaneous language use.
In contrast, performances for the linguist are assumed to distort language
behavior in ways that make them unsuitable for linguistic analysis.
In the case of written language, the criterion of authenticity is easy to satisfy.
Writing samples can be collected after the fact, so that there is no way for the
speakers to know that their language will come under scientific observation. In
the case of spoken language, the “minimum disruption” that Sinclair mentions
becomes relevant. We will return to this issue and its consequences for authen-
ticity presently, but first let us discuss some general problems with the corpus
linguist’s broad notion of authenticity.
Widdowson (2000), in the context of discussing the use of corpora in the lan-
guage classroom, casts doubt on the notion of authenticity for what seems, at
first, to be a rather philosophical reason:
The texts which are collected in a corpus have a reflected reality: they are
only real because of the presupposed reality of the discourses of which
they are a trace. This is decontextualized language, which is why it is only
partially real. If the language is to be realized as use, it has to be recontex-
tualized. (Widdowson 2000: 7)
In some sense, it is obvious that the texts in a corpus (in fact, all texts) are only
fully authentic as long as they are part of an authentic communicative situation.
A sample of spoken language is only authentic as part of the larger conversation
it is part of, a sample of newspaper language is only authentic as long as it is
produced in a newsroom and processed by a reader in the natural context of a
newspaper or news site for the purposes of informing themselves about the news,
and so on. Thus, the very act of taking a sample of language and including it in
a corpus removes its authenticity.
This rather abstract point has very practical consequences, however. First, any
text, spoken or written, will lose not only its communicative context (the dis-
course of which it was originally a part), but also some of its linguistic and par-
alinguistic properties when it becomes part of a corpus. This is most obvious in
the case of transcribed spoken data, where the very act of transcription means
that aspects like tone of voice, intonation, subtle aspects of pronunciation, facial
expressions, gestures, etc. are replaced by simplified descriptions or omitted al-
together. It is also true for written texts, where, for example, visual information
about the font, its color and size, the position of the text on the page, and the tac-
tile properties of the paper are removed or replaced by descriptions (see further
Section 2.1.4 below).
The corpus linguist can attempt to supply the missing information introspec-
tively, “recontextualizing” the text, as Widdowson puts it. But since they are
not in an authentic setting (and often not a member of the same cultural and
demographic group as the original or originally intended hearer/reader), this re-
contextualization can approximate authenticity at best.
Second, texts, whether written or spoken, may contain errors that were pres-
ent in the original production or that were introduced by editing before publi-
cation or by the process of preparing them for inclusion in the corpus (cf. also
Emons 1997). As long as the errors are present in the language sample before
it is included in the corpus, they are not, in themselves, problematic: errors are
part of language use and must be studied as such (in fact, the study of errors has
yielded crucial insights into language processing, cf., for example, Fromkin 1973
and Fromkin 1980). The problem is that the decision as to whether some bit of
language contains an error is one that the researcher must make by reconceptu-
alizing the speaker and their intentions in the original context, a reconceptual-
ization that makes authenticity impossible to determine.
This does not mean that corpora cannot be used. It simply means that limits of
authenticity have to be kept in mind. With respect to spoken language, however,
there is a more serious problem – Sinclair’s “minimum disruption”.
The problem is that in observational studies no disruption is ever minimal –
as soon as the investigator is present in person or in the minds of the observed,
we get what is known as the “observer’s paradox”: we want to observe people (or
other animate beings) behaving as they would if they were not observed – in the
case of gathering spoken language data, we want to observe speakers interacting
linguistically as they would if no linguist was in sight.
In some areas of study, it is possible to circumvent this problem by hiding (or
installing hidden recording devices), but in the case of human language users
this is impossible: it is unethical as well as illegal in most jurisdictions to record
people without their knowledge. Speakers must typically give written consent
before the data collection can begin, and there is usually a recording device in
plain view that will constantly remind them that they are being recorded.
This knowledge will invariably introduce a degree of inauthenticity into the
data. Take the following excerpts from the Bergen Corpus of London Teenage Lan-
guage (COLT). In the excerpt in (1), the speakers are talking about the recording
device itself, something they would not do in other circumstances:
(1) A: Josie?
B: Yeah. [laughs] I’m not filming you, I’m just taping you. [...]
A: Yeah, I’ll take your little toy and smash it to pieces!
C: Mm. Take these back to your class. [COLT B132611]
In the excerpt in (2), speaker A explains to their interlocutor that the
conversation they are having will be used for linguistic research.
A speaker’s knowledge that they are being recorded for the purposes of lin-
guistic analysis is bound to distort the data even further. In example (3), there
is evidence for such a distortion – the speakers are performing explicitly for the
recording device:
Speaker A asks for “true things” and then imitates an interview situation,
which speaker B takes up by using the somewhat formal phrase I protest, which
they presumably would not use in an authentic conversation about their love
life.
Obviously, such distortions will be more or less problematic depending on
our research question. Level of formality (style) may be easier to manipulate
in performing for the linguist than pronunciation, which is easier to manipulate
than morphological or syntactic behavior. However, the fact remains that spoken
data in corpora are hardly ever authentic in the corpus-linguistic sense (unless
they are based on recordings of public language use, for example, from television or
the radio), and the researcher must rely, again, on an attempt to recontextualize
the data based on their own experience as a language user in order to identify
possible distortions. There is no objective way of judging the degree of distortion
introduced by the presence of an observer, since we do not have a sufficiently
broad range of surreptitiously recorded data for comparison.
There is one famous exception to the observer’s paradox in spoken language
data: the so-called Nixon Tapes – illegal surreptitious recordings of conversation
in the executive offices of the White House and the headquarters of the opposing
Democratic Party produced at the request of the Republican President Richard
Nixon between February 1971 and July 1973. Many of these tapes are now avail-
able as digitized sound files and/or transcripts (see, for example, Nichter 2007).
In addition to the interest they hold for historians, they form the largest available
corpus of truly authentic spoken language.
However, even these recordings are too limited in size and topic area as well
as in the diversity of speakers recorded (mainly older white American males), to
serve as a standard against which to compare other collections of spoken data.
The ethical and legal problems in recording unobserved spoken language can-
not be circumvented, but their impact on the authenticity of the recorded lan-
guage can be lessened in various ways – for example, by getting general consent
from speakers, but not telling them when precisely they will be recorded.
Researchers may sometimes deliberately choose to depart from authenticity
in the corpus-linguistic sense if their research design or the phenomenon under
investigation requires it. A researcher may be interested in a phenomenon that is
so rare in most situations that even the largest available corpora do not contain a
sufficient number of cases. These may be structural phenomena (like the pattern
[It doesn’t matter the N] or transitive croak, discussed in the previous chapter), or
unusual communicative situations (for example, human-machine interaction).
In such cases, it may be necessary to switch methods and use some kind of
grammaticality judgments after all, but it may also be possible to elicit these phe-
nomena in what we could call semi-authentic settings. For example, researchers
interested in motion verbs often do not have the means (or the patience) to collect
these verbs from general corpora, or corpora may not contain a sufficiently broad
range of descriptions of motion events with particular properties. Such descrip-
tions are sometimes elicited by asking speakers to describe movie snippets or
narrate a story from a picture book (cf. e.g. Berman & Slobin 1994; Strömqvist &
Verhoeven 2003). Human-machine interaction is sometimes elicited in so-called
“Wizard of Oz” experiments, where people believe they are talking to a robot, but
the robot is actually controlled by one of the researchers (cf. e.g. Georgila et al.
2010).
2.1.2 Representativeness
Put simply, a representative sample is a subset of a population that is identical
to the population as a whole with respect to the distribution of the phenomenon
under investigation. Thus, for a corpus (a sample of language use) to be represen-
tative of a particular language, the distribution of linguistic phenomena (words,
grammatical structures, etc.) would have to be identical to their distribution in
the language as a whole (or in the variety under investigation, see further below).
The way that corpus creators typically aim to achieve this is by including in
the corpus different manifestations of the language it is meant to represent in
proportions that reflect their incidence in the speech community in question.
Such a corpus is sometimes referred to as a balanced corpus.
Before we can discuss this in more detail, a terminological note is in order. You
may have noted that in the preceding discussion I have repeatedly used terms
like language variety, genre, register and style for different manifestations of lan-
guage. The precise usage of these terms notoriously varies across subdisciplines
of linguistics and among individual researchers, including the creators of corpora.
In this book, I use language variety to refer to any form of language deline-
able from other forms along cultural, linguistic or demographic criteria. In other
words, I use it as a superordinate term for text-linguistic terms like genre, register,
style, and medium as well as sociolinguistic terms like dialect, sociolect, etc. With
respect to what I am calling text-linguistic terms here, I follow the usage sug-
gestions synthesized by Lee (2001) and use genre for culturally defined and rec-
ognized varieties, register for varieties characterized by a particular “functional
configuration” (roughly, a bundle of linguistic features associated with a particu-
lar social function), style to refer to the degrees of formality (e.g. formal, informal,
colloquial, humorous, etc.), and medium to refer to the material manifestation (es-
sentially, spoken and written with subtypes of these). I use the term topic (area)
to refer to the content of texts or the discourse domain from which they come.
When a particular variety, defined by one or more of the dimensions just men-
tioned, is included in a given corpus, I refer to it as a text category of that corpus.
For a corpus to be representative (or “balanced”), its text categories should ac-
curately reflect both quantitatively and qualitatively the language varieties found
in the speech community whose language is represented in the corpus. However,
it is clear that this is an ideal that is impossible to achieve in reality for at least
four reasons.
First, for most potentially relevant parameters we simply do not know how
they are distributed in the population. We may know the distribution of some
of the most important demographic variables (e.g. sex, age, education), but we
simply do not know the overall distribution of spoken vs. written language, press
language vs. literary language, texts and conversations about particular topics,
etc.
Second, even if we did know, it is not clear that all manifestations of language
use shape and/or represent the linguistic system in the same way, simply because
we do not know how widely they are received. For example, emails may be re-
sponsible for a larger share of written language produced in a given time span
than news sites, but each email is typically read by a handful of people at the
most, while some news texts may be read by millions of people (and others not
at all).
Third, in a related point, speech communities are not homogeneous, so defin-
ing balance based on the proportion of language varieties in the speech com-
munity may not yield a realistic representation of the language even if it were
possible: every member of the speech community takes part in different com-
municative situations involving different language varieties. Some people read
more than others, and among these some read mostly newspapers, others mostly
novels; some people watch parliamentary debates on TV all day, others mainly
talk to customers in the bakery where they work. In other words, the proportion
of language varieties that speakers encounter varies, requiring a notion of bal-
ance based on the incidence of language varieties in the linguistic experience of a
typical speaker. This, in turn, requires a definition of what constitutes a typical
speaker in a given speech community. Such a definition may be possible, but to
my knowledge, does not exist so far.
Finally, there are language varieties that are impossible to sample for practical
reasons – for example, pillow talk (which speakers will be unwilling to share be-
cause they consider it too private), religious confessions or lawyer-client conver-
sations (which speakers are prevented from sharing because they are privileged),
and the planning of illegal activities (which speakers will want to keep secret in
order to avoid lengthy prison terms).
Representativeness or balancedness also plays a role if we do not aim at inves-
tigating a language as a whole, but are instead interested in a particular variety.
In this case, the corpus will be deliberately skewed so as to contain only samples
of the variety under investigation. However, if we plan to generalize our results
to that variety as a whole, the corpus must be representative of that variety. This
is sometimes overlooked. For example, there are studies of political rhetoric that
are based on speeches by just a handful of political leaders (cf., e.g., Charteris-
Black 2006; 2005) or studies of romantic metaphor based on a single Shakespeare
play (Barcelona Sánchez 1995). While such studies can be insightful with respect
to the language of the individuals included in the corpus, their results are unlikely
to be generalizable even within the narrow variety under investigation (political
speeches, romantic tragedies). Thus, they belong to the field of literary criticism
or stylistics much more clearly than to the field of linguistics.
Given the problems discussed above, it seems impossible to create a linguistic
corpus meeting the criterion of representativeness. In fact, while there are very
well-thought out approaches to approximating representativeness (cf., e.g., Biber
1993), it is fair to say that most corpus creators never really try. Let us see what
they do instead.
The first linguistic corpus in our sense was the Brown University Standard
Corpus of Present-Day American English (generally referred to as BROWN). It
is made up exclusively of edited prose published in the year 1961, so it clearly
does not attempt to be representative of American English in general, but only
of a particular kind of written American English in a narrow time span. This is
legitimate if the goal is to investigate that particular variety, but if the corpus
were meant to represent the standard language in general (which the corpus
creators explicitly deny), it would force us to accept a very narrow understanding
of standard.
The BROWN corpus consists of 500 samples of approximately 2000 words
each, drawn from a number of different language varieties, as shown in Table 2.1.
The first level of sampling is, roughly, by genre: there are 286 samples of non-
fiction, 126 samples of fiction and 88 samples of press texts. There is no reason
to believe that this corresponds proportionally to the total number of words pro-
duced in these language varieties in the USA in 1961. There is also no reason to
believe that the distribution corresponds proportionally to the incidence of these
language varieties in the linguistic experience of a typical speaker. This is true all
the more so when we take into account the second level of sampling within these
text categories.
The list of main categories and their subdivisions was drawn up at a con-
ference held at Brown University in February 1963. The participants in the
conference also independently gave their opinions as to the number of
samples there should be in each category. These figures were averaged to
obtain the preliminary set of figures used. A few changes were later made
researchers would not have used it as a basis for the substantial effort involved
in corpus creation.
More recent corpora at first glance appear to take a more principled approach
to representativeness or balance. Most importantly, they typically include not
just written language, but also spoken language. However, a closer look reveals
that this is the only real change. For example, the BNC Baby, a four-million-
word subset of the 100-million-word British National Corpus (BNC), includes
approximately one million words each from the text categories spoken conver-
sation, written academic language, written prose fiction and written newspaper
language (Table 2.2 shows the design in detail). Obviously, this design does not
correspond to the linguistic experience of a typical speaker, who is unlikely to
be exposed to academic writing and whose exposure to written language is un-
likely to be three times as large as their exposure to spoken language. The design
also does not correspond in any obvious way to the actual amount of language
produced on average in the four categories or the subcategories of academic and
newspaper language. Despite this, the BNC Baby, and the BNC itself, which is
even more drastically skewed towards edited written language, are extremely
successful corpora that are still widely used a quarter-century after the first re-
lease of the BNC.
Even what I would consider the most serious approach to date to creating a
balanced corpus design, the sampling schema of the International Corpus of En-
glish (ICE), is unlikely to be substantially closer to constituting a representative
sample of English language use (see Table 2.3).
It puts a stronger emphasis on spoken language – sixty percent of the corpus
consists of spoken text categories, although two thirds of these represent public
language use, while for most of us private language use is likely to account for more of our
linguistic experience. It also includes a much broader range of written text cate-
gories than previous corpora, including not just edited writing but also student
writing and letters.
Linguists would probably agree that the design of the ICE corpora is “more
representative” than that of the BNC Baby, which is in turn “more representative”
than that of the BROWN corpus and its offspring. However, in light of the above
discussion of representativeness, there is little reason to believe that any of these
corpora, or the many others that fall somewhere between BROWN and ICE, even
come close to approximating a random sample of (a given variety of) English in
terms of the text categories they contain and the proportions with which they
are represented.
This raises the question as to why corpus creators go to the trouble of attempt-
ing to create representative corpora at all, and why some corpora seem to be more
successful attempts than others.
It seems to me that, in fact, corpus creators are not striving for representa-
tiveness at all. The impossibility of this task is widely acknowledged in corpus
linguistics. Instead, they seem to interpret balance in terms of the related but
distinct property diversity. While corpora will always be skewed relative to the
overall population of texts and language varieties in a speech community, the
undesirable effects of this skew can be alleviated by including in the corpus as
broad a range of varieties as is realistic, either in general or in the context of a
given research project.
Unless language structure and language use are infinitely variable (which, at
a given point in time, they are clearly not), increasing the diversity of the sample
will increase representativeness even if the corpus design is not strictly propor-
tional to the incidence of text varieties or types of speakers found in the speech
community. It is important to acknowledge that this does not mean that diver-
sity and representativeness are the same thing, but given that representative cor-
pora are practically (and perhaps theoretically) impossible to create, diversity is
a workable and justifiable proxy.
2.1.3 Size
Like diversity, corpus size is also assumed, more or less explicitly, to contribute to
representativeness (e.g. McEnery & Wilson 2001: 78; Biber 2006: 251). The extent
of the relationship is difficult to assess. Obviously, sample size does correlate with
representativeness to some extent: if our corpus were to contain the totality of all
manifestations of a language (or variety of a language), it would necessarily be
representative, and this representativeness would not drop to zero immediately
if we were to decrease the sample size. However, it would drop rather rapidly – if
we exclude one percent of the totality of all texts produced in a given language,
entire language varieties may already be missing. For example, the Library of
Congress holds around 38 million print materials, roughly half of them in English.
A search for cooking in the main catalogue yields 7638 items that presumably in-
clude all cookbooks in the collection. This means that cookbooks make up no
more than 0.04 percent of printed English (7638/19000000 = 0.000402). Thus, they
could quickly be lost in their entirety when the sample size drops substantially
below the size of the population as a whole. And when a genre (or a language
variety in general) goes missing from our sample, at least some linguistic phe-
nomena will disappear along with it – such as the expression [bring NPliquid
[PP to the/a boil]], which, as discussed in Chapter 1, is exclusive to cookbooks.¹
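The arithmetic can be made explicit with a few lines of Python. The counts are the ones just cited; the last two lines, which treat a BROWN-style design of 500 samples as independent draws in proportion to this share, are a simplifying assumption added here for illustration:

english_items = 19_000_000   # LoC print materials in English (half of 38 million)
cookbooks = 7_638            # items returned by the search for 'cooking'
share = cookbooks / english_items
print(f"share of cookbooks: {share:.6f} ({share:.4%})")   # 0.000402 (0.0402%)

samples = 500                # a BROWN-style corpus design
print(f"expected cookbook samples: {samples * share:.2f}")            # ~0.20
print(f"chance of no cookbook sample: {(1 - share) ** samples:.0%}")  # ~82%

In other words, a corpus of BROWN-like size drawn strictly in proportion to the population would most likely contain no cookbook sample at all.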
In the age of the World Wide Web, corpus size is practically limited only by
technical considerations. For example, the English data in the Google N-Grams
data base are derived from a trillion-word corpus (cf. Franz & Brants 2006). In
quantitative terms, this represents many times the linguistic input that a single
person would receive in their lifetime: an average reader can read between 200
and 250 words per minute, so it would take them between 7500 and 9500 years
of non-stop reading to get through the entire corpus. However, even this corpus
contains only a tiny fraction of written English, let alone of English as a whole.
Even more crucially, in terms of language varieties, it is limited to a narrow sec-
tion of published written English and does not capture the input of any actual
speaker of English at all.
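A quick calculation, using the reading speeds just cited, confirms this estimate:

corpus_size = 1_000_000_000_000            # one trillion words
for wpm in (200, 250):                     # average reading speeds cited above
    years = corpus_size / wpm / 60 / 24 / 365
    print(f"{wpm} words/minute: about {years:,.0f} years of non-stop reading")
# 200 words/minute: about 9,513 years
# 250 words/minute: about 7,610 years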
There are several projects gathering very large corpora on a broader range
of web-accessible text. These corpora are certainly impressive in terms of their
size, even though they typically contain mere billions rather than trillions of
¹ The expression actually occurs once in the BROWN corpus, which includes one 2000-word
sample from a cookbook, over-representing this genre by a factor of five, but not at all in the
LOB corpus. Thus, someone investigating the LOB corpus might not include this expression
in their description of English at all, and someone comparing the two corpora would wrongly
conclude that it is limited to American English.
words. However, their size is the only argument in their favor, as their creators
and their users must not only give up any pretense that they are dealing with a
representative corpus, but must contend with a situation in which they have no
idea what texts and language varieties the corpus contains and how much of it
was produced by speakers of English (or by human beings rather than bots).
These corpora certainly have their uses, but they push the definition of a lin-
guistic corpus in the sense discussed above to its limit. To what extent they
are representative cannot be determined. On the one hand, corpus size correlates
with representativeness only to the extent that we take corpus diversity into ac-
count. On the other hand, assuming (as we did above) that language structure
and use are not infinitely variable, size will correlate with the representativeness
of a corpus at least to some extent with respect to particular linguistic phenom-
ena (especially frequent phenomena, such as general vocabulary, and/or highly
productive processes such as derivational morphology and major grammatical
structures).
There is no principled answer to the question “How large must a linguistic
corpus be?”, except, perhaps, an honest “It is impossible to say” (Renouf 1987:
130). However, there are two practical answers. The more modest answer is that it
must be large enough to contain a sample of instances of the phenomenon under
investigation that is large enough for analysis (we will discuss what this means
in Chapters 5 and 6). The less modest answer is that it must be large enough
to contain sufficiently large samples of every grammatical structure, vocabulary
item, etc. Given that an ever increasing number of texts from a broad range of
language varieties is becoming accessible via the web, the second answer may
not actually be as immodest as it sounds.
Corpora that at least make an honest attempt at diversity currently
range from one million words (e.g. the ICE corpora mentioned above) to about half
a billion (e.g. the COCA mentioned in the preceding chapter). Looking at the
published corpus-linguistic literature, my impression is that for most linguistic
phenomena that researchers are likely to want to investigate, these corpus sizes
seem sufficient. Let us take this broad range as characterizing a linguistic corpus
for practical purposes.
2.1.4 Annotations
Minimally, a linguistic corpus consists simply of a large, diverse collection of
files containing authentic language samples as raw text, but more often than not,
corpus creators add one or more of three broad types of annotation:
1. information about paralinguistic features of the text such as font style, size
and color, capitalization, special characters, etc. (for written texts), and in-
tonation, overlapping speech, length of pauses, etc. (for spoken text);

2. information about linguistic features, such as the word class of each word,
its lemma, or the syntactic structure of the sentence;

3. metadata, such as the language variety a text represents, its origin, and
demographic properties of the speaker(s) or writer(s).
In this section, we will illustrate these types of annotation and discuss their
practical implications as well as their relation to the criterion of authenticity,
beginning with paralinguistic features, whose omission was already hinted at as
a problem for authenticity in Section 2.1.1 above.
For example, Figure 2.1 shows a passage of transcribed speech from the Santa
Barbara Corpus of Spoken American English (SBCSAE).
The speech is transcribed more or less in standard orthography, with some par-
alinguistic features indicated by various means. For example, the beginning of a
new intonation unit is indicated by a line break.
Like the SBCSAE, the LLC (the London-Lund Corpus) also indicates overlapping speech (enclosing it in plus
signs as in lines 1430 and 1440 or in asterisks, as in lines 1520 and 1530), pauses (a
period for a “brief” pause, single hyphen for a pause the length of one “stress unit”
and two hyphens for longer pauses), and intonation units, called “tone units” by
the corpus creators (with a caret marking the onset and the number sign marking
the end).
In addition, however, intonation contours are recorded in detail preceding the
vowel of the prosodically most prominent syllable using the equals sign and right-
ward and leftward slashes: = stands for “level tone”, / for “rise”, \ for “fall”, \/ for
“(rise-)fall-rise” and /\ for “(fall-)rise-fall”. A colon indicates that the following
syllable is higher than the preceding one, an exclamation mark indicates that it
is very high. Occasionally, the LLC uses phonetic transcription to indicate an un-
expected pronunciation or vocalizations that have no standard spelling (like the
[@:] in line 1410 which stands for a long schwa).
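As an illustration of what such conventions involve in practice, the following sketch strips the LLC markup described above from a transcribed line. The example line is invented, and only the symbols mentioned in this section are handled – the full LLC conventions are considerably richer:

import re

def strip_llc_prosody(line: str) -> str:
    line = re.sub(r"[\^#]", "", line)                  # tone-unit onset/end
    line = re.sub(r"\\/|/\\|[=/\\:!]", "", line)       # tone and pitch markers
    line = re.sub(r"[*+]", "", line)                   # overlap markers
    line = re.sub(r"\s(?:\.|-{1,2})(?=\s)", "", line)  # pause symbols
    return re.sub(r"\s{2,}", " ", line).strip()

print(strip_llc_prosody("^and they c\\ome . to a :st/op#"))
# -> and they come to a stop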
The two corpora differ in their use of symbols to annotate certain features, for
example:
• the LLC indicates overlap by asterisks and plus signs, the SBCSAE by
square brackets, which, in turn, are used in the LLC to mark “subordinate
tone units” or phonetic transcriptions;
• the LLC uses periods and hyphens to indicate pauses, the SBCSAE uses
only periods, with hyphens used to indicate that an intonation unit is trun-
cated;
• intonation units are enclosed by the symbols ^ and # in the LLC and by
line breaks in the SBCSAE;
Thus, even where the two corpora annotate the same features of speech in the
transcriptions, they code these features differently.
Such differences are important to understand for anyone working with
these corpora, as they will influence the way in which we have to search the cor-
pus (see further Section 4.1.1 below) – before working with a corpus, one should
always read the full manual. More importantly, such differences reflect differ-
ent, sometimes incompatible theories of what features of spoken language are
relevant, and at what level of detail. The SBCSAE and the LLC cannot easily be
combined into a larger corpus, since they mark prosodic features at very different
levels of detail. The LLC gives detailed information about pitch and intonation
contours absent from the SBCSAE; in contrast, the SBCSAE contains information
about volume and audible breathing that is absent from the LLC.
Written language, too, has paralinguistic features that are potentially relevant
to linguistic research. Consider the excerpt from the LOB corpus in Figure 2.3.
The word anything in line 100 was set in italics in the original text; this is in-
dicated by the sequences *1, which stands for “begin italic” and *0, which stands
for “begin lower case (roman)” and thus ends the stretch set in italics. The origi-
nal text also contained typographic quotes, which are not contained in the ASCII
encoding used for the corpus. Thus, the sequence *" in line 100 stands for “begin
double quotes” and the sequence **" in line 101 stands for “end double quotes”.
ASCII also does not contain the dash symbol, so the sequence *- indicates a dash.
Finally, paragraph boundaries are indicated by a sequence of three blank spaces
followed by the pipe symbol | (as in lines 96 and 99), and more complex text
features like indentation are represented by descriptive tags, enclosed in square
brackets preceded by two asterisks (as in lines 98 and 102, which signal the begin-
ning and end of an indented passage).
Additionally, the corpus contains markup pertaining not to the appearance of
the text but to its linguistic properties. For example, the word Mme in line 94 is
an abbreviation, indicated in the corpus by the sequence \0 preceding it. This
may not seem to contribute important information in this particular case, but it
is useful where abbreviations end in a period (as they often do), because it serves
to disambiguate such periods from sentence-final ones. Sentence boundaries are
also marked explicitly: each sentence begins with a caret symbol ^.
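The following sketch decodes the LOB codes just described into reader-friendly text. The example line is invented, and only the codes mentioned above are handled – the actual LOB documentation defines many more:

import re

LOB_CODES = [
    (r'\*\*"', "”"),  # end double quotes
    (r'\*"',   "“"),  # begin double quotes
    (r"\*1",   ""),   # begin italics (simply dropped here)
    (r"\*0",   ""),   # begin lower case (roman)
    (r"\*-",   "–"),  # dash
    (r"\\0",   ""),   # abbreviation marker
    (r"\^",    ""),   # sentence boundary marker
]

def decode_lob(line: str) -> str:
    for pattern, replacement in LOB_CODES:
        line = re.sub(pattern, replacement, line)
    return re.sub(r"\s{2,}", " ", line).strip()

print(decode_lob('^\\0Mme Bovary never said *"*1anything*0**" *- ever'))
# -> Mme Bovary never said “anything” – ever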
Other corpora (and other versions of the LOB corpus) contain more detailed
linguistic markup. Most commonly, they contain information about the word
class of each word, represented in the form of so-called “part-of-speech (or POS)
tags”. Figure 2.4 shows a passage from the BROWN corpus, where these POS tags
take the form of sequences of uppercase letters and symbols, attached to the end
of each word by an underscore (for example, _AT for articles, _NN for singular
nouns, _* for the negative particle not, etc.). Note that sentence boundaries are
also marked, in this case by a pipe symbol (used for paragraph boundaries in the
LOB) followed by the sequence SN and an id number.
Other linguistic features that are sometimes recorded in (written and spoken)
corpora are the lemmas of each word and (less often) the syntactic structure of
the sentences (corpora with syntactic annotation are sometimes referred to as
treebanks). When more than one variable is annotated in a corpus, the corpus
is typically structured as shown in Figure 2.5, with one word per line and dif-
ferent columns for the different types of annotation (more recently, the markup
language XML is used in addition to or instead of this format).
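Working with such tagged text typically begins by splitting it back into word–tag pairs, as in the following sketch of the BROWN-style format. The example sentence is invented; _AT, _NN and _* are the tags mentioned above, _BEZ and _JJ are further tags from the same tagset:

tagged = "the_AT corpus_NN is_BEZ not_* small_JJ"

# split from the right, so that only the final underscore separates the tag
pairs = [tuple(token.rsplit("_", 1)) for token in tagged.split()]
print(pairs)
# -> [('the', 'AT'), ('corpus', 'NN'), ('is', 'BEZ'), ('not', '*'), ('small', 'JJ')]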
Annotations of paralinguistic or linguistic features in a corpus impact its au-
thenticity in complex ways.
On the one hand, including information concerning paralinguistic features
makes a corpus more authentic than it would be if this information was simply
discarded. After all, this information represents aspects of the original speech
events from which the corpus is derived and is necessary to ensure a reconcep-
tualization of the data that approximates these events as closely as possible.
On the other hand, this information is necessarily biased by the interests and
theoretical perspectives of the corpus creators. By splitting the spoken corpora
into intonation units, for example, the creators assume that there are such units
and that they are a relevant category in the study of spoken language. They
will also identify these units based on particular theoretical and methodological
assumptions, which means that different creators will come to different decisions.
The same is true of other aspects of spoken and written language. Researchers
using these corpora are then forced to accept the assumptions and decisions of
the corpus creators (or they must try to work around them).
This problem is even more obvious in the case of linguistic annotation. There
may be disagreements as to how and at what level of detail intonation should
be described, for example, but it is relatively uncontroversial that it consists of
changes in pitch. In contrast, it is highly controversial how many parts of speech
there are and how they should be identified, or how the structure even of simple
sentences is best described and represented. Accepting (or working around) the
corpus creators’ assumptions and decisions concerning POS tags and annotations
of syntactic structure may seriously limit or distort a researcher’s use of corpora.
Also, while it is clear that speakers are at some level aware of intonation,
pauses, indentation, roman vs. italic fonts, etc., it is much less clear that they
are aware of parts of speech and grammatical structures. Thus, the former play a
legitimate role in reconceptualizing authentic speech situations, while the latter
arguably do not. Note also that while linguistic markup is often a precondition
for an efficient retrieval of data, errors in markup may hide certain phenomena
systematically (see further Chapter 4, especially Section 4.1.1).
Finally, corpora typically give some information about the texts they contain
– so-called metadata. These may be recorded in a manual, a separate computer-
readable document or directly in the corpus files to which they pertain. Typical
metadata are language variety (in terms of genre, medium, topic area, etc., as de-
scribed in Section 2.1.2 above), the origin of the text (for example, speaker/writer,
year of production and/or publication), and demographic information about the
speaker/writer (sex, age, social class, geographical origin, sometimes also level
of education, profession, religious affiliation, etc.). Metadata may also pertain to
the structure of the corpus itself, like the file names, line numbers and sentence
or utterance ids in the examples cited above.
Metadata are also crucial in recontextualizing corpus data and in designing cer-
tain kinds of research projects, but they, too, depend on assumptions and choices
made by corpus creators and should not be uncritically accepted by researchers
using a given corpus.
2.2 Towards a definition of corpus linguistics
This definition is more specific with respect to the data used in corpus lin-
guistics and will exclude certain variants of discourse analysis, text linguistics,
and other fields working with authentic language data (whether such a strict ex-
clusion is a good thing is a question we will briefly return to at the end of this
chapter).
However, the definition says nothing about the way in which these data are
to be investigated. Crucially, it would cover a procedure in which the linguistic
corpus essentially serves as a giant citation file that the researcher scours, more
or less systematically, for examples of a given linguistic phenomenon.
This procedure of basing linguistic analyses on citations has a long tradition
in descriptive English linguistics, going back at least to Otto Jespersen’s seven-
volume Modern English Grammar on Historical Principles (Jespersen 1909). It
played a particularly important role in the context of dictionary making. The
Oxford English Dictionary (Simpson & Weiner 1989) is the first and probably still
the most famous example of a citation-based dictionary of English. For the first
two editions, it relied on citations sent in by volunteers (cf. Winchester 2003 for
a popular account). In its current third edition, its editors actively search corpora
and other text collections (including the Google Books index) for citations.
A fairly stringent implementation of this method is described in the following
passage from the FAQ web page of the Merriam-Webster Online Dictionary:
The use of corpus examples for illustrative purposes has become somewhat
fashionable among researchers who largely depend on introspective “data” oth-
erwise. While it is probably an improvement over the practice of simply invent-
ing data, it has a fundamental weakness: it does not ensure that the data selected
by the researcher are actually representative of the phenomenon under investiga-
tion. In other words, corpus-illustrated linguistics simply replaces introspectively
invented data with introspectively selected data and thus inherits the fallibility
of the introspective method discussed in the previous chapter.
Since overcoming the fallibility of introspective data is one of the central mo-
tivations for using corpora in the first place, the analysis of a given phenomenon
must not be based on a haphazard sample of instances that the researcher hap-
pened to notice while reading or, even worse, by searching the corpus for specific
examples. The whole point of constructing corpora as representative samples of
a language or variety is that they will yield representative samples of particular
linguistic phenomena in that language or variety. The best way to achieve this
is to draw a complete sample of the phenomenon in question, i.e. to retrieve all
instances of it from the corpus (issues of retrieval are discussed in detail in Chap-
ter 4). These instances must then be analyzed systematically, i.e., according to a
single set of criteria. This leads to the following definition (cf. Biber & Reppen
2015: 2, Cook 2003: 78):
For example, one textbook begins with the sentence “The main part of this book consists of a series of case studies
which involve the use of corpora and corpus analysis technology” (Partington
1998: 1), and another observes that “[c]orpus linguistics is [...] now inextricably
linked to the computer” (Kennedy 1998: 5); a third textbook explicitly includes the
“extensive use of computers for analysis, using both automatic and interactive
techniques” as one of four defining criteria of corpus linguistics (Biber et al. 1998:
4). This perspective is summarized in the following definition:
However, the usefulness of this approach is limited. It is true that there are
scientific disciplines that are so heavily dependent upon a particular technol-
ogy that they could not exist without it – for example, radio astronomy (which
requires a radio telescope) or radiology (which requires an x-ray machine). How-
ever, even in such cases we would hardly want to claim that the technology in
question can serve as a defining criterion: one can use the same technology in
ways that do not qualify as belonging to the respective discipline. For example,
a spy might use a radio telescope to intercept enemy transmissions, and an engi-
neer may use an x-ray machine to detect fractures in a steel girder, but that does
not make the spy a radio astronomer or the engineer a radiologist.
Clearly, even a discipline that relies crucially on a particular technology cannot
be defined by the technology itself but by the uses to which it puts that technol-
ogy. If anything, we must thus replace the reference to corpus analysis software
by a reference to what that software typically does.
Software packages for corpus analysis vary in capability, but they all allow
us to search a corpus for a particular (set of) linguistic expression(s) (typically
word forms), by formulating a query using query languages of various degrees
of abstractness and complexity, and they all display the results (or hits) of that
query. Specifically, most of these software packages have the following functions:
1. they produce KWIC (Key Word In Context) concordances, i.e. they display
the hits for our query in their immediate context, defined in terms of a
particular number of words or characters to the left and the right (see Fig-
ure 2.6 for a KWIC concordance of the noun time) – they are often referred
to as concordancers because of this functionality;
2. they identify collocates of a given expression, i.e. word forms that occur in
a certain position relative to the hits; these words are typically listed in the
order of frequency with which they occur in the position in question (see
Table 2.4 for a list of collocates of the noun time in a span of three words
to the left and right);
3. they produce frequency lists, i.e. lists of all character strings in a given
corpus listed in the order of their frequency of occurrence (see Table 2.5
for the forty most frequent strings (word forms and punctuation marks) in
the BNC Baby).
Note that concordancers differ with respect to their ability to deal with anno-
tation – there are few standards in annotation, especially in older corpora, and
even the emerging XML-based standards or widespread conventions like the
column format shown in Figure 2.5 above are not implemented in many of the
widely available software packages.
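The logic behind these three functions is simple enough to sketch in a few lines of Python. Real concordancers add query languages, sorting options and annotation handling; this toy version operates on a plain list of word forms:

from collections import Counter

def kwic(tokens, node, span=4):
    """KWIC lines: each hit with `span` words of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>30}  [{token}]  {right}")
    return lines

def collocates(tokens, node, position):
    """Word forms at `position` relative to each hit (-1 = L1, 1 = R1, ...)."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == node and 0 <= i + position < len(tokens):
            counts[tokens[i + position].lower()] += 1
    return counts

def frequency_list(tokens):
    """Frequency list of all word forms (and punctuation) in the corpus."""
    return Counter(token.lower() for token in tokens)

tokens = "The first time I saw it , it was the last time .".split()
for line in kwic(tokens, "time"):
    print(line)
print(collocates(tokens, "time", -1))    # Counter({'first': 1, 'last': 1})
print(frequency_list(tokens).most_common(3))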
Let us briefly look at why the three functions listed above might be useful in
corpus linguistic research (we will discuss them in more detail in later chapters).
A concordance provides a quick overview of the typical usage of a particular
(set of) word forms or more complex linguistic expressions. The occurrences are
presented in random order in Figure 2.6, but corpus-linguistic software packages
typically allow the researcher to sort concordances in various ways, for example,
by the first word to the left or to the right; this will give us an even better idea
as to what the typical usage contexts for the expression under investigation are.
Collocate lists are a useful way of summarizing the contexts of a linguistic ex-
pression. For example, the collocate list in the column marked L1 in Table 2.4 will
show us at a glance what words typically directly precede the string time. The
determiners the and this are presumably due to the fact that we are dealing with
a noun, but the adjectives first, same, long, some, last, every and next are related
specifically to the meaning of the noun time; the high frequency of the prepo-
sitions at, by, for and in in the column marked L2 (two words to the left of the
node word time) not only gives us additional information about the meaning and
phraseology associated with the word time, it also tells us that time frequently
occurs in prepositional phrases in general.
Finally, frequency lists provide useful information about the distribution of
word forms (and, in the case of written language, punctuation marks) in a partic-
ular corpus. This can be useful, for example, in comparing the structural proper-
ties or typical contents of different language varieties (see further Chapter 10). It
is also useful in assessing which collocates of a particular word are frequent only
because they are frequent in the corpus in general, and which collocates actually
tell us something interesting about a particular word.
Table 2.4: Collocates of time in a span of three words to the left and to the right

 L3            L2            L1             R1            R2            R3
 for    335    the    851    the   1032     .      950    the    427    .      294
 .      322    at     572    this   380     ,      661    ?      168    ,      255
 at     292    a      361    first  320     to     351    i      141    the    212
 ,      227    all    226    of     242     of     258    .      137    to     120
 a      170    .      196    same   240     and    223    and    118    a      118
 the    130    by     192    a      239     for    190    you    104    it     112
 it     121    ,      162    long   224     in     184    it     102    was    107
 to     100    of     154    some   200     i      177    a       96    and     92
 and     89    for    148    last   180     he     136    he      92    i       86
 in      89    it     117    every  134     ?      122    ,       91    you     76
 was     85    in      93    in     113     you    120    was     87    in      75
 is      78    ’s      68    that   111     when   118    had     80    ?       71
 ’s      68    and     65    what   108     the     90    but     70    of      64
 have    59                  next    83     we      88    to      69    ’s      59
 that    58                  any     72     is      85    ?       64    is      59
 had     55                  one     65     as      78    she     58    ?       58
 ?       52                  ’s      64     it      70    they    57    he      58
                             no      63     they    70    that    56    had     53
                             from    57     she     69    in      55
                                            that    64
                                            was     50
Table 2.5: The forty most frequent strings in the BNC Baby
Figure 2.6: KWIC concordance (random sample) of the noun time (BNC
Baby)
Note, for example, that the collocate frequency lists on the right side of the
word time are more similar to the general frequency list than those on the left side,
suggesting that the noun time has a stronger influence on the words preceding
it than on the words following it (see further Chapter 7).
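The comparison can be made concrete by relating a collocate’s frequency next to the node word to its overall frequency in the corpus. In the following sketch, the L1 counts are taken from Table 2.4, while the overall frequencies and corpus figures are invented for illustration; Chapter 7 introduces proper association measures for this purpose:

corpus_size = 4_000_000                      # assumed total size of the corpus
l1_slots = 20_000                            # assumed number of hits for 'time'

overall = {"the": 240_000, "same": 1_800}    # invented overall frequencies
l1 = {"the": 1_032, "same": 240}             # L1 counts from Table 2.4

for word in l1:
    expected = l1_slots * overall[word] / corpus_size
    print(f"{word}: observed {l1[word]}, expected {expected:.0f}, "
          f"ratio {l1[word] / expected:.1f}")
# the: observed 1032, expected 1200, ratio 0.9
# same: observed 240, expected 9, ratio 26.7

The determiner the, despite its high raw frequency in the L1 slot, occurs there no more often than its overall frequency would predict, while same is heavily over-represented.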
Given the widespread implementation of these three techniques, they are ob-
viously central to corpus linguistics research, so we might amend the definition
above as follows (a similar definition is implied by Kennedy 1998: 244–258):
Two problems remain with this definition. The first problem is that the re-
quirements of systematicity and completeness that were introduced in the sec-
ond definition are missing. This can be remedied by combining the second and
third definition as follows:
The second problem is that including a list of specific techniques in the defi-
nition of a discipline seems undesirable, no matter how central these techniques
are. First, such a list will necessarily be finite and will thus limit the imagination
of future researchers. Second, and more importantly, it presents the techniques
in question as an arbitrary set, while it would clearly be desirable to characterize
them in terms that capture the reasons for their central role in the discipline.
What concordances, collocate lists and frequency lists have in common is that
they are all ways of studying the distribution of linguistic elements in a corpus.
Thus, we could define corpus linguistics as follows:
On the one hand, this definition subsumes the previous two definitions: If we
assume that corpus linguistics is essentially the study of the distribution of lin-
guistic phenomena in a linguistic corpus, we immediately understand the central
role of the techniques described above: (i) KWIC concordances are a way of dis-
playing the distribution of an expression across different syntagmatic contexts;
(ii) collocation tables summarize the distribution of lexical items with respect to
other lexical items in quantitative terms, and (iii) frequency lists summarize the
overall quantitative distribution of lexical items in a given corpus.
On the other hand, the definition is not limited to these techniques but can be
applied open-endedly at all levels of language and to all kinds of distributions.
This definition is close to the understanding of corpus linguistics that this book
will advance, but it must still be narrowed down somewhat.
First, it must not be misunderstood to suggest that studying the distribution of
linguistic phenomena is an end in itself in corpus linguistics. Fillmore (1992: 35)
presents a caricature of a corpus linguist who is “busy determining the relative
frequencies of the eleven parts of speech as the first word of a sentence versus as
the second word of a sentence”. Of course, there is nothing intrinsically wrong
with such a research project: when large electronically readable corpora and the
computing power to access them became available in the late 1950s, linguists
became aware of a vast range of stochastic regularities of natural languages that
had previously been difficult or impossible to detect and that are certainly worthy
of study. Narrowing our definition to this stochastic perspective would give us
the following:
This is a fairly accurate definition, in the sense that it describes the actual
practice of a large body of corpus-linguistic research in a way that distinguishes
it from similar kinds of research. It is not suitable as a final characterization of
corpus linguistics yet, as the phrase “distribution of linguistic phenomena” is still
somewhat vague. The next section will explicate this phrase.
2.3 Corpus linguistics as a scientific method
The remainder of Part I of this book will expand this definition into a guideline
for conducting corpus linguistic research. The following is a brief overview.
Any scientific research project begins, obviously, with the choice of an object
of research – some fragment of reality that we wish to investigate – and a re-
search question – something about this fragment of reality that we would like to
know.
Since reality does not come pre-packaged and labeled, the first step in formu-
lating the research question involves describing the object of research in terms
of constructs – theoretical concepts corresponding to those aspects of reality that
we plan to include. These concepts will be provided in part by the state of the art
in our field of research, including, but not limited to, the specific model(s) that
we may choose to work with. More often than not, however, our models will
not provide fully explicated constructs for the description of every aspect of the
object of research. In this case, we must provide such explications.
In corpus linguistics, the object of research will usually involve one or more
aspects of language structure or language use, but it may also involve aspects
of our psychological, social or cultural reality that are merely reflected in lan-
guage (a point we will return to in some of the case studies presented in Part
II of this book). In addition, the object of research may involve one or more as-
pects of extralinguistic reality, most importantly demographic properties of the
speaker(s) such as geographical location, sex, age, ethnicity, social status, finan-
cial background, education, knowledge of other languages, etc. None of these
phenomena are difficult to characterize meaningfully as long as we are doing so
in very broad terms, but none of them have generally agreed-upon definitions
either, and no single theoretical framework will provide a coherent model en-
compassing all of them. It is up to the researcher to provide such definitions and
to justify them in the context of a specific research question.
Once the object of research is properly delineated and explicated, the second
step is to state our research question in terms of our constructs. This always
involves a relationship between at least two theoretical constructs: one construct
whose properties we want to explain (the explicandum), and one construct that
we believe might provide the explanation (the explicans). In corpus linguistics,
the explicandum is typically some aspect of language structure and/or use, while
the explicans may be some other aspect of language structure or use (such as the
presence or absence of a particular linguistic element, a particular position in a
discourse, etc.), or some language external factor (such as the speaker’s sex or
age, the relationship between speaker and hearer, etc.).
In empirical research, the explicandum is referred to as the dependent variable
and the explicans as the independent variable – note that these terms are actually
quite transparent: if we want to explain X in terms of Y, then X must be (po-
tentially) dependent on Y. Each of the variables must have at least two possible
values. In the simplest case, these values could be the presence vs. the absence of
instances of the construct, in more complex cases, the values would correspond
to different (classes of) instances of the construct. In the example above, the de-
pendent variable is Word for the Forward-Facing Window of a Car with
the values windshield and windscreen; the independent variable is Variety of
English with the values british and american (from now on, variables will be
typographically represented by small caps with capitalization, and their values
by all small caps).⁵ The formulation of research questions will be
discussed in detail in Chapter 3, Section 3.1.
The third step in a research project is to derive a testable prediction from the
hypothesis. Crucially, this involves defining our constructs in a way that allows
us to measure them, i.e., to identify them reliably in our data. This process, which
is referred to as operationalization, is far from trivial, since even well-defined and
agreed-upon aspects of language structure or use cannot be straightforwardly
read off the data. We will return to operationalization in detail in Chapter 3, Sec-
tion 3.2.
The fourth step consists in collecting data – in the case of corpus linguistics,
in retrieving them from a corpus. Thus, we must formulate one or more queries
that will retrieve all (or a representative sample of) cases of the phenomenon
under investigation. Once retrieved, the data must, in a fifth step, be categorized
according to the values of the variables involved. In the context of corpus linguis-
tics, this means annotating them according to an annotation scheme containing
the operational definitions. Retrieval and annotation are discussed in detail in
Chapter 4.
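As a preview of these two steps, the following sketch retrieves and annotates hits for the windshield/windscreen example. The two-text “corpus”, the query and the annotation scheme are stand-ins invented for illustration:

import re

corpus = [
    ("He wiped the windscreen before setting off.", {"variety": "british"}),
    ("A crack ran across the windshield.", {"variety": "american"}),
]

query = re.compile(r"\bwind(?:screen|shield)s?\b", re.IGNORECASE)

annotated = []
for text, metadata in corpus:
    for match in query.finditer(text):
        annotated.append({
            "hit": match.group(0).lower(),    # value of the dependent variable
            "variety": metadata["variety"],   # value of the independent variable
        })

print(annotated)
# -> [{'hit': 'windscreen', 'variety': 'british'},
#     {'hit': 'windshield', 'variety': 'american'}]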
The sixth and final step of a research project consists in evaluating the data
with respect to our prediction. Note that in the simple example presented here,
the conditional distribution is a matter of all-or-nothing: all instances of wind-
screen occur in the British part of the corpus and all instances of windshield occur
in the American part. There is a categorical difference between the two words
with respect to the conditions under which they occur (at least in our corpora). In
⁵ Some additional examples may help to grasp the notion of variables and values. For example,
the variable Interruption has two values, presence (an interruption occurs) vs. absence
(no interruption occurs). The variable Sex, in lay terms, also has two values (male vs. female).
In contrast, the value of the variable Gender is language dependent: in French or Spanish it
has two values (masculine vs. feminine), in German or Russian it has three (masculine vs.
feminine vs. neuter), and there are languages with even more values for this variable. The
variable Voice has two to four values in English, depending on the way that this construct
is defined in a given model (most models of English would see active and passive as values
of the variable Voice, some models would also include the middle construction, and a few
models might even include the antipassive).
contrast, the two words do not differ at all with respect to the grammatical con-
texts in which they occur. The evaluation of such cases is discussed in Chapter 3,
Section 3.1.2.
Categorical distributions are only the limiting case of a quantitative distribu-
tion: two (or more) words (or other linguistic phenomena) may also show relative
differences in their distribution across conditions. For example, the words rail-
way and railroad show clear differences in their distribution across the combined
corpus used above: railway occurs 118 times in the British part compared to only
16 times in the American part, while railroad occurs 96 times in the American
part but only 3 times in the British part. Intuitively, this tells us something very
similar about the words in question: they also seem to be dialectal variants, even
though the difference between the dialects is gradual rather than absolute in
this case. Given that very little is absolute when it comes to human behavior,
it will come as no surprise that gradual differences in distribution will turn out
to be much more common in language (and thus, more important to linguistic
research) than absolute differences. Chapters 5 and 6 will discuss in detail how
such cases can be dealt with. For now, note that both categorical and relative
conditional distributions are covered by the final version of our definition.
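Tabulated as a conditional distribution with row percentages, the counts just cited look as follows (a sketch only; the statistical evaluation of such tables is the topic of Chapters 5 and 6):

counts = {
    "british":  {"railway": 118, "railroad": 3},
    "american": {"railway": 16,  "railroad": 96},
}

for variety, row in counts.items():
    total = sum(row.values())
    cells = ", ".join(f"{word}: {n} ({n / total:.0%})" for word, n in row.items())
    print(f"{variety}: {cells}")
# british: railway: 118 (98%), railroad: 3 (2%)
# american: railway: 16 (14%), railroad: 96 (86%)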
Note also that many of the aspects that were proposed as defining criteria in
previous definitions need no longer be included once we adopt our final version,
since they are presupposed by this definition: conditional distributions (whether
they differ in relative or absolute terms) are only meaningful if they are based
on the complete data base (hence the criterion of completeness); conditional dis-
tributions can only be assessed if the data are carefully categorized according to
the relevant conditions (hence the criterion of systematicity); distributions (espe-
cially relative ones) are more reliable if they are based on a large data set (hence
the preference for large electronically stored corpora that are accessed via ap-
propriate software applications); and often – but not always – the standard pro-
cedures for accessing corpora (concordances, collocate lists, frequency lists) are a
natural step towards identifying the relevant distributions in the first place. How-
ever, these preconditions are not self-serving, and hence they cannot themselves
form the defining basis of a methodological framework: they are only motivated
by the definition just given.
Finally, note that our final definition does distinguish corpus linguistics from
other kinds of observational methods, such as text linguistics, discourse analysis,
variationist sociolinguistics, etc., but it does so in a way that allows us to rec-
ognize the overlaps between these methods. This is highly desirable given that
these methods are fundamentally based on the same assumptions as to how lan-
guage can and should be studied (namely on the basis of authentic instances of
language use), and that they are likely to face similar methodological problems.
3 Corpus linguistics as a scientific method
At the end of the previous chapter, we defined corpus linguistics as “the investi-
gation of linguistic research questions that have been framed in terms of the con-
ditional distribution of linguistic phenomena in a linguistic corpus” and briefly
discussed the individual steps necessary to conduct research on the basis of this
definition.
In this chapter, we will look in more detail at the logic and practice of formulat-
ing and testing research questions (Sections 3.1.1 and 3.1.2). We will then discuss
the notion of operationalization in some detail (Section 3.2) before closing with
some general remarks about the place of hypothesis testing in scientific research
practice (Section 3.3).
3.1 The scientific hypothesis
time frame (think of the current buzz word “big data”). Such massive amounts of
data allow us to take an extremely inductive approach – essentially just asking
“What relationships exist in my data?” – and still arrive at reliable generalizations.
Of course, matters are somewhat more complex, since, as discussed at the end
of the previous chapter, theoretical constructs cannot directly be read off our
data. But the fact remains that, used in the right way, inductive research designs
have their applications. In corpus linguistics, large amounts of data have been
available for some time (as mentioned in the previous chapter, the size even of
corpora striving for some kind of balance is approaching half-a-billion words),
and inductive approaches are used routinely and with insightful consequences
(Sinclair 1991 is an excellent example).
The second way of stating research questions entails a more focused way of
approaching our data. We state our hypothesis before looking at any data, and
then limit our observations just to those that will help us determine the truth of
this hypothesis (which is far from trivial, as we will see presently). This so-called
deductive approach is generally seen as the standard way of conducting research
(at least ideally – actual research by actual people tends to be a bit messier even
conceptually).
We will generally take a deductive approach in this book, but it will frequently
include inductive (exploratory) excursions, as induction is often useful in it-
self (for example, in situations where we do not know enough to state a useful
working hypothesis or where our aim is mainly descriptive) or in the context
of deductive research (where a first exploratory phase might involve inductive
research as a way of generating hypotheses). We will see elements of inductive
research in some of the case studies in Part II of this book.
(1) The English language has a word for the forward-facing window of a car.
Let us assume, for the moment, that we agree on the existence of something
called car that has something accurately and unambiguously described by ‘for-
ward-facing window’, and that we agree on the meaning of “English” and “lan-
guage X has a word for Y”. How could we prove the statement in (1) to be true?
There is only one way: we have to find the word in question. We could, for exam-
ple, describe the concept Forward-Facing Window of Car to a native speaker
or show them a picture of one, and ask them what it is called (a method used in
traditional dialectology and field linguistics). Or we could search a corpus for all
passages mentioning cars and hope that one of them mentions the forward-fac-
ing window; alternatively, we could search for grammatical contexts in which
we might expect the word to be used, such as ⟨ through the NOUN of POSS.PRON
car ⟩ (see Section 4.1 in Chapter 4 on how such a query would have to be con-
structed). Or we could check whether other people have already found the word,
for example by searching the definitions of an electronic dictionary. If we find a
word referring to the forward-facing window of a car, we have thereby proven
its existence – we have verified the statement in (1).
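To make this concrete, the following minimal Python sketch shows what such a corpus query could look like over a hypothetical plain-text corpus file (the file name is a placeholder); note that in an untagged corpus, the NOUN slot can only be approximated by matching an arbitrary word:

```python
import re

# A sketch of the query <through the NOUN of POSS.PRON car>; without
# part-of-speech annotation, the NOUN slot is approximated by any word.
PATTERN = re.compile(
    r"\bthrough the (\w+) of (?:my|your|his|her|its|our|their) car\b",
    re.IGNORECASE,
)

with open("corpus.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

# Each match captures a candidate word for the forward-facing window
# of a car (e.g. windscreen, windshield).
for match in PATTERN.finditer(text):
    print(match.group(1))
```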
But how could we falsify the statement, i.e., how could we prove that English
does not have a word for the forward-facing window of a car? The answer is sim-
ple: we can’t. As discussed extensively in Chapter 1, both native-speaker knowl-
edge and corpora are necessarily finite. Thus, if we ask a speaker to tell us what
the forward-facing window of a car is called and they don’t know, this may be be-
cause there is no such word, or because they do not know this word (for example,
because they are deeply uninterested in cars). If we do not find a word in our cor-
pus, this may be because there is no such word in English, or because the word
just happens to be absent from our corpus, or because it does occur in the corpus
but we missed it. If we do not find a word in our dictionary, this may be because
there is no such word, or because the dictionary-makers failed to include it, or
because we missed it (for example, because the definition is phrased so oddly
that we did not think to look for it – as in the Oxford English Dictionary, which
defines windscreen somewhat quaintly as “a screen for protection from the wind,
now esp. in front of the driver’s seat on a motor-car” (OED, s.v. windscreen)). No
matter how extensively we have searched for something (e.g. a word for a partic-
ular concept), the fact that we have not found it does not mean that it does not
exist.
The statement in (1) is a so-called “existential statement” (it could be rephrased
as “There exists at least one x such that x is a word of English and x refers to
the forward-facing window of a car”). Existential statements can (potentially) be
verified, but they can never be falsified. Their verifiability depends on a crucial
condition hinted at above: that all words used in the statement refer to entities
that actually exist and that we agree on what these entities are. Put simply, the
statement in (1) rests on a number of additional existential statements, such as
“Languages exist”, “Words exist”, “At least one language has words”, “Words refer
to things”, “English is a language”, etc.
There are research questions that take the form of existential statements. For
example, in 2016 the astronomers Konstantin Batygin and Michael E. Brown pro-
posed the existence of a ninth planet (tenth, if you cannot let go of Pluto) in our
solar system (Batygin & Brown 2016). The existence of such a planet would ex-
plain certain apparent irregularities in the orbits of Kuiper belt objects, so the
hypothesis is not without foundation and may well turn out to be true. However,
until someone actually finds this planet, we have no reason to believe or not to
believe that such a planet exists (the irregularities that Planet Nine is supposed
to account for have other possible explanations, cf., e.g. Shankman et al. 2017).
Essentially, its existence is an article of faith, something that should clearly be
avoided in science.1
Nevertheless, existential statements play a crucial role in scientific enquiry –
note that we make existential statements every time we postulate and define a
construct. As pointed out above, the statement in (1) rests, for example, on the
statement “Words exist”. This is an existential statement, whose precise content
depends on how our model defines words. One frequently-proposed definition
is that words are “the smallest units that can form an utterance on their own”
(Matthews 2014: 436), so “Words exist” could be rephrased as “There is at least
one x such that x can form an utterance on its own” (which assumes an additional
existential statement defining utterance, and so on). In other words, scientific
enquiry rests on a large number of existential statements that are themselves
rarely questioned as long as they are useful in postulating meaningful hypotheses
about our research objects.
But if scientific hypotheses are not (or only rarely) existential statements, what
are they instead? As indicated at the end of the previous and the beginning of
the current chapter, they are statements postulating relationships between con-
structs, rather than their existence. The minimal model within which such a hy-
pothesis can be stated is visualized schematically in the cross table (or contingency
table) in Table 3.1.
There must be (at least) two constructs, one of which we want to explain (the
dependent variable), and one which we believe provides an explanation (the in-
dependent variable). Each variable has (at least) two values.
1 Which is not to say that existential statements in science cannot lead to a happy ending –
consider the case of the so-called Higgs boson, a particle with a mass of 125.09 GeV/c² and a
charge and spin of 0, first proposed by the physicist Peter Higgs and five colleagues in 1964. In
2012, two experiments at the Large Hadron Collider in Geneva finally measured such a particle,
thus verifying this hypothesis.
Table 3.1: The intersections of an independent and a dependent variable

                                   Dependent Var.
                                value 1       value 2
Independent Var.    value 1    IV1 ∩ DV1     IV1 ∩ DV2
                    value 2    IV2 ∩ DV1     IV2 ∩ DV2
The dimensions of the table represent the variables (with a loose convention of
showing the values of the independent variable in the table rows and the values
of the dependent variable in the table columns); the cells represent all possible
intersections (i.e., combinations) of their values (these are represented here, and
on occasion in the remainder of the book, by the symbol ∩).
The simplest cases of such hypotheses (in Popper’s view, the only legitimate
case) are so-called universal statements. A text-book example of such a statement
is All swans are white (Popper 1959), where the two constructs are Animal, with
the values swan and non-swan, and Color, with the values white and non-
white. The hypothesis All swans are white amounts to the prediction that the
intersection swan ∩ white exists, while the intersection swan ∩ non-white
does not exist – it makes no predictions about the other two intersections.
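The logic of such a test can be stated very compactly: given a set of categorized observations, the hypothesis forbids exactly one intersection. The following sketch, using a hypothetical list of observations, checks only that intersection:

```python
# Testing the universal statement "All swans are white" against a
# (hypothetical) list of (animal, color) observations.
observations = [
    ("swan", "white"),
    ("goose", "grey"),
    ("swan", "white"),
]

# The hypothesis forbids exactly one intersection: swan ∩ non-white;
# it makes no predictions about non-swans.
counterexamples = [obs for obs in observations
                   if obs[0] == "swan" and obs[1] != "white"]

if counterexamples:
    print("Hypothesis falsified by:", counterexamples)
else:
    print("No counterexample found – corroborated, but not verified.")
```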
Our speculation concerning the distribution of the words windscreen and wind-
shield, discussed in the previous chapter, essentially consists of the two universal
statements given in (2) and (3):

(2) All speakers of British English use the word windscreen (and never windshield)
to refer to the forward-facing window of a car.

(3) All speakers of American English use the word windshield (and never windscreen)
to refer to the forward-facing window of a car.
Note that the statements in (2) and (3) could be true or false independently of
each other (and note also that we are assuming a rather simple model of English,
with British and American English as the only varieties).
How would we test (either one or both of) these hypotheses? Naively, we might
attempt to verify them, as we would in the case of existential statements. This
attempt would be doomed, however, as Popper (1963) forcefully argues.
If we treat the statements in (2) and (3) analogously to the existential statement
in (1), we might be tempted to look for positive evidence only, i.e., for evidence
that appears to support the claim. For example, we might search a corpus of
British English for instances of windscreen and a corpus of American English for
instances of windshield. As mentioned at the end of the previous chapter, the
corresponding queries will indeed turn up cases of windscreen in British English
and of windshield in American English.
If we were dealing with existential statements, this would be a plausible strat-
egy and the results would tell us that the respective words exist in the respective
variety. However, with respect to the universal statements in (2) and (3), the re-
sults tell us nothing. Consider Table 3.2, which is a visual representation of the
hypotheses in (2) and (3).
Table 3.2: A contingency table with binary values for the intersections

                        windscreen   windshield
Variety   british           ✓            ✗
          american          ✗            ✓
What we would have looked for in our naive attempt to verify our hypotheses
are only those cases that should exist (i.e., the intersections indicated by check-
marks in Table 3.2). But if we find such examples, this does not tell us anything
with respect to (2) and (3): we would get the same result if both words occur in
both varieties. As Popper puts it, “[i]t is easy to obtain confirmations, or verifi-
cations, for nearly every theory [i.e., hypothesis, A.S.] – if we look for confirma-
tions” (Popper 1963: 36).
Obviously, we also have to look for those cases that should not exist (i.e., the
intersections indicated by crosses in Table 3.2): the prediction derived from (2)
and (3) is that windscreen should occur exclusively in British English corpora and
that windshield should occur exclusively in American English corpora.
Even if we approach our data less naively and find that our data conform fully
to the hypothesized distribution in Table 3.2, there are two reasons why this does
not count as verification.
First, the distribution could be due to some difference between the corpora
other than the dialectal varieties they represent – it could, for example, be due to
stylistic preferences of the authors, or the house styles of the publishing houses
whose texts are included in the corpora. There are, after all, only a handful of
texts in LOB and BROWN that mention either of the two words at all (three in
each corpus).
result of “a serious but unsuccessful attempt to falsify the theory” (Popper 1963:
36).
In our example, we would have to take the largest corpora of British and Amer-
ican English we can find and search them for counterexamples to our hypothesis
(i.e., the intersections marked by crosses in Table 3.2). As long as we do not find
them (and as long as we find corroborating evidence in the process), we are jus-
tified in assuming a dialectal difference, but we are never justified in claiming to
have proven such a difference. Incidentally, we do indeed find such counterexam-
ples in this case if we increase our samples: The 100-million-word British National
Corpus contains 33 cases of the word windshield (as opposed to 451 cases of wind-
screen), though some of them refer to forward-facing windows of aircraft rather
than cars; conversely, the 450-million-word Corpus of Contemporary American English
contains 205 cases of windscreen (as opposed to 2909 cases of windshield).
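In practice, such figures are obtained by counting the query hits in each corpus separately and cross-tabulating the results. The following sketch shows the general procedure; the file names are placeholders for whatever British and American corpora are used:

```python
import re

# Cross-tabulating two word forms across two corpora (the file names
# are placeholders).
corpora = {"british": "bre_corpus.txt", "american": "ame_corpus.txt"}
words = ["windscreen", "windshield"]

for variety, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # The optional s? also catches the plural forms.
    counts = {w: len(re.findall(r"\b" + w + r"s?\b", text)) for w in words}
    print(variety, counts)
```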
English and gasoline in American English. A search in the four corpora used
above yields the frequencies of occurrence shown in Table 3.3.
Table 3.3: Petrol vs. gasoline

                       Distilled Petroleum
                       petrol     gasoline
Variety   british          21           0
          american          1          20
In other words, the distribution is almost identical to that for the words wind-
screen and windshield – except for one counterexample, where petrol occurs in
the American part of the corpus (specifically, in the FROWN corpus). It seems,
then, that our hypothesis is falsified at least with respect to the word
petrol. Of course, this is true only if we are genuinely dealing with a counterex-
ample, so let us take a closer look at the example in question, which turns out to
be from the novel Eye of the Storm by Jack Higgins:
(4) He was in Dorking within half an hour. He passed straight through and
continued toward Horsham, finally pulling into a petrol station about five
miles outside. (Higgins, Eye of the Storm)
Now, Jack Higgins is a pseudonym used by the novelist Harry Patterson for
some of his novels – and Patterson is British (he was born in Newcastle upon
Tyne and grew up in Belfast and Leeds). In other words, his novel was erro-
neously included in the FROWN corpus, presumably because it was published
by an American publisher. Thus, we can discount the counterexample and main-
tain our original hypothesis. Misclassified data are only one reason to discount
a counterexample; other reasons include intentionally deviant linguistic behavior
(for example, an American speaker may imitate a British speaker or a British
speaker may have picked up some American vocabulary on a visit to the United
States); a more complex reason is discussed below.
Note that there are two problems with the strategy of checking counterex-
amples individually to determine whether they are genuine counterexamples or
not. First, we only checked the example that looked like a counterexample – we
did not check all the examples that fit our hypothesis. However, these examples
could, of course, also contain cases of misclassified data, which would lead to ad-
ditional counterexamples. Of course, we could theoretically check all examples,
as there are only 42 examples overall. However, the larger our corpus is (and
most corpus-linguistic research requires corpora that are much larger than the
four million words used here), the less feasible it becomes to do so.
The second problem is that we were lucky, in this case, that the counterex-
ample came from a novel by a well-known author, whose biographical informa-
tion is easily available. But linguistic corpora do not (and cannot) contain only
well-known authors, and so checking the individual demographic data for ev-
ery speaker in a corpus may be difficult to impossible. Finally, some language
varieties cannot be attributed to a single speaker at all – political speeches are
often written by a team of speech writers that may or may not include the per-
son delivering the speech, newspaper articles may include text from a number of
journalists and press agencies, published texts in general are typically proof-read
by people other than the author, and so forth.
Let us look at a more complex example, the words for the (typically elevated)
paved path at the side of a road provided for pedestrians. Dictionaries typically
tell us that this is called pavement in British English and sidewalk in American
English, for example, the OALD:
(5) a. pavement noun [...]
1 [countable] (British English) (North American English sidewalk) a flat
part at the side of a road for people to walk on [OALD]
b. sidewalk noun [...]
(North American English) (British English pavement) a flat part at the
side of a road for people to walk on [OALD]
A query for the two words (in all their potential morphological and orthogra-
phic variants) against the LOB and FLOB corpora (British English) and BROWN
and FROWN corpora (American English) yields the results shown in Table 3.4.
Table 3.4: Pavement vs. sidewalk
In this case, we are not dealing with a single counterexample. Instead, there
are four apparent counterexamples where sidewalk occurs in British English, and
22 apparent counterexamples where pavement occurs in American English.
(6) a. One persistent taxi follows him through the street, crawling by the
sidewalk...
(LOB E09: Wilfrid T. F. Castle, Stamps of Lebanon’s Dog River)
b. “Keep that black devil away from Rusty or you’ll have a sick horse on
your hands,” he warned, and leaped to the wooden sidewalk.
(LOB N07: Bert Cloos, Drury)
c. There was a small boy on the sidewalk selling melons.
(FLOB K24: Linda Waterman, Bad Connection.)
d. Joe, my love, the snowflakes fell on the sidewalk.
(FLOB K25: Christine McNeill, The Lesson.)
Not much can be found about Wilfrid T.F. (Thomas Froggatt) Castle, other than
that he wrote several books about postal stamps and about history, including the
history of English parish churches, all published by British publishers. There is
a deceased estate notice under the name Wilfrid Thomas Froggatt Castle that
gives his last address in Somerset (The Stationery Office 1999). If this is the same
person, it seems likely that he was British and that (6a) is a genuinely British
English use of sidewalk.
Bert Cloos is the author of a handful of western novels with titles like Sangre
India, Skirmish and Injun Blood. Again, very little can be found out about him, but
he is mentioned in the Los Angeles Times from May 2, 1963 (p. 38), which refers
to him as “Bert Cloos of Encinitas”. Since Encinitas is in California, Bert Cloos
may, in fact, be an American author who ended up in the LOB by mistake – but,
of course, Brits may also live in California, so there is no way of determining this.
Clearly, though, the novels in question are all set in the US, so whether Cloos is
American or not, he is presumably using American English in (6b) above.
For the authors of (6c, d), Linda Waterman and Christine McNeill, no biograph-
ical information can be found at all. Waterman’s story was published in a British
student magazine, but this in itself is no evidence of anything. The story is set
in Latin America, so there may be a conscious effort to evoke American English.
In McNeill’s case there is some evidence that she is British: she uses some words
that are typically British, such as dressing gown (AmE (bath)robe) and breadbin
(AmE breadbox), so it is plausible that she is British. Like Waterman’s story, hers
was published in a British magazine. Interestingly, however, the scene in which
the word is used is set in the United States, so she, too, might be consciously evok-
ing American English. To sum up, we have one example that was likely produced
by an American speaker, and three that were likely produced by British speakers,
although two of these were probably evoking American English. Which of these
examples we may safely discount, however, remains difficult to say.
Turning to pavement in American English, it would be possible to check the
origin of the speakers of all 22 cases with the same attention to detail, but it
is questionable whether the results would be worth the time invested: as pointed
out, it is unlikely that there are so many misclassified examples in the American
corpora.
On closer inspection, however, it becomes apparent that we may be dealing
with a different type of exception here: the word pavement has senses in addition
to the one cited in (5a) above, one of which does exist in American English. Here
is the remainder of the relevant dictionary entry:
Since neither of these meanings is relevant for the issue of British and Amer-
ican words for pedestrian paths next to a road, they cannot be treated as coun-
terexamples in our context. In other words, we have to look at all hits for pave-
ment and annotate them for their appropriate meaning. This in itself is a non-
trivial task, which we will discuss in more detail in Chapters 4 and 5. Take the
example in (8):
(8) [H]e could see the police radio car as he rounded the corner and slammed
on the brakes. He did not bother with his radio – there would be time for
that later – but as he scrambled out on the pavement he saw the filling
station and the public telephone booth ... (BROWN L 18)
Even with quite a large context, this example is compatible with a reading
of pavement as ‘road surface’ or as ‘pedestrian path’. If it came from a British
text, we would not hesitate to assign the latter reading, but since it comes from
an American text (the novel Error of Judgment by the American author George
Harmon Coxe), we might lean towards erring on the side of caution and annotate
it as ‘road surface’. Alas, the side of “caution” here is the side suggested by the
very hypothesis we are trying to falsify – we would be basing our categorization
circularly on what we are expecting to find in the data.
A more intensive search of novels by American authors in the Google Books
archive (which is larger than the BROWN corpus by many orders of magnitude)
turns up clear cases of the word pavement with the meaning of sidewalk, for
example, this passage from a novel by American author Mary Roberts Rinehart:
(9) He had fallen asleep in his buggy, and had wakened to find old Nettie draw-
ing him slowly down the main street of the town, pursuing an erratic but
homeward course, while the people on the pavements watched and smiled.
(Mary Roberts Rinehart, The Breaking Point, Ch. 10)
Since this reading exists, then, we have found a counterexample to our hypoth-
esis and can reject it.
But what does this mean for our data from the BROWN corpus – is there really
nothing to be learned from this sample concerning our hypothesis? Let us say
we truly wanted to err on the side of caution, i.e. on the side that goes against
our hypothesis, and assign the meaning of sidewalk to Coxe’s novel too. Let us
further assume that we can assign all other uses of pavement in the sample to
the reading ‘paved surface’, and that two of the four examples of sidewalk in
the British English corpus are genuine counterexamples. This would give us the
distribution shown in Table 3.5.
Table 3.5: Pavement vs. sidewalk (corrected)
Given this distribution, would we really want to claim that it is wrong to assign
pavement to British and sidewalk to American English on the basis that there are
a few possible counterexamples? More generally, is falsification by counterexam-
ple a plausible research strategy for corpus linguistics?
There are several reasons why the answer to this question must be “no”. First,
we can rarely say with any certainty whether we are dealing with true coun-
terexamples or whether the apparent counterexamples are due to errors in the
(10) a. We must not be rattled into surrender, but we must not – and I am not
– be afraid of negotiation. (LOB A05)
b. We must not be rattled into surrender, but we must not be – and I am
not – afraid of negotiation. (Macmillan 1961)
There is what seems to be an agreement error in (10a), which is due to the fact
that the appositional and I am not is inserted before the auxiliary be, leading to
the ungrammatical am not be. But how do we know it is ungrammatical, since it
occurs in a corpus? In this case, we are in luck, because the example is quoted
from a speech by the former British Prime Minister Harold Macmillan, and the
original transcript shows that he actually said (10b). But not every speaker in a
corpus is a prime minister, just as not every speaker is a well-known author, so it
will not usually be possible to get independent evidence for a particular example.
Take (11), which represents a slightly more widespread agreement “error”:
(11) It is, however, reported that the tariff on textiles and cars imported from
the Common Market are to be reduced by 10 percent. (LOB A15)
Here, the auxiliary be should agree with its singular subject tariff, but instead,
the plural form occurs. There is no way to find out who wrote it and whether
they intended to use the singular form but were confused by the embedded plural
NP textiles and cars (a likely explanation). Thus, we would have to discard it
based on our intuition that it constitutes an error (the LOB creators actually mark
it as such, but I have argued at length in Chapter 1 why this would defeat the
point of using a corpus in the first place), or we would have to accept it as a
counterexample to the generalization that singular subjects take singular verbs
(which we are unlikely to want to give up based on a single example).
In theoretical terms, this may not be a definitive argument against the idea
of falsification by counterexample. We could argue that we simply have to make
sure that there are no errors in the construction of our corpus and that we have to
classify all hits correctly as constituting a genuine counterexample or not. How-
ever, in actual practice this is impossible. We can (and must) try to minimize
errors in our data and our classification, but we can never get rid of them com-
pletely (this is true not only in corpus linguistics but in any discipline).
Second, even if our data and our classification were error-free, human behav-
ior is less deterministic than the physical processes Popper had in mind when
he elevated counterexamples to the sole acceptable evidence in science. Even
in a simple case like word choice, there may be many reasons why a speaker
may produce an exceptional utterance – evoking a variety other than their own
(as in the examples above), unintentionally or intentionally using a word that
they would not normally use because their interlocutor has used it, temporarily
slipping into a variety that they used to speak as a child but no longer do, etc.
With more complex linguistic behavior, such as producing particular grammati-
cal structures, there will be additional reasons for exceptional behavior: planning
errors, choosing a different formulation in mid-sentence, tiredness, etc. – all the
kinds of things classified as performance errors in traditional grammatical the-
ory.
In other words, our measurements will never be perfect and speakers will
never behave perfectly consistently. This means that we cannot use a single
counterexample (or even a handful of counterexamples) as a basis for rejecting
a hypothesis, even if that hypothesis is stated in terms of a universal statement.
However, as pointed out above, many (if not most) hypotheses in corpus lin-
guistics do not take the form of universal statements (“All X’s are Y”, “Z’s always
do Y”, etc.), but are stated in terms of tendencies or preferences (“X’s tend to be Y”, “Z’s pre-
fer Y”, etc.). For example, there are a number of prepositions and/or adverbs in
English that contain the morpheme -ward or -wards, such as afterward(s), back-
ward(s), downward(s), inward(s), outward(s) and toward(s). These two morphemes
are essentially allomorphs of a single suffix that are in free variation: they have
the same etymology (-wards simply includes a lexicalized genitive ending), they
have both existed throughout the recorded history of English and there is no dis-
cernible difference in meaning between them. However, many dictionaries claim
that the forms ending in -s are preferred in British English and the ones without
the -s are preferred in American English.
We can turn this claim into a hypothesis involving two variables (Variety and
Suffix Variant), but not one of the type “All x are y”. Instead, we would have
to state it along the lines of (12) and (13):

(12) In British English, the variants ending in -s tend to be preferred.

(13) In American English, the variants without -s tend to be preferred.
(14) a. [T]he tall young buffalo hunter pushed open the swing doors and
walked towards the bar. (BROWN N)
b. Then Angelina turned and with an easy grace walked toward the
kitchen. (BROWN K)
                        Suffix Variant
                        -ward     -wards
Variety   british          ∘         ○
          american         ○         ∘
3.2 Operationalization
The discussion so far has shown some of the practical challenges posed even by
a simple construct like Variety with seemingly obvious values such as british
and american. However, there is a more fundamental, and more challenging
issue to consider: As hinted at in Section 3.1.1, we are essentially making (sets
of) existential statements when we postulate such constructs. All examples dis-
cussed above simply assumed the existence of something called “British English”
and “American English”, concepts that in turn presuppose the existence of some-
thing called “English” and of the properties “British” and “American”. But if we
claim the existence of these constructs, we must define them; what is more, we
must define them in a way that enables us (and others) to find them in the real
world (in our case, in samples of language use). We must provide what are referred
to as operational definitions.
(15) 1 FIRM TO TOUCH firm, stiff, and difficult to press down, break, or cut
[≠ soft] (LDCE, s.v. hard; cf. also the virtually identical definitions in CALD,
MW and OALD)
the DSM-IV but places more emphasis on (and is more specific with respect to)
mental symptoms and less emphasis on social behaviors.
As should have become clear, operational definitions do not (and do not at-
tempt to) capture the “essence” of the things or phenomena they define. We can-
not say that the Vickers Hardness number “is” hardness or that the DSM-IV list
of symptoms “is” schizophrenia. They are simply ways of measuring or diagnos-
ing these phenomena. Consequently, it is pointless to ask whether operational
definitions are “correct” or “incorrect” – they are simply useful in a particular
context. However, this does not mean that any operational definition is as good
as any other. A good operational definition must have two properties: it must be
reliable and valid.
A definition is reliable to the degree that different researchers can use it at dif-
ferent times and all get the same results; this objectivity (or at least intersubjec-
tivity) is one of the primary motivations for operationalization in the first place.
Obviously, the reliability of operational definitions will vary depending on the
degree of subjective judgment involved: while Vickers Hardness is extremely reli-
able, depending only on whether the apparatus is in good working order and the
procedure is followed correctly, the DSM-IV definition of schizophrenia is much
less reliable, depending, to some extent irreducibly, on the opinions and experi-
ence of the person applying it. Especially in the latter case it is important to test
the reliability of an operational definition empirically, i.e. to let different people
apply it and see to what extent they get the same results (see further Chapter 4).
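In its simplest form, such a test amounts to letting two annotators categorize the same items and computing the proportion of items on which they agree, as in the following sketch (the annotation decisions shown are hypothetical):

```python
# Two (hypothetical) annotators applying the same operational definition
# to the same five items.
annotator_a = ["noun", "verb", "noun", "noun", "verb"]
annotator_b = ["noun", "verb", "verb", "noun", "verb"]

matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
agreement = matches / len(annotator_a)
print(f"Raw agreement: {agreement:.2f}")  # Raw agreement: 0.80

# Raw agreement does not correct for chance agreement; chance-corrected
# measures such as Cohen's kappa are commonly used instead.
```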
A definition is valid to the degree that it actually measures what it is supposed
to measure. Thus, we assume that there are such phenomena as “hardness” or
“schizophrenia” and that they may be more or less accurately captured by an
operational definition. Validity is clearly a very problematic concept: since phe-
nomena can only be measured by operational definitions, it would be circular to
assess the quality of the same definitions on the basis of these measures. One
indirect indication of validity is consistency (e.g., the phenomena identified by
the definition share a number of additional properties not mentioned in the def-
inition), but to a large extent, the validity of operationalizations is likely to be
assessed on the basis of plausibility arguments. The more complex and the less
directly accessible a construct is, the more problematic the concept of validity
becomes: While everyone would agree that there is such a thing as Hardness,
this is much less clear in the case of Schizophrenia: it is not unusual for psy-
chiatric diagnoses to be reclassified (for example, what was Asperger’s syndrome
in the DSM-IV became part of autism spectrum disorder in the DSM-V) or to be
dropped altogether (as was the case with homosexuality, which was treated as
a mental disorder by the DSM-II until 1974). Thus, operational definitions may
create the construct they are merely meant to measure; it is therefore important
to keep in mind that even a construct that has been operationally defined is still
just a construct, i.e. part of a theory of reality rather than part of reality itself.
3. operational definitions and the procedure by which they have been applied
may be explicitly stated.
(18) a. We are a women’s college, one of only 46 women’s colleges in the United
States and Canada (womenscollege.du.edu)
b. That wasn’t too far from Fifth Street, and should allow him to make
Scotty’s Bar by midnight. (BROWN L05)
c. My Opera was the virtual community for Opera web browser users.
(Wikipedia, s.v. My Opera)
d. ‘Oh my God!’ she heard Mike mutter under his breath, and she laughed
at his discomfort. (BNC HGM)
e. The following day she caught an early train from King’s Cross station
and set off on the two-hundred-mile journey north. (BNC JXT)
f. The true tack traveller would spend his/her honeymoon in a motel, on
a heart-shaped water bed. (BNC AAV)
While all of these cases have the form of the possessive construction and match
the strings above, opinions may differ on whether they should be included in a
sample of English possessive constructions. Example (18a) is a so-called posses-
sive compound, a lexicalized possessive construction that functions like a conven-
tional compound and could be treated as a single word. In examples (18b and c),
the possessive construction is a proper name. Concerning the latter: if we want
to include it, we would have to decide whether also to include proper names
where possessive pronoun and noun are spelled as a single word, as in MySpace
(the name of an online social network now lost in history). Example (18d) is sim-
ilar in that my God is used almost like a proper name; in addition, it is part of a
fixed phrase. Example (18e) is a geographical name; here, the problem is that such
names are increasingly spelled without an apostrophe, often by conscious deci-
sions by government institutions (see Swaine 2009; Newman 2013). If we want
to include them, we have to decide whether also to include spellings without the
apostrophe (such as (19)), and how to find them in the corpus:
(19) His mother’s luggage had arrived at Kings Cross Station in London, and of
course nobody collected it. (BNC H9U)
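The practical consequence is that the query has to cover both spellings. The following sketch illustrates the problem with two deliberately simplistic regular expressions: the apostrophe-less pattern necessarily over-generates, because such possessives are string-identical to plurals, so its hits have to be sorted manually:

```python
import re

# With the apostrophe, s-possessives are easy to match.
possessive = re.compile(r"\b\w+['’]s\b")

# Without the apostrophe, we can only guess, e.g. capitalized words
# ending in -s followed by another capitalized word; this over-generates.
bare_possessive = re.compile(r"\b[A-Z]\w+s\b(?=\s+[A-Z])")

text = ("His mother's luggage had arrived at Kings Cross Station in "
        "London, and of course nobody collected it.")
print(possessive.findall(text))       # ["mother's"]
print(bare_possessive.findall(text))  # ['Kings', 'Cross'] – 'Cross' is a false hit
```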
in these tags indicates that additional letters may follow to distinguish subcat-
egories, such as tense. Again, the corpora seem to recognize the same subcate-
gories: for example, third person forms are signaled by a Z in BROWN and the
BNC.
In other cases, the categories themselves differ. For example, in BROWN, all
prepositions are labeled IN, while the BNC distinguishes of from other preposi-
tions by labeling the former PRF and the latter PRP; FLOB has a special tag for
the preposition for, IF; LOB labels all coordinating conjunctions CC, FLOB has a
special tag for BUT, CCB. More drastically, LOB and FLOB treat some sequences
of orthographic words as multi-word tokens belonging to a single word class:
in front of is treated as a preposition in LOB and FLOB, indicated by labeling
all three words IN (LOB) and II (FLOB), with an additional indication that they
are part of a sequence: LOB attaches straight double quotes to the second and
third word, FLOB adds a 3 to indicate that they are part of a three word sequence
and then a number indicating their position in the sequence. Such tag sequences,
called ditto tags, make sense only if you believe that the individual parts in a multi-
word expression lose their independent word-class membership. Even then, we
have to check very carefully which particular multi-word sequences are treated
like this and decide whether we agree. The makers of BROWN and the BNC ob-
viously had a more traditional view of word classes, simply treating in front of
as a sequence of a preposition, a noun, and another preposition (BROWN) or
specifically the subcategory of (BNC).
Ditto tags are a way of tokenizing the corpus at orthographic word bound-
aries while allowing words to span more than one token. But tokenization itself
also differs across corpora. For example, BROWN tokenizes only at orthographic
word boundaries (white space or punctuation), while the other three corpora also
tokenize at clitic boundaries. They all treat the n’t in words like don’t, doesn’t,
etc. as separate tokens, labeling it XNOT (LOB), XX (FLOB) and XX0 (BNC), while
BROWN simply indicates that a word contains this clitic by attaching an asterisk
to the end of the POS tag (other clitics, like ’ll, ’s, etc., are treated similarly).
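The following sketch shows what tokenization at clitic boundaries amounts to in practice; the regular expressions are simplified illustrations, not the actual procedures used by the corpus makers:

```python
import re

def tokenize(text):
    # Split off n't, as LOB, FLOB and the BNC do (BROWN instead keeps the
    # orthographic word intact and marks the clitic in the POS tag).
    text = re.sub(r"(\w)(n['’]t)\b", r"\1 \2", text)
    # Split off other common clitics ('s, 'll, 're, 've, 'd, 'm).
    text = re.sub(r"(['’](?:s|ll|re|ve|d|m))\b", r" \1", text)
    # Separate words and punctuation into individual tokens.
    return re.findall(r"[\w'’]+|[^\w\s]", text)

print(tokenize("She doesn't know, but she'll ask."))
# ['She', 'does', "n't", 'know', ',', 'but', 'she', "'ll", 'ask', '.']
```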
It is clear, then, that tokenization and part-of-speech tagging are not inherent
in the text itself, but are the result of decisions by the corpus makers. But in what
sense can these decisions be said to constitute operational definitions? There are
two different answers to this question. The first answer is that the theories of
tokenization and word classes are (usually) explicitly described in the corpus
manual itself or in a guide as to how to apply the tag set. A good example of the
latter is Santorini (1990), the most widely cited tagging guideline for the PENN
tagset developed for the PENN treebank but now widely used.
As an example, consider the instructions for the POS tags DT and JJ, beginning
with the former:
Determiner – DT This category includes the articles a(n), every, no and the,
the indefinite determiners another, any and some, each, either (as in either
way), neither (as in neither decision), that, these, this and those, and instances
of all and both when they do not precede a determiner or possessive pro-
noun (as in all roads or both times). (Instances of all or both that do precede
a determiner or possessive pronoun are tagged as predeterminers (PDT).)
Since any noun phrase can contain at most one determiner, the fact that
such can occur together with a determiner (as in the only such case) means
that it should be tagged as an adjective (JJ), unless it precedes a determiner,
as in such a good time, in which case it is a predeterminer (PDT). (Santorini
1990: 2)
Note that special cases are also listed in the definition of DT, which contains
a discussion of grammatical contexts under which the words listed at the begin-
ning of the definition should instead be tagged as predeterminers (PDT) or adjec-
tives (JJ). There is also an entire section in the tagging guidelines that deals with
special, exceptional or generally unclear cases; as an example, consider the pas-
sage distinguishing uses of certain words as conjunctions (CC) and determiners
(DT):
CC or DT When they are the first members of the double conjunctions both
... and, either ... or and neither ... nor, both, either and neither are tagged as
coordinating conjunctions (CC), not as determiners (DT).
The mixture of reliance on generally accepted terminology, word lists and il-
lustrations is typical of tagging guidelines (and, as we saw in Section 3.2, of anno-
tation schemes in general). Nevertheless, such tagging guidelines can probably
be applied with a relatively high degree of interrater reliability (although I am
not aware of a study testing this), but they require considerable skill and expe-
rience (try to annotate a passage from your favorite novel or a short newspaper
article to see how quickly you run into problems that require some very deep
thinking).
However, POS tagging is not usually done by skilled, experienced annotators,
bringing us to the second, completely different way in which POS tags are based
on operational definitions. The usual way in which corpora are annotated for
parts of speech is by processing them using a specialized software application
called a tagger (a good example is the Tree Tagger (Schmid 1994), which can be
downloaded, studied and used relatively freely).
Put simply, these taggers work as follows: For each word, they take into ac-
count the probabilities with which the word is tagged as A, B, C, etc., and the
probability that a word tagged as A, B, C should occur at this point given the tag
assigned to the preceding word. The tagger essentially multiplies both probabil-
ities and then chooses the tag with the highest joint probability. As an example,
consider the word cost in (20b), the beginning of which I repeat here:
The wordform cost has a probability of 0.73 (73 percent) of representing a noun
and a probability of 0.27 (27 percent) of representing a verb. If the tagger simply
went by these probabilities, it would assign the tag NN. However, the probability
that a modal verb is followed by a noun is 0.01 (1 percent), while the probability
that it is followed by a verb is 0.8 (80 percent). The tagger now multiplies the
probabilities for noun (0.73 × 0.01 = 0.0073) and for verb (0.27 × 0.8 = 0.216). Since
the latter is much higher, the tagger will tag the word (correctly, in this case, as
a verb).
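The following sketch reproduces this calculation with the rounded probabilities just cited; it follows the simplified description given here, not the full workings of an actual tagger:

```python
# Choosing a tag for "cost" after a modal verb by multiplying the two
# probabilities described above.
emission = {"NN": 0.73, "VB": 0.27}    # P(tag | word form "cost")
transition = {"NN": 0.01, "VB": 0.80}  # P(tag | preceding tag is a modal)

scores = {tag: emission[tag] * transition[tag] for tag in emission}
for tag in scores:
    print(tag, round(scores[tag], 4))  # NN 0.0073 / VB 0.216

print(max(scores, key=scores.get))     # VB: "cost" is tagged as a verb
```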
But how does the tagger know these probabilities? It has to “learn” them from
a corpus that has been annotated by hand by skilled, experienced annotators
based on a reliable, valid annotation scheme. Obviously, the larger this corpus,
the more accurate the probabilities, and the more likely it is that the tagger will be correct.
I will return to this point presently, but first, note that in corpora which have been
POS tagged automatically, the tagger itself and the probabilities it uses are the
operational definition. In terms of reliability, this is a good thing: If we apply the
same tagger to the same text several times, it will give us the same result every
time.
In terms of validity, this is a bad thing in two ways: first, because the tagger
assigns tags based on learned probabilities rather than definitions. This is likely
to work better in some situations than in others, which means that incorrectly
assigned tags will not be distributed randomly across parts of speech. For exam-
ple, the is unlikely to be tagged incorrectly, as it is always a determiner, but that
is more likely to be tagged incorrectly, as it is a conjunction about two thirds of
the time and a determiner about one third of the time. Likewise, horse is unlikely
to be tagged incorrectly as it is a noun 99 percent of the time, but riding is more
likely to be tagged incorrectly, as it is a noun about 15 percent of the time and a
verb about 85 percent of the time. A sequence like the horse is almost certain to
be tagged correctly, but a sequence like that riding much less so. What is worse,
in the latter case, whether riding will be tagged correctly depends on whether
that has been tagged correctly. If that has been tagged as a determiner, riding
will be (correctly) tagged as a noun, as verbs never follow determiners and the
joint probability that it is a verb will be zero. In contrast, if that has been tagged
as a conjunction, the tagger will tag riding as a verb: conjunctions are followed
by verbs with a probability of 0.16 and by nouns with a probability of 0.11, and
so the joint probability that it is a verb (0.16 × 0.85 = 0.136) is higher than the
joint probability that it is a noun (0.11 × 0.15 = 0.0165). This will not always be
the right decision, as (22) shows:
(22) [W]e did like to make it quite clear during our er discussions that riding
of horses on the highway is a matter for the TVP (BNC JS8)
In short, some classes of word forms (like ing-forms of verbs) are more difficult
to tag correctly than others, so incorrectly assigned tags will cluster around such
cases. This can lead to considerable distortions in the tagging of specific words
and grammatical constructions. For example, in the BNC, the word form regard
is systematically tagged incorrectly as a verb in the complex prepositions with
regard to and in regard to, but is correctly tagged as a noun in most instances of
the phrase in high regard. In other words, particular linguistic phenomena will be
severely misrepresented in the results of corpus queries based on automatically
assigned tags or parse trees.
Sometimes the probabilities of two possible tags are very close. In these cases,
some taggers will stoically assign the more probable tag even if the difference in
probabilities is small. Other taggers will assign so-called ambiguity or portman-
teau tags, as in the following example from the BNC:
3.2.2.2 Length
There is a wide range of phenomena that has been claimed and/or shown to be
related to the weight of linguistic units (syllables, words or phrases) – word-order
phenomena following the principle “light before heavy”, such as the dative alter-
nation (Thompson & Koide 1987), particle placement (Chen 1986), s-possessive (or
“genitive”) and of-construction (Deane 1987) and frozen binominals (Sobkowiak
1993), to name just a few. In the context of such claims, weight is sometimes un-
derstood to refer to structural complexity, sometimes to length, and sometimes
to both. Since complexity is often difficult to define, it is, in fact, frequently op-
erationalized in terms of length, but let us first look at the difficulty of defining
length in its own right and briefly return to complexity below.
Let us begin with words. Clearly, words differ in length – everyone would
agree that the word stun is shorter than the word flabbergast. There are a number
of ways in which we could operationalize Word Length, all of which would
allow us to confirm this difference in length:
• as “number of letters” (cf., e.g., Wulff 2003), in which case flabbergast has
a length of 11 and stun has a length of 4;
2 Note that I have limited the discussion here to definitions of length that make sense in the
domain of traditional linguistic corpora; there are other definitions, such as phonetic length
glish (and other stress-timed languages) process them phonemically (in which
case it depends on the phenomenon which of the measures is more valid).3
Finally, note that of course phonemic and/or syllabic length correlate with
orthographic length to some extent (in languages with phonemic and syllabic
scripts), so we might use the easily and reliably measured orthographic length
as an operational definition of phonemic and/or syllabic length and assume that
mismatches will be infrequent enough to be lost in the statistical noise (cf. Wulff
2003).
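The following sketch illustrates two of these operationalizations; the syllable count is a rough orthographic heuristic (counting groups of vowel letters), not a phonological analysis:

```python
import re

def length_in_letters(word):
    return len(word)

def length_in_syllables(word):
    # Rough heuristic: count groups of adjacent vowel letters.
    return len(re.findall(r"[aeiouy]+", word.lower()))

for word in ["stun", "flabbergast"]:
    print(word, length_in_letters(word), length_in_syllables(word))
# stun 4 1
# flabbergast 11 3
```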
When we want to measure the length of linguistic units above word level,
e.g. phrases, we can choose all of the above methods, but additionally or instead
we can (and more typically do) count the number of words and/or constituents
(cf. e.g. Gries (2003a) for a comparison of syllables and words as a measure of
length). Here, we have to decide whether to count orthographic words (which is
very reliable but may or may not be valid), or phonological words (which is less
reliable, as it depends on our theory of what constitutes a phonological word).
As mentioned at the beginning of this subsection, weight is sometimes under-
stood to refer to structural complexity rather than length. The question how to
measure structural complexity has been addressed in some detail in the case of
phrases, where it has been suggested that Complexity could be operationalized
as “number of nodes” in the tree diagram modeling the structure of the phrase
(cf. Wasow & Arnold 2003). Such a definition has a high validity, as “number of
nodes” directly corresponds to a central aspect of what it means for a phrase to
be syntactically complex, but as tree diagrams are highly theory-dependent, the
reliability across linguistic frameworks is low.
Structural complexity can also be operationalized at various levels for words.
The number of nodes could be counted in a phonological description of a word.
For example, two words with the same number of syllables may differ in the
complexity of those syllables: amaze and astound both have two syllables, but
the second syllable of amaze follows a simple CVC pattern, while that of astound
has the much more complex CCVCC pattern. The number of nodes could also be
counted in the morphological structure of a word. In this case, all of the words
mentioned above would have a length of one, except disconcert, which has a
length of 2 (dis + concert).
3 The difference between these two language types is that in stress-timed languages, the time
between two stressed syllables tends to be constant regardless of the number of unstressed
syllables in between, while in syllable-timed languages every syllable takes about the same
time to pronounce. This suggests an additional possibility for measuring length in stress-timed
languages, namely the number of stressed syllables. Again, I am not aware of any study that
has discussed the operationalization of word length at this level of detail.
(24) Referential Distance [...] assesses the gap between the previous occurrence
in the discourse of a referent/topic and its current occurrence in a clause,
where it is marked by a particular grammatical coding device. The gap is
thus expressed in terms of number of clauses to the left. The minimal value
that can be assigned is thus 1 clause [...] (Givón 1983: 13)
(25) Joan, though Anne’s junior₁ by a year and not yet fully accustomed₂ to
the ways of the nobility, was₃ by far the more worldly-wise of the two. She
watched₄, listened₅, learned₆ and assessed₇, speaking₈ only when spoken₉
to in general – whilst all the while making₁₀ her plans and looking₁₁ to
the future... Enchanted₁₂ at first by her good fortune in becoming₁₃ Anne
Mowbray’s companion, grateful₁₄ for the benefits showered₁₅ upon her,
Joan rapidly became₁₆ accustomed to her new role. (BNC CCD)

Let us assume the traditional definition of a clause as a finite verb and its
dependents and let us assume that only overt references are counted. If we apply
these definitions very narrowly, we would put the referential distance between
the initial mention of Joan and the first pronominal reference at 1, as Joan is a
dependent of was in clause (25₃) and there are no other finite verbs between this
mention and the pronoun she. A broader definition of clause along the lines of “a
unit expressing a complete proposition”, however, might include the structures
(25₁) (though Anne’s junior by a year) and (25₂) (not yet fully accustomed to the
ways of the nobility), in which case the referential distance would be 3 (a similar
problem is posed by the potential clauses (25₁₂) and (25₁₄), which do not contain
finite verbs but do express complete propositions). Note that if we also count the
NP the two as including reference to the person named Joan, the distance to she
would be 1, regardless of how the clauses are counted.

In fact, the structures (25₁) and (25₂) pose an additional problem: they are
dependent clauses whose logical subject, although it is not expressed, is clearly
coreferential with Joan. It depends on our theory whether these covert logical
subjects are treated as elements of grammatical and/or semantic structure; if they
are, we would have to include them in the count.

The differences that decisions about covert mentions can make are even more
obvious when calculating the referential distance of the second pronoun, her
(in her plans). Again, assuming that every finite verb and its dependents form a
clause, the distance between her and the previous use she is six clauses (25₄ to
25₉). However, in all six clauses, the logical subject is also Joan. If we include
these as mentions, the referential distance is 1 again (her good fortune is part of
the clause (25₁₂) and the previous mention would be the covert reference by the
logical subject of clause (25₁₁)).

Finally, note that I have assumed a very flat, sequential understanding of
“number of clauses”, counting every finite verb separately. However, one could argue
that the sequence She watched₄, listened₅, learned₆ and assessed₇ is actually a
single clause with four coordinated verb phrases sharing the subject she, that
speaking₈ only when spoken₉ to in general is a single clause consisting of a matrix
clause and an embedded adverbial clause, and that this clause itself is dependent
on the clause with the four verb phrases. Thus, the sequence from (25₄) to (25₉)
can be seen as consisting of six, two or even just one clause, depending on how
we decide to count clauses in the context of referential distance.
Obviously, there is no “right” or “wrong” way to count clauses; what matters
is that we specify a way of counting clauses that can be reliably applied and that
is valid with respect to what we are trying to measure. With respect to reliability,
obviously the simpler our specification, the better (simply counting every verb,
whether finite or not, might be a good compromise between the two definitions
mentioned above). With respect to validity, things are more complicated: refer-
ential distance is meant to measure the degree of activation of a referent, and
different assumptions about the hierarchical structure of the clauses in question
are going to have an impact on our assumptions concerning the activation of the
entities referred to by them.
Since specifying what counts as a clause and what does not is fairly complex, it
might be worth thinking about more objective, less theory-dependent measures
of distance, such as the number of (orthographic) words between two mentions
(I am not aware of studies that do this, but finding out to what extent the results
correlate with clause-based measures of various kinds seems worthwhile).
For practical as well as for theoretical reasons, it is plausible to introduce a cut-
off point for the number of clauses we search for a previous mention of a referent:
practically, it will become too time-consuming to search beyond a certain point;
theoretically, it is arguable to what extent a distant previous occurrence of a
referent contributes to the current information status. Givón (1983) originally
set this cut-off point at 20 clauses, but there are also studies setting it at ten or
even at three clauses. Clearly, there is no “correct” number of clauses, but there is
empirical evidence that the relevant distinctions are those between a referential
distance of 1, a distance between 2 and 3, and a distance greater than 3 (cf. Givón 1992).
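Once the decisions about clause segmentation and covert mentions have been made, the measure itself is mechanical, as the following sketch shows; the clause representation and the mini-text are hypothetical, and the hard part – segmenting the text into clauses and identifying mentions – is assumed to have been done already:

```python
def referential_distance(clauses, referent, position, cutoff=20):
    """Number of clauses between the mention at `position` and the most
    recent previous mention of `referent`, up to `cutoff`."""
    for distance in range(1, cutoff + 1):
        index = position - distance
        if index < 0:
            break
        if referent in clauses[index]:
            return distance
    return cutoff  # anything beyond the cut-off counts as maximally distant

# Hypothetical mini-text: each clause is the set of referents mentioned in it.
clauses = [{"joan", "anne"}, {"anne"}, {"nobility"}, {"joan"}]
print(referential_distance(clauses, "joan", 3))  # 3
```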
Note that, as an operational definition of “topicality” or “givenness”, it will
miss a range of referents that are “topical” or “given”. For example, there are
referents that are present in the minds of speakers because they are physically
present in the speech situation, or because they constitute salient shared knowl-
edge for them, or because they talked about them at a previous occasion, or be-
cause they were mentioned prior to the cut-off point. Such referents may already
be “given” at the point that they are first mentioned in the discourse.
Conversely, the definition may wrongly classify referents as discourse-active.
For example, in conversational data an entity may be referred to by one speaker
but be missed or misunderstood by the hearer, in which case it will not consti-
tute given information to the hearer (Givón originally intended the measure for
narrative data only, where this problem will not occur).
Both Word Length and Discourse Status are phenomena that can be de-
fined in relatively objective, quantifiable terms – not quite as objectively as phys-
ical Hardness, perhaps, but with a comparable degree of rigor. Like Hardness
measures, they do not access reality directly and are dependent on a number
of assumptions and decisions, but providing that these are stated sufficiently
explicitly, they can be applied almost automatically. While Word Length and
Discourse Status are not the only such phenomena, they are not typical either.
Most phenomena that are of interest to linguists (and thus, to corpus linguists) re-
quire operational definitions that are more heavily dependent on interpretation.
Let us look at two such phenomena, Word Sense and Animacy.
3.2.2.4 Word senses
There are three senses of pavement, as shown by the numbers attached, and in
each case there are synonyms. Of course, in order to turn this into an operational
definition, we need to specify a procedure that allows us to assign the hits in
our corpus to these categories. For example, we could try to replace the word
pavement by a unique synonym and see whether this changes the meaning. But
even this, as we saw in Section 3.1.2 above, may be quite difficult.
There is an additional problem: We are relying on someone else’s decisions
about which uses of a word constitute different senses. In the case of pavement,
this is fairly uncontroversial, but consider the entry for the noun bank:
(27) a. bank#1 (sloping land (especially the slope beside a body of water))
b. bank#2, depository financial institution#1, banking concern#1,
banking company#1 (a financial institution that accepts deposits and
channels the money into lending activities)
c. bank#3 (a long ridge or pile)
d. bank#4 (an arrangement of similar objects in a row or in tiers)
e. bank#5 (a supply or stock held in reserve for future use (especially in
emergencies))
f. bank#6 (the funds held by a gambling house or the dealer in some gam-
bling games)
g. bank#7, cant#2, camber#2 (a slope in the turn of a road or track; the
outside is higher than the inside in order to reduce the effects of cen-
trifugal force)
h. savings bank#2, coin bank#1, money box#1, bank#8 (a container (usu-
ally with a slot in the top) for keeping money at home)
i. bank#9, bank building#1 (a building in which the business of banking
transacted)
j. bank#10 (a flight maneuver; aircraft tips laterally about its longitudinal
axis (especially in turning))
While everyone will presumably agree that (27a) and (27b) are separate senses
(or even separate words, i.e. homonyms), it is less clear whether everyone would
distinguish (27b) from (27i) and/or (27f); or (27e) and (27f), or even (27a) and (27g).
In these cases, one could argue that we are just dealing with contextual variants
of a single underlying meaning.
Thus, we have the choice of coming up with our own set of senses (which has
the advantage that it will fit more precisely into the general theoretical frame-
work we are working in and that we might find it easier to apply), or we can
stick with an established set of senses such as that proposed by WordNet, which
has the advantage that it is maximally transparent to other researchers and that
we cannot subconsciously make it fit our own preconceptions, thus distorting
our results in the direction of our hypothesis. In either case, we must make the
set of senses and the criteria for applying them transparent, and in either case
we are dealing with an operational definition that does not correspond directly
with reality (if only because word senses tend to form a continuum rather than
a set of discrete categories in actual language use).
3.2.2.5 Animacy
The animacy of the referents of noun phrases plays a role in a range of grammati-
cal processes in many languages. In English, for example, it has been argued (and
shown) to be involved in the grammatical alternations already discussed above;
in other languages, it is involved in grammatical gender, in alignment systems,
etc.
The simplest distinction in the domain of Animacy would be the following:
(30) human vs. animate vs. concrete inanimate vs. abstract inanimate
The distinction between concrete and abstract raises the practical issue where
to draw the line (for example, is electricity concrete?). It also raises a deeper issue
that we will return to: are we still dealing with a single dimension? Are abstract
inanimate entities (say, marriage or Wednesday) really less “animate” than con-
crete entities like a wedding ring or a calendar? And are animate and abstract
incompatible, or would it not make sense to treat the referents of words like god,
demon, unicorn, etc. as abstract animate?
3.3 Hypotheses in context: The research cycle
(“error elimination”) by testing, but also by “critical discussion”, and P2 stands for
new or additional problems and research questions arising from the falsification
process.4 Popper also acknowledges that it is good research practice to entertain
several hypotheses at once, if there is more than one promising explanation for
a problem situation, expanding his formula as follows:

(32) P1 → TT1, TT2, … TTn → EE → P2
He explicitly acknowledges that falsification, while central, is not the only cri-
terion by which science proceeds: if there are several unfalsified hypotheses, we
may also assess them based on which promises the most insightful explanation
or which produces the most interesting additional hypotheses (Popper 1970: 3).
Crucially, (31) and (32) suggest a cyclic and incremental approach to research:
the status quo in a given field is the result of a long process of producing new
hypotheses and eliminating errors, and it will, in turn, serve as a basis for more
new hypotheses (and more errors which need to be eliminated). This incremental
cyclicity can actually be observed in scientific disciplines. In some, like physics
or psychology, researchers make this very explicit, publishing research in the
form of series of experiments attempting to falsify certain existing hypotheses
and corroborating others, typically building on earlier experiments by others or
themselves and closing with open questions and avenues of future research. In
other disciplines, like the more humanities-leaning social sciences including (cor-
pus) linguistics, the cycle is typically less explicit, but viewed from a distance, re-
searchers also follow this procedure, summarizing the ideas of previous authors
(sometimes to epic lengths) and then adding more or less substantial data and
arguments of their own.
Fleshing out Popper’s basic schema in (31) above, drawing together the points
discussed in this and the previous chapter, we can represent this cycle as shown
in Figure 3.1.
Research begins with a general question – something that intrigues an individ-
ual or a group of researchers. The part of reality to which this question pertains is
then modeled, i.e., described in terms of theoretical constructs, enabling us to for-
mulate, first, a more specific research question, and often, second, a hypothesis.
4 Of course, Popper did not invent, or claim to have invented, this procedure. He was simply
explicating what he thought successful scientists were, and ought to be, doing (Rudolph (2005)
traces the explicit recognition of this procedure to John Dewey’s still very readable How we
think (Dewey 1910), which contains insightful illustrations).
Figure 3.1: The research cycle (from research question via modeling (constructs) and hypothesis to predictions, design, annotation and analysis, ending in falsification or corroboration, with a design check feeding back into the cycle; the top row of the figure aligns these steps with Popper’s P, TT and EE)
There is nothing automatic about these steps – they are typically characterized
by lengthy critical discussion, false starts or wild speculation, until testable hy-
potheses emerge (in some disciplines, this stage has not yet been reached, and in
some cases probably never will be). Next, predictions must be derived, requiring
operational definitions of the constructs posited previously. This may require
some back and forth between formulating predictions and providing sufficiently
precise operationalizations.
Next, the predictions must be tested – in the case of corpus linguistics, cor-
pora must be selected and data must be retrieved and annotated, something we
will discuss in detail in the next chapter. Then the data are analyzed with respect
to the hypothesis. If they corroborate the hypothesis (or at least fail to falsify
it), this is not the end of the process: with Popper, we should only begin to ac-
cept evidence as corroborating when it emerges from repeated attempts to falsify
the hypothesis. Thus, additional tests must be, and typically are, devised. If the
results of any test falsify the hypothesis, this does not, of course, lead to its im-
mediate rejection. After all, we have typically arrived at our hypothesis based on
good arguments, and so researchers will typically perform what we could call a
“design check” on their experiment, looking closely at their predictions to see if
they really follow from the hypothesis, the operational definitions to see whether
they are reliable and valid with respect to the constructs they represent, and the
test itself to determine whether there are errors or confounding variables in the
data selection and analysis. If potential problems are found, they will be fixed
and the test will be repeated. Only if it fails repeatedly will researchers abandon
(or modify) the hypothesis.
The repeated testing, and especially the modification of a hypothesis is inher-
ently dangerous, as we might be so attached to our hypothesis that we will keep
testing it long after we should have given it up, or that we will try to save it
by changing it just enough that our test will no longer falsify it, or by making
it completely untestable (cf. Popper 1963: 37). This must, of course, be avoided,
but so must throwing out a hypothesis, or an entire model, on the basis of a
single falsification event. Occasionally, especially in half-mature disciplines like
linguistics, models morph into competing schools of thought, each vigorously
defended by its adherents even in the face of a growing number of phenomena
that they fail to account for. In such cases, a radical break in the research cycles
within these models may be necessary to make any headway at all – a so-called
“paradigm shift” occurs. This means that researchers abandon the current model
wholesale and start from scratch based on different initial assumptions (see Kuhn
1962). Corpus linguistics, with its explicit recognition that generalizations about
the language system can and must be deduced from language usage, may present
such a paradigm shift with respect to the intuition-driven generative models.
Finally, note that the scientific research cycle is not only incremental, with
each new hypothesis and each new test building on previous research, but that
it is also collaborative, with one researcher or group of researchers picking up
where another left off. This collaborative nature of research requires researchers
to be maximally transparent with respect to their research designs, laying open
their data and methods in sufficient detail for other researchers to understand
exactly what prediction was tested, how the constructs in question were oper-
ationalized, how data were retrieved and analyzed. Again, this is the norm in
disciplines like experimental physics and psychology, but not so much so in the
more humanities-leaning disciplines, which tend to put the focus on ideas and
arguments rather than methods. We will deal with data retrieval and annotation
in the next chapter and return to the issue of methodological transparency at the
end of it.
4 Data retrieval and annotation
Traditionally, many corpus-linguistic studies use the (orthographic) word form
as their starting point. This is at least in part due to the fact that corpora consist
of text that is represented as a sequence of word forms, and that, consequently,
word forms are easy to retrieve. As briefly discussed in Chapter 2, concordancing
software allows us to query the corpus for a string of characters and displays the
result as a list of hits in context.
As we saw when discussing the case of pavement in Chapter 3, a corpus query
for a string of characters like ⟨ pavement ⟩ may give us more than we want – it
will return not only hits corresponding to the word sense ‘pedestrian footpath’,
which we could contrast with the synonym sidewalk, but also those correspond-
ing to the word sense ‘hard surface’ (which we could contrast with the synonym
paving).
The query may, at the same time, give us less than we want, because it would
only return the singular form of the word and only if it is spelled entirely in
lower case. A study of the word (in either or both of its senses) would obviously
require that we look at the lemma PAVEMENT, comprising at least the word
forms pavement (singular), pavements (plural) and, depending on how the corpus
is prepared, pavement’s (possessive). It also requires that we include in our query
all possible graphemic variants, comprising at least cases in lower case, with
an initial capital (Pavement, Pavements, Pavement’s, e.g. at the beginning of a
sentence), or in all caps (PAVEMENT, PAVEMENTS, PAVEMENT’S), but, depending
on the corpus, also hyphenated cases occurring at a line break (e.g. pave-¶ment,
with ¶ standing for the line break).
In Chapter 3, we implicitly treated the second issue as a problem of retrieval,
noting in passing that we queried our corpus in such a way as to capture all
variants of the lemma PAVEMENT. We treated the first issue as a problem of cat-
egorization – we went through the results of our query one by one, determining
from the context which of the senses of pavement we were likely dealing with.
In the context of a research project, our decisions would be recorded together
with the data in some way – we would annotate the data, using an agreed-upon
code for each of the categories (e.g., word senses).
Retrieval is a non-trivial issue even when dealing with individual lexical items
whose orthographic representations are not ambiguous. The more complex the
phenomena under investigation are, the more complex these issues become, re-
quiring careful thought and a number of decisions concerning an almost in-
evitable trade-off between the quality of the results and the time needed to re-
trieve them. This issue will be dealt with in Section 4.1. We already saw that the
issue of data annotation is extremely complex even in the case of individual lexi-
cal items, and the preceding chapter discussed some more complicated examples.
This issue will be dealt with in more detail in Section 4.2.
4.1 Retrieval
Broadly speaking, there are two ways of searching a corpus for a particular lin-
guistic phenomenon: manually (i.e., by reading the texts contained in it, noting
down each instance of the phenomenon in question) or automatically (i.e., by
using a computer program to run a query on a machine-readable version of the
texts). As discussed in Chapter 2, there may be cases where there is no readily
apparent alternative to a fully manual search, and we will come back to such
cases below.
However, as also discussed in Chapter 2, software-aided queries are the default
in modern corpus linguistics, and so we take these as a starting point of our
discussion.
assumed to hold in the examples discussed in the previous chapter), a corpus will
contain plain text in a standard orthography and the software will be able to find
passages matching a specific string of characters. Essentially, this is something
every word processor is capable of.
Most concordancing programs can do more than this, however. For example,
they typically allow the researcher to formulate queries that match not just one
string, but a class of strings. One fairly standardized way of achieving this is by
using so-called regular expressions – strings that may contain not just simple char-
acters, but also symbols referring to classes of characters or symbols affecting the
interpretation of characters. For example, the lexeme sidewalk has (at least) eight
possible orthographic representations: sidewalk, side-walk, Sidewalk, Side-walk,
sidewalks, side-walks, Sidewalks and Side-walks (in older texts, it is sometimes
spelled as two separate words, which means that we have to add at least side
walk, side walks, Side walk and Side walks when investigating such texts). In or-
der to retrieve all occurrences of the lexeme, we could perform a separate query
for each of these strings, but I actually queried the string in (1a); a second exam-
ple of regular expressions is (1b), which represents one way of searching for all
inflected forms and spelling variants of the verb synthesize (as long as they are
in lower case):

(1) a. ⟨ [Ss]ide-?walks? ⟩
b. ⟨ synthesi[sz]e?[ds]?(ing)? ⟩

Such regular expressions are compact, but they quickly overgeneralize. The pattern in (1b) would also, for example, match some
non-existing forms, like synthesizding, and, more crucially, it will match exist-
ing forms that we may not want to include in our search results, like the noun
synthesis (see further Section 4.1.2).
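To see this overgeneralization concretely, we can test the pattern from (1b) against a handful of candidate forms, for example in Python (a minimal sketch; any regex-capable tool will do):

    import re

    # The pattern from (1b), applied to whole word forms: it matches all
    # the intended forms, but also the non-existing "synthesizding" and
    # the noun "synthesis".
    pattern = re.compile(r"synthesi[sz]e?[ds]?(?:ing)?")

    forms = ["synthesize", "synthesised", "synthesizes", "synthesizing",
             "synthesis", "synthesizding", "synthetic"]
    for form in forms:
        print(form, bool(pattern.fullmatch(form)))
        # True for everything except "synthetic"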
The benefits of being able to define complex queries become even more ob-
vious if our corpus contains annotations in addition to the original text, as dis-
cussed in Section 2.1.4 of Chapter 2. If the corpus contains part-of-speech tags, for
example, this will allow us to search (within limits) for grammatical structures.
For example, assume that there is a part-of-speech tag attached to the end of ev-
ery word by an underscore (as in the BROWN corpus, see Figure 2.4 in Chapter 2)
and that the tags are as shown in (2) (following the sometimes rather nontrans-
parent BROWN naming conventions). We could then search for prepositional
phrases using a pattern like the one in (3):
(2) preposition _IN
articles _AT
adjectives _JJ (uninflected)
_JJR (comparative)
_JJT (superlative)
nouns _NN (common singular nouns)
_NNS (common plural nouns)
_NN$ (common nouns with possessive clitic)
_NP (proper names)
_NP$ (proper nouns with possessive clitic)
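Such a pattern treats each word and its tag as part of one long string. A sketch of this kind of string-based search, here in Python and with a simplified, hypothetical pattern (the actual pattern in (3) may differ in its details):

    import re

    # A sketch of a string-based PP query over BROWN-style "word_TAG" text:
    # a preposition, an optional article, any number of adjectives, and one
    # or more common or proper nouns (hypothetical, simplified pattern).
    pp_query = re.compile(r"\S+_IN(?: \S+_AT)?(?: \S+_JJ[RT]?)*(?: \S+_N[PN][S$]?)+")

    tagged = "He_PPS sat_VBD in_IN the_AT old_JJ house_NN near_IN the_AT river_NN ._."
    for match in pp_query.finditer(tagged):
        print(match.group())
    # in_IN the_AT old_JJ house_NN
    # near_IN the_AT river_NN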
The query in (3) makes use of the annotation in the corpus (in this case, the
part-of-speech tagging), but it does so in a somewhat cumbersome way by treat-
ing word forms and the tags attached to them as strings. As shown in Figure 2.5 in
Chapter 2, corpora often contain multiple annotations for each word form – part
of speech, lemma, in some cases even grammatical structure. Some concordance
programs, such as the widely-used open-source Corpus Workbench (including
its web-based version CQPweb) (cf. Evert & Hardie 2011) or the Sketch Engine
and its open-source variant NoSketch engine (cf. Kilgarriff et al. 2014) are able
to “understand” the structure of such annotations and offer a query syntax that
allows the researcher to refer to this structure directly.
The two programs just mentioned share a query syntax called CQP (for “Cor-
pus Query Processor”) in the Corpus Workbench and CQL (for “Corpus Query
Language”) in the (No)Sketch Engine. This syntax is very powerful, allowing us
to query the corpus for tokens or sequences of tokens at any level of annotation.
It is also very transparent: each token is represented as a value-attribute pair in
square brackets, as shown in (4):
(4) [attribute="value"]
The attribute refers to the level of annotation (e.g. word, pos, lemma or whatever
else the makers of a corpus have called their annotations), the value refers to
what we are looking for. For example, a query for the different forms of the word
synthesize (cf. (1) above) would look as shown in (5a), or, if the corpus contains
information about the lemma of each word form, as shown in (5b), and the query
for PPs in (3) would look as shown in (5c):
(5) a. [word="synthesi[sz]e?[ds]?(ing)?"]
b. [lemma="synthesize"]
c. [pos="IN"] [pos="AT"]? [pos="JJ[RT]?"]* [pos="N[PN][S$]?"]+
As you can see, we can use regular expressions inside the values for the at-
tributes, and we can use the asterisk, question mark and plus outside the token
to indicate that the query should match “zero or more”, “zero or one” and “one or
more” tokens with the specified properties. Note that CQP syntax is case sensi-
tive, so for example (5a) would only return hits that are in lower case. If we want
the query to be case-insensitive, we have to attach %c to the relevant values.
We can also combine two or more attribute-value pairs inside a pair of square
brackets to search for tokens satisfying particular conditions at different levels
of annotation. For example, (6a) will find all instances of the word form walk
tagged as a verb, while (6b) will find all instances tagged as a noun. We can
also address different levels of annotation at different positions in a query. For
example, (6c) will find all instances of the word form walk followed by a word
tagged as a preposition, and (6d) corresponds to the query ⟨ through the NOUN
of POSS.PRON car ⟩ mentioned in Section 3.1.1 of Chapter 3 (note the %c that
makes all queries for words case-insensitive):

(6) a. [word="walk"%c & pos="verb"]
b. [word="walk"%c & pos="noun"]
c. [word="walk"%c] [pos="prep"]
d. [word="through"%c] [word="the"%c] [pos="noun"] [word="of"%c] [pos="poss.pron"] [word="car"%c]
This query syntax is so transparent and widely used that we will treat it as
a standard in the remainder of the book and use it to describe queries. This is
obviously useful if you are using one of the systems mentioned above, but if
not, the transparency of the syntax should allow you to translate the query into
whatever possibilities your concordancer offers you. When talking about a query
in a particular corpus, I will use the annotation (e.g., the part-of-speech tags) used
in that corpus, when talking about queries in general, I will use generic values
like noun or prep., shown in lower case to indicate that they do not correspond
to a particular corpus annotation.
Of course, even the most powerful query syntax can only work with what is
there. Retrieving instances of phrasal categories based on part-of-speech anno-
tation is only possible to a certain extent: even a complex query like that in (5c)
or in (3) will not return all prepositional phrases. These queries will not, for ex-
ample, match cases where the adjective is preceded by one or more quantifiers
(tagged _QT in the BROWN corpus), adverbs (tagged _RB), or combinations of
the two. They will also not return cases with pronouns instead of nouns. These and
other issues can be fixed by augmenting the query accordingly, although the in-
creasing complexity will bring problems of its own, to which we will return in
the next subsection.
Other problems are impossible to fix; for example, if the noun phrase inside
the PP contains another PP, the pattern will not recognize it as belonging to the
NP but will treat it as a new match and there is nothing we can do about this,
since there is no difference between the sequence of POS tags in structures like
(7a), where the PP off the kitchen is a complement of the noun room and as such
is part of the NP inside the first PP, and (7b), where the PP at a party is an adjunct
of the verb standing and as such is not part of the NP preceding it.
Generally speaking, then, a corpus query may do one or more of four things. It may:

1. include hits that are instances of the phenomenon we are looking for (these
are referred to as true positives or hits, but note that we are using the word
hit in a broader sense to mean “anything returned as a result of a query”);

2. fail to include strings that are instances of our phenomenon (these are
referred to as false negatives);

3. include hits that are not instances of our phenomenon (these are referred
to as false positives);

4. fail to include strings that are not instances of our phenomenon (this is
referred to as a true negative).
Table 4.1: Four possible outcomes of a corpus query for a given phenomenon X

                              Search result
                      Included                 Not included
Corpus    X           True positive (hit)      False negative (miss)
          ¬X          False positive           True negative
                      (false alarm)            (correct rejection)
Obviously, the first case (true positive) and the fourth case (true negative) are
desirable outcomes: we want our search results to include all instances of the
phenomenon under investigation and exclude everything that is not such an in-
stance. The second case (false negative) and the third case (false positive) are
undesirable outcomes: we do not want our query to miss instances of the phe-
nomenon in question and we do not want our search results to incorrectly include
strings that are not instances of it.
We can describe the quality of a data set that we have retrieved from a corpus in
terms of two measures. First, the proportion of positives (i.e., strings returned by
our query) that are true positives; this is referred to as precision, or as the positive
predictive value, cf. (8a). Second, the proportion of all instances of our phenome-
non that are true positives (i.e., that were actually returned by our query; this is
referred to as recall, cf. (8b):1
(8) a. Precision = True Positives / (True Positives + False Positives)
b. Recall = True Positives / (True Positives + False Negatives)

1 There are two additional measures that are important in other areas of empirical research but do not play a central role in corpus-linguistic data retrieval. First, the specificity or true negative rate – the proportion of negatives (i.e., strings that are not instances of our phenomenon) that are correctly excluded from our data; second, the negative predictive value – the proportion of negatives (i.e., cases not included in our search) that are true negatives (i.e., that are correctly rejected). These measures play a role in situations where a negative outcome of a test is relevant (for example, with medical diagnoses); in corpus linguistics, this is generally not the case. There are also various scores that combine individual measures to give us an overall idea of the accuracy of a test, for example, the F1 score, defined as the harmonic mean of precision and recall. Such scores are useful in information retrieval or machine learning, but less so in corpus-linguistic research projects, where precision and recall must typically be assessed independently of, and weighed against, each other.
Ideally, the value of both measures should be 1, i.e., our data should include all
cases of the phenomenon under investigation (a recall rate of 100 percent) and it
should include nothing that is not a case of this phenomenon (a precision of 100
percent). However, unless we carefully search our corpus manually (a possibility
I will return to below), there is typically a trade-off between the two. Either we
devise a query that matches only clear cases of the phenomenon we are interested
in (high precision) but that will miss many less clear cases (low recall). Or we
devise a query that matches as many potential cases as possible (high recall),
but that will include many cases that are not instances of our phenomenon (low
precision).
Let us look at a specific example, the English ditransitive construction, and
let us assume that we have an untagged and unparsed corpus. How could we re-
trieve instances of the ditransitive? As the first object of a ditransitive is usually
a pronoun (in the objective case) and the second a lexical NP (see, for example,
Thompson & Koide 1987), one possibility would be to search for a pronoun fol-
lowed by a determiner (i.e., for any member of the set of strings in (9a)), followed
by any member of the set of strings in (9b)). This gives us the query in (9c), which
is long, but not very complex:
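A sketch of this kind of query, here in Python over plain text and with abbreviated, hypothetical word lists (the actual query in (9c) spells out the full sets of pronouns and determiners):

    import re

    # A sketch of the pronoun-determiner query for the ditransitive
    # (abbreviated, hypothetical word lists; (9c) lists the full sets).
    pronouns = "me|you|him|her|it|us|them"
    determiners = "the|a|an|this|that|these|those|my|your|his|its|our|their"
    query = re.compile(rf"\b(?:{pronouns}) (?:{determiners})\b", re.IGNORECASE)

    print(query.search("She sent him a long letter."))   # matches "him a"
    print(query.search("I still ring her a lot."))       # matches "her a" (a false positive)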
Let us apply this query (which is actually used in Colleman & De Clerck 2011)
to a freely available sample from the British Component of the International
Corpus of English mentioned in Chapter 2 above (see the Study Notes to the
current chapter for details). This corpus has been manually annotated, amongst
other things, for argument structure, so that we can check the results of our query
against this annotation.
There are 36 ditransitive clauses in the sample, thirteen of which are returned
by our query. There are also 2838 clauses that are not ditransitive, 14 of which
are also returned by our query. Table 4.2 shows the results of the query in terms
of true and false positives and negatives.
Table 4.2: The results of the query in (9c)

                                  Search result
                          Included               Not included              Total
Corpus   Ditransitive     13 (true positives)    23 (false negatives)         36
         ¬ Ditransitive   14 (false positives)   2824 (true negatives)      2838
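From the counts in Table 4.2, precision and recall follow directly from the definitions in (8); a minimal sketch of the calculation:

    # Precision and recall for the ditransitive query, based on Table 4.2.
    true_positives = 13
    false_positives = 14
    false_negatives = 23

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    print(f"precision = {precision:.3f}")   # 0.481
    print(f"recall    = {recall:.3f}")      # 0.361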
(11) a. ... one of the experiences that went towards making me a Christian...
b. I still ring her a lot.
c. I told her that I’d had to take these tablets
d. It seems to me that they they tend to come from
e. Do you need your caffeine fix before you this
could never be excluded, since they are identical to the ditransitive as far as the
sequence of parts-of-speech is concerned.
Of course, it is relatively trivial, in principle, to increase the precision of our
search results: we can manually discard all false positives, which would increase
precision to the maximum value of 1. Typically, our data will have to be manually
annotated for various criteria anyway, allowing us to discard false positives in
the process. However, the larger our data set, the more time consuming this will
become, so precision should always be a consideration even at the stage of data
retrieval.
Let us now look at the reasons for the recall rate, which is even worse than
the precision. There are, roughly speaking, four types of ditransitive structures
that our query misses, exemplified in (12a–e):
The first group of cases are those where the second object does not appear
in its canonical position, for example in interrogatives and other cases of left-
dislocation (cf. 12a), or passives (12b). The second group of cases are those where
word order is canonical, but either the first object (12c) or the second object (12d)
or both (12e) do not correspond to the query.
Note that, unlike precision, the recall rate of a query cannot be increased after
the data have been extracted from the corpus. Thus, an important aspect in con-
structing a query is to annotate a random sample of our corpus manual for the
phenomenon we are interested in, and then to check our query against this man-
ual annotation. This will not only tell us how good or bad the recall of our query
is, it will also provide information about the most frequent cases we are missing.
Once we know this, we can try to revise our query to take these cases into ac-
count. In a POS-tagged corpus, we could, for example, search for a sequence of
a pronoun and a noun in addition to the sequence pronoun-determiner that we
used above, which would give us cases like (12d), or we could search for forms of
be followed by a past participle followed by a determiner or noun, which would
give us passives like those in (12b).
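In CQP syntax, these two additional patterns could be sketched as follows (with generic tag values that would have to be adapted to the tag set of the corpus at hand, and deliberately simplified):

[pos="pron"] [pos="noun"]
[lemma="be"] [pos="pastpart"] [pos="det|noun"]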
In some cases, however, there may not be any additional patterns that we can
reasonably search for. In the present example with an untagged corpus, for ex-
ample, there is no additional pattern that seems in any way promising. In such
cases, we have two options for dealing with low recall: First, we can check (in our
manually annotated subcorpus) whether the data that were recalled differ from
the data that were not recalled in any way that is relevant for our research ques-
tion. If this is not the case, we might decide to continue working with a low recall
and hope that our results are still generalizable – Colleman & De Clerck (2011),
for example, are mainly interested in the question of which classes of verbs were
used ditransitively at what time in the history of English, a question that they
were able to discuss insightfully based on the subset of ditransitives matching
their query.
If our data do differ along one or more of the dimensions relevant to our re-
search project, we might have to increase the recall at the expense of precision
and spend more time weeding out false positives. In the most extreme case, this
might entail extracting the data manually, so let us return to this possibility in
light of the current example.
The first problem is that while the expressions in (13a–c) may refer to feelings
of anger or rage, they can also occur in their literal meaning, as the corresponding
authentic examples in (14a–c) show:
(14) a. “Now, after I am burned up,” he said, snatching my wrist, “and the fire
is out, you must scatter the ashes. ...” (Anne Rice, The Vampire Lestat)
b. As soon as the driver saw the train which had been hidden by the curve,
he let off steam and checked the engine... (Galignani, Accident on the
Paris and Orleans Railway)
c. Heat water in saucepan on highest setting until you reach the boiling
point and it starts to boil gently. (www.sugarfreestevia.net)
Obviously, there is no query that would find the examples in (13) but not those
in (14). In contrast, it is very easy for a human to recognize the examples in (14)
as literal. If we are explicitly interested in metaphors involving liquids and/or
heat, we could choose a semi-manual approach, first extracting all instances of
words from the field of liquids and/or heat and then discarding all cases that are
not metaphorical. This kind of approach is used quite fruitfully, for example, by
Deignan (2005), amongst others.
If we are interested in metaphors of anger in general, however, this approach
will not work, since we have no way of knowing beforehand which semantic
fields to include in our query. This is precisely the situation where exhaustive
retrieval can only be achieved by a manual corpus search, i.e., by reading the
entire corpus and deciding for each word, phrase or clause, whether it constitutes
an example of the phenomenon we are looking for. Thus, it is not surprising that
many corpus-linguistic studies on metaphor are based on manual searches (see,
for example, Semino & Masci (1996) or Jäkel (1997) for very thorough early studies
of this kind).
However, as mentioned in Chapter 2, manual searches are very time-consum-
ing and this limits their practical applicability: either we search large corpora, in
which case manual searching is going to take more time and human resources
than are realistically available, or we perform the search in a realistic time-frame
and with the human resources realistically available, in which case we have to
limit the size of our corpus so severely that the search results can no longer be
considered representative of the language as a whole. Thus, manual searches
are useful mainly in the context of research projects looking at a linguistic phe-
nomenon in some clearly defined subtype of language (for example, metaphor in
political speeches, see Charteris-Black 2005).
When searching corpora for such hard-to-retrieve phenomena, it may some-
times be possible to limit the analysis usefully to a subset of the available data, as
shown in the previous subsection, where limiting the query for the ditransitive to
active declarative clauses with canonical word order still yielded potentially use-
ful results. It depends on the phenomenon and the imagination of the researcher
to find such easier-to-retrieve subsets.
To take up the example of metaphors introduced above, consider the examples
in (15), which are quite close in meaning to the corresponding examples in (13a–c)
above (also from Lakoff & Kövecses 1987: 189, 203):

(15) a. He was consumed by anger.
b. He was filled with anger.
c. She was brimming with rage.
In these cases, the PPs by/with anger/rage make it clear that consume, (be) filled
and brimming are not used literally. If we limit ourselves just to metaphorical
expressions of this type, i.e. expressions that explicitly mention both semantic
fields involved in the metaphorical expression, it becomes possible to retrieve
metaphors of anger semi-automatically. We could construct a query that would
retrieve all instances of the lemmas ANGER, RAGE, FURY, and other synonyms
of anger, and then select those results that also contain (within the same clause
or within a window of a given number of words) vocabulary from domains like
‘liquids’, ‘heat’, ‘containers’, etc. This can be done manually by going through the
concordance line by line (see, e.g., Tissari (2003) and Stefanowitsch (2004; 2006c),
cf. also Section 11.2.2 of Chapter 11), or automatically by running a second query
on the results of the first (or by running a complex query for words from both
semantic fields at the same time, see Martin 2006). The first approach is more
useful if we are interested in metaphors involving any semantic domain in addi-
tion to ‘anger’, the second approach is more useful (because more economical) in
cases where we are interested in metaphors involving specific semantic domains.
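The second, filtering pass can be sketched as a simple script, here in Python with hypothetical and highly incomplete word lists for the two semantic fields:

    import re

    # First query: hits for ANGER-field lemmas; second pass: keep only lines
    # that also contain source-domain vocabulary (both word lists are
    # hypothetical and far from complete).
    anger = re.compile(r"\b(?:anger|rage|fury|wrath)\b", re.IGNORECASE)
    sources = re.compile(r"\b(?:boil|simmer|seethe|brim|fill|consume|burn)\w*",
                         re.IGNORECASE)

    concordance = [
        "He was brimming with rage after the meeting.",
        "Her anger surprised everyone.",
        "She was consumed by anger.",
    ]
    metaphor_candidates = [line for line in concordance
                           if anger.search(line) and sources.search(line)]
    print(metaphor_candidates)   # the first and the third line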
Limiting the focus to a subset of cases sharing a particular formal feature is
a feasible strategy in other areas of linguistics, too. For example, Heyd (2016)
wants to investigate “narratives of belonging” – roughly, stretches of discourse
in which members of a diaspora community talk about shared life experiences for
the purpose of affirming their community membership. At first glance, this is the
kind of potentially fuzzy concept that should give corpus linguists nightmares,
even after Heyd (2016: 292) operationalizes it in terms of four relatively narrow
criteria that the content of a stretch of discourse must fulfill in order to count
as an example. Briefly, it must refer to experiences of the speaker themselves, it
must mention actual specific events, it must contain language referring to some
aspect of migration, and it must contain an evaluation of the events narrated.
Obviously it is impossible to search a corpus based on these criteria. Therefore,
Heyd chooses a two-step strategy (Heyd 2016: 294): first, she queries her corpus
for the strings born in, moved to and grew up in, which are very basic, presumably
wide-spread ways of mentioning central aspects of one’s personal migration bi-
ography, and second, she assesses the stretches of discourse within which these
strings occur on the basis of her criteria, discarding those that do not fulfill all
four of them (this step is somewhere between retrieval and annotation).
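The first, automatic step of this two-step strategy is easy to sketch (the second step, applying the four content criteria, remains manual):

    import re

    # Step 1 of the two-step strategy: retrieve candidate passages via the
    # three trigger strings; step 2 (checking the four content criteria)
    # is done manually on the results.
    triggers = re.compile(r"\b(?:born in|moved to|grew up in)\b", re.IGNORECASE)

    posts = [
        "I was born in Lagos and moved to London as a teenager.",
        "Great recipe, thanks for sharing!",
    ]
    candidates = [post for post in posts if triggers.search(post)]
    print(candidates)   # only the first post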
As in the example of the ditransitive construction discussed above, retrieval
strategies like those used by Stefanowitsch (2006c) and Heyd (2016) are useful
where we can plausibly argue – or better yet, show – that the results are compa-
rable to the results we would get if we extracted the phenomenon completely.
In cases where the phenomenon in question does not have any consistent for-
mal features that would allow us to construct a query, and cannot plausibly be
restricted to a subset that does have such features, a mixed strategy of elicita-
tion and corpus query may be possible. For example, Levin (2014) is interested in
what he calls the “Bathroom Formula”, which he defines as “clauses and phrases
expressing speakers’ need to leave any ongoing activity in order to go to the bath-
room” (Levin 2014: 2), i.e. to the toilet (sorry to offend American sensitivities2 ).
This speech act is realized by phrases as diverse as (16a–c):
There is no way to search for these expressions (and others with the same
function) unless you are willing to read through the entire BNC – or unless you
already know what to look for. Levin (2014) chooses a strategy based on the
latter: he first assembles a list of expressions from the research literature on eu-
phemisms and complements this by asking five native speakers for additional ex-
amples. He then searches for these phrases and analyzes their distribution across
varieties and demographic variables like gender and class/social stratum.
2 See Manning & Melchiori (1974), who show that the word toilet is very upsetting even to American college students.
Of course, this query will miss any expressions that were not part of his
initial list, but the conditional distribution of those expressions that are included
may still yield interesting results – we can still learn something about which of
these expressions are preferred in a particular variety, by a particular group of
speakers, in a particular situation, etc.
If we assemble our initial list of expressions systematically, perhaps from a
larger number of native speakers that are representative of the speech commu-
nity in question in terms of regional origin, sex, age group, educational back-
ground, etc., we should end up with a representative sample of expressions to
base our query on. If we make our query flexible enough, we will likely even cap-
ture additional variants of these expressions. If other strategies are not available,
this is certainly a feasible approach. Of course, this approach only works with
relatively routinized speech event categories like the Bathroom Formula – greet-
ings and farewells, asking for the time, proposing marriage, etc. – which, while
they do not have any invariable formal features, do not vary infinitely either.
To sum up, it depends on the phenomenon under investigation and on the
research question whether we can take an automatic or at least a semi-automatic
approach or whether we have to resort to manual data extraction. Obviously, the
more completely we can extract our object of research from the corpus, the better.
4.2 Annotating
Once the data have been extracted from the corpus (and, if necessary, false pos-
itives have been removed), they typically have to be annotated in terms of the
variables relevant for the research question. In some cases, the variables and their
values will be provided externally; they may, for example, follow from the struc-
ture of the corpus itself, as in the case of british english vs. american english
defined as “occurring in the LOB corpus” and “occurring in the BROWN corpus”
respectively. In other cases, the variables and their values may have been oper-
ationalized in terms of criteria that can be applied objectively (as in the case of
Length defined as “number of letters”). In most cases, however, some degree of
interpretation will be involved (as in the case of Animacy or the metaphors dis-
cussed above). Whatever the case, we need an annotation scheme – an explicit
statement of the operational definitions applied. Of course, such an annotation
scheme is especially important in cases where interpretative judgments are in-
volved in categorizing the data. In this case, the annotation scheme should con-
tain not just operational definitions, but also explicit guidelines as to how these
definitions should be applied to the corpus data. These guidelines must be explicit
specifying the labels by which these categories are to be represented. For exam-
ple, the distinctions between different degrees of Animacy need to be defined
in a way that allows us to identify them in corpus data (this is the annotation
scheme, cf. below), and the scheme needs to specify names for these categories
(for example, the category containing animate entities could be labelled by the
codes animate, anim, #01, cat:8472, etc. – as long as we know what the label
stands for, we can choose it arbitrarily).
In order to keep different research projects in a particular area comparable, it is
desirable to create annotation and coding schemes independently of a particular
research project. However, the field of corpus linguistics is not yet well-established
and methodologically mature enough to have yielded uncontroversial and
widely applicable annotation schemes for most linguistic phenomena. There are
some exceptions, such as the part-of-speech tag sets and the parsing schemes
used by various wide-spread automatic taggers and parsers, which have become
de facto standards by virtue of being easily applied to new data; there are also
some substantial attempts to create annotation schemes for the manual anno-
tation of phenomena like topicality (cf. Givón 1983), animacy (cf. Zaenen et al.
2004), and the grammatical description of English sentences (e.g. Sampson 1995).
Whenever it is feasible, we should use existing annotation schemes instead of
creating our own – searching the literature for such schemes should be a routine
step in the planning of a research project. Often, however, such a search will come
up empty, or existing annotation schemes will not be suitable for the specific data
we plan to use or they may be incompatible with our theoretical assumptions. In
these cases, we have to create our own annotation schemes.
The first step in creating an annotation scheme for a particular variable con-
sists in deciding on a set of values that this variable may take. As the example of
Animacy in Chapter 3 shows, this decision is loosely constrained by our general
operational definition, but the ultimate decision is up to us and must be justified
within the context of our theoretical assumptions and our specific research ques-
tion.
There are, in addition, several general criteria that the set of values for any
variable must meet. First, they must be non-overlapping. This may seem obvious,
but it is not at all unusual, for example, to find continuous dimensions split up
into overlapping categories, as in the following quotation:
Here, the authors obviously summarized the ages of their subjects into the fol-
lowing four classes: (I) 25–35, (II) 35–45, (III) 45–55 and (IV) 55–65: thus, subjects
aged 35 could be assigned to class I or class II, subjects aged 45 to class II or class
III, and subjects aged 55 to class III or class IV. This must be avoided, as different
annotators might make different decisions, and as other researchers attempting
to replicate the research will not know how we categorized such cases.
Second, the variable should be defined such that it does not conflate proper-
ties that are potentially independent of each other, as this will lead to a set of
values that do not fall along a single dimension. As an example, consider the
so-called Silverstein Hierarchy used to categorize nouns for (inherent) Topicality
(after Deane 1987: 67):
Note, first, that there is a lot of overlap in this annotation scheme. For example,
a first or second person pronoun will always refer to a human or animate NP
and a third person pronoun will frequently do so, as will a proper name or a kin
term. Similarly, a container is a concrete object and can also be a location, and
everything above the category “Perceivable” is also perceivable. This overlap can
only be dealt with by an instruction of the kind that every nominal expression
should be put into the topmost applicable category; in other words, we need to
add an “except for expressions that also fit into one of the categories above” to
every category label.
Secondly, although the Silverstein Hierarchy may superficially give the im-
pression of providing values of a single variable that could be called Topicality,
it is actually a mixture of several quite different variables and their possible val-
ues. One attempt at disentangling these variables and giving them each a set of
plausible values is the following:
Given this set of variables, it is possible to describe all categories of the Silver-
stein Hierarchy as a combination of values of these variables, for example:
There are two advantages of this more complex annotation scheme. First, it
allows a more principled categorization of individual expressions: the variables
and their values are easier to define and there are fewer unclear cases. Second,
it would allow us to determine empirically which of the variables are actually
relevant in the context of a given research question, as irrelevant variables will
not show a skew in their distribution across different conditions. Originally, the
Silverstein Hierarchy was meant to allow for a principled description of split
ergative systems; it is possible that the specific conflation of variables is suitable
for this task. However, it is an open question whether the same conflation of vari-
ables is also suitable for the analysis of other phenomena. If we were to apply
it as is, we would not be able to tell whether this is the case. Thus, we should
always define our variables in terms of a single dimension and deal with com-
plex concepts (like Topicality) by analyzing the data in terms of a set of such
variables.
After defining a variable (or set of variables) and deciding on the type and
number of values, the second step in creating an annotation scheme consists in
defining what belongs into each category. Where necessary, this should be done
in the form of a decision procedure.
For example, the annotation scheme for Animacy mentioned in Chapter 3
(Garretson 2004; Zaenen et al. 2004) has the categories human and organiza-
tion (among others). The category human is relatively self-explanatory, as we
tend to have a good intuition about what constitutes a human. Nevertheless, the
annotation scheme spells out that it does not matter by what linguistic means hu-
mans are referred to (e.g., proper names, common nouns including kinship terms,
and pronouns) and that dead, fictional or potential future humans are included
as well as “humanoid entities like gods, elves, ghosts, and androids”.
The category organization is much more complex to apply consistently,
since there is no intuitively accessible and generally accepted understanding of
what constitutes an organization. In particular, it needs to be specified what dis-
tinguishes an organization from other groups of human beings (that are to
be categorized as human according to the annotation scheme). The annotation
scheme defines an organization as a referent involving “more than one human”
with “some degree of group identity”. It then provides the following hierarchy of
properties that a group of humans may have (where each property implies the
presence of all properties below its position in the hierarchy):
(21) +/− chartered/official
+/− temporally stable
+/− collective voice/purpose
+/− collective action
+/− collective
It then states that “any group of humans at + collective voice or higher” should
be categorized as organization, while those below should simply be annotated
as human. By listing properties that a group must have to count as an organiza-
tion in the sense of the annotation scheme, the decision is simplified considerably,
and by providing a decision procedure, the number of unclear cases is reduced.
The annotation scheme also illustrates the use of the hierarchy:
Thus, while “the posse” would be an org, “the mob” might not be, depending
on whether we see the mob as having a collective purpose. “The crowd”
would not be considered org, but rather simply human.
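Stated as a decision procedure, the hierarchy in (21) can even be made mechanical; a minimal sketch, with a made-up representation of the properties (cf. Garretson 2004 for the actual scheme):

    # A sketch of the decision procedure in (21): a group of humans is coded
    # as ORG if it reaches "collective voice/purpose" or any higher property
    # (made-up representation; each property implies all those below it).
    HIERARCHY = ["collective", "collective action", "collective voice/purpose",
                 "temporally stable", "chartered/official"]
    THRESHOLD = HIERARCHY.index("collective voice/purpose")

    def code_group(highest_property):
        """Code a group of humans by the highest property it reaches."""
        return "org" if HIERARCHY.index(highest_property) >= THRESHOLD else "human"

    print(code_group("collective voice/purpose"))   # org   (e.g. "the posse")
    print(code_group("collective"))                 # human (e.g. "the crowd")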
not been assigned although it should have been, and a true negative would be a
case where the value has not been assigned and should not have been assigned.
This assumes, however, that we can determine with a high degree of certainty
what the correct value would be in each case. The examples discussed in this
chapter show, however, that this decision itself often involves a certain degree
of interpretation – even an explicit and detailed annotation scheme has to be ap-
plied by individuals based on their understanding of the instructions contained
in it and the data to which they are to be applied. Thus, a certain degree of sub-
jectivity cannot be avoided, but we need to minimize the subjective aspect of
interpretation as much as possible.
The most obvious way of doing this is to have (at least) two different annota-
tors apply the annotation scheme to the data – if our measurements cannot be
made objective (and, as should be clear by now, they rarely can in linguistics),
this will at least allow us to ensure that they are intersubjectively reliable.
One approach would be to have the entire data set annotated by two annota-
tors independently on the basis of the same annotation scheme. We could then
identify all cases in which the two annotators did not assign the same value
and determine where the disagreement came from. Obvious possibilities include
cases that are not covered by the annotation scheme at all, cases where the def-
initions in the annotation scheme are too vague to apply or too ambiguous to
make a principled decision, and cases where one of the annotators has misun-
derstood the corpus example or made a mistake due to inattention. Where the
annotation scheme is to blame, it could be revised accordingly and re-applied to
all unclear cases. Where an annotator is at fault, they could correct their anno-
tation decision. At the end of this process we would have a carefully annotated
data set with no (or very few) unclear cases left.
However, in practice there are two problems with this procedure. First, it is
extremely time-consuming, which will often make it difficult to impossible to
find a second annotator. Second, discussing all unclear cases but not the appar-
ently clear cases holds the danger that the former will be annotated according to
different criteria than the latter.
Both problems can be solved (or at least alleviated) by testing the annotation
scheme on a smaller dataset using two annotators and calculating its reliability
across annotators. If this so-called interrater reliability is sufficiently high, the
annotation scheme can safely be applied to the actual data set by a single anno-
tator. If not, it needs to be made more explicit and applied to a new set of test data
by two annotators; this process must be repeated until the interrater reliability
is satisfactory.
(22) κ = (p_o − p_e) / (1 − p_e)
In this formula, p_o is the relative observed agreement between the raters (i.e.
the percentage of cases where both raters have assigned the same category) and
p_e is the relative expected agreement (i.e. the percentage of cases where they
should have agreed by chance).
Table 4.3 shows a situation where the two raters assign one of the two cate-
gories x or y. Here, p_o would be the sum of n(x, x) and n(y, y), divided by the
sum of all annotations; p_e can be calculated in various ways, and a straightforward
one will be introduced below.
Table 4.3: A contingency table for two raters and two categories

                               Rater 2
                        category x    category y
Rater 1    category x   n(x, x)       n(x, y)
           category y   n(y, x)       n(y, y)
Let us assume that we want to investigate the factors determining the choice
between these two constructions (as we will do in Chapters 5 and 6). In order
to do so, we need to identify the subset of constructions with of that actually
4 For more than two raters, there is a more general version of this metric, referred to as Fleiss’ κ (Fleiss 1971), but as it is typically difficult even to find a second annotator, we will stick with the simpler measure here.
Table 4.4: Agreement of two raters on a sample of 30 cases

                         Rater 2
                    poss    other    Total
Rater 1    poss      18       2        20
           other      1       9        10
           Total     19      11        30
We can now calculate the interrater reliability for the data in Table 4.4 using
the formula in (22) above:

κ = (0.9000 − 0.5444) / (1 − 0.5444) = 0.7805
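The same calculation as a minimal Python sketch, using the counts from Table 4.4:

    # Cohen's kappa for two raters and two categories, using the counts
    # from Table 4.4 (rows: Rater 1, columns: Rater 2).
    table = [[18, 2],    # Rater 1: poss  (Rater 2: poss, other)
             [1,  9]]    # Rater 1: other

    n = sum(sum(row) for row in table)
    p_observed = (table[0][0] + table[1][1]) / n

    # expected agreement from the marginal totals of each rater
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    p_expected = sum((row_totals[i] / n) * (col_totals[i] / n) for i in range(2))

    kappa = (p_observed - p_expected) / (1 - p_expected)
    print(round(kappa, 4))   # 0.7805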
There are various suggestions as to what value of κ is to be taken as satisfac-
tory. One reasonable suggestion (following McHugh 2012) is shown in Table 4.6;
according to this table, our annotation scheme achieves “moderate”, almost
“strong”, agreement between raters, and is hence presumably good enough to use in a cor-
pus-linguistic research study (what is “good enough” obviously depends on the
risk posed by cases where there is no agreement in classification).
Table 4.6: Interpretation of κ values

κ            Level of agreement
0–0.20       None
0.21–0.39    Minimal
0.40–0.59    Weak
0.60–0.79    Moderate
0.80–0.90    Strong
> 0.90       Almost Perfect
4.2.4 Reproducibility
Scientific research is collaborative and incremental in nature, with researchers
building on and extending each other’s work. As discussed in the previous chap-
ter in Section 3.3, this requires that we be transparent with respect to our data
and methods to an extent that allows other researchers to reproduce our results.
This is referred to as reproducibility and/or (with a different focus) replicability
– since there is some variation in how these terms are used, let us briefly switch
to more specific, non-conventional terminology.
The minimal requirement of an incremental and collaborative research cycle
is what we might call retraceability: our description of the research design and
the associated procedures must be explicit and detailed enough for another re-
searcher, when provided with all our research materials (i.e., the corpora, the raw
extracted data and the annotated data), all other resources used (such as our
annotation scheme and the software used in the extraction and statistical analysis
of the data), and all our research notes, intermediate calculations, etc., to retrace
and check for correctness each step of our analysis. In other words,
our research project must be documented in sufficient detail for others to make
sure that we arrived at our results via the procedures that we claim to have used,
and to identify possible problems in our data and/or procedure. Thus retraceabil-
ity is closely related to the idea of accountability in accounting.
Going slightly beyond retraceability, we can formulate a requirement which we might call reconstructibility: given all materials and resources, the description of our procedure must be detailed enough that a researcher independently applying this procedure to the same data using the same resources, but without access to our research notes, intermediate results, etc., will arrive at the same result.
Both of the requirements just described fall within the range of what is re-
ferred to as reproducibility. Obviously, as long as our data and other resources
(and, if relevant, our research notes and intermediate results) are available, repro-
ducibility is largely a matter of providing a sufficiently explicit and fine-grained
description of the steps by which we arrived at our results. However, in actual
practice it is surprisingly difficult to achieve reproducibility even if our research
design does not involve any manual data extraction or annotation (see Stefanow-
itsch (2018) for an attempt). As soon as our design includes steps that involve
manual extraction or annotation, it may still be retraceable, but it will no longer
be reconstructible; however, if we ensure that our extraction methods and anno-
tation scheme(s) have a high interrater reliability, an attempt at reconstruction
should at least lead to very similar results.
Matters become even more difficult if our data or other resources are not acces-
sible, for example, if we use a corpus or software constructed specifically for our
research project that we cannot share publicly due to copyright restrictions, or
if our corpus contains sensitive information such that sharing it would endanger
individuals, violate non-disclosure agreements, constitute high treason, etc.
The second option is more typically chosen in the case of annotations added in
the context of a specific research project (especially if they are added manually):
the data are extracted, stored in a separate file, and then annotated. Frequently,
spreadsheet applications are used to store the corpus data and annotation deci-
sions, as in the example in Figure 4.1, where possessive pronouns and nouns are
annotated for Nominal Type, Animacy and Concreteness:
Figure 4.1: Corpus data and annotation decisions stored in a spreadsheet

  Example              Source File  Word   Nom. Type    Animacy
  Jess's horse         N 12         Jess   proper name  human
  its stall            N 12         it     pronoun      animate
  Diane's information  N 12         Diane  proper name  human

Summary counts derived from the annotated data:

  Noun Type:  pronoun 1   proper name 2   noun 0
  Animacy:    human 2     animate 1       inanimate 0
From a corpus annotated in this way, we can always create a raw data list like
that in Figure 4.1 by searching for possessives and then separating the hits into
the word itself, the Part-of-Speech label and the Animacy annotation (this can
be done manually, or with the help of regular expressions in a text editor or with
a few lines of code in a scripting language like Perl or Python).
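As a minimal sketch of the scripting approach, assume (purely for illustration – the actual pattern would have to be adapted to the annotation format of the corpus at hand) that the annotations are stored inline in an XML-like format:

import re

# a hypothetical inline annotation format, invented for illustration
text = ('This is <w pos="PN" anim="human">Jess</w>\'s horse and '
        '<w pos="PRON" anim="animate">its</w> stall.')

# separate each hit into the word itself, the part-of-speech label
# and the animacy annotation
pattern = re.compile(r'<w pos="([^"]+)" anim="([^"]+)">([^<]+)</w>')
for pos, anim, word in pattern.findall(text):
    print(word, pos, anim, sep="\t")

Each line of output corresponds to one row of a raw data table like that in Figure 4.1.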
The advantage would be that we, or other researchers, could also use our an-
notated data for research projects concerned with completely different research
questions. Thus, if we are dealing with a variable that is likely to be of general
interest, we should consider the possibility of annotating the corpus itself, in-
stead of first extracting the relevant data to a raw data table and annotating
them afterwards. While the direct annotation of corpus files is rare in corpus
linguistics, it has become the preferred strategy in various fields concerned with
qualitative analysis of textual data. There are open-source and commercial soft-
ware packages dedicated to this task. They typically allow the user to define a
set of annotation categories with appropriate codes, import a text file, and then
assign the codes to a word or larger textual unit by selecting it with the mouse
and then clicking a button for the appropriate code that is then added (often
in XML format) to the imported text. This strategy has the additional advantage
that one can view one’s annotated examples in their original context (which may
be necessary when annotating additional variables later). However, the available
software packages are geared towards the analysis of individual texts and do not
allow the user to work comfortably with large corpora.
5 Quantifying research questions
Recall, once again, that at the end of Chapter 2, we defined corpus linguistics as
the investigation of linguistic research questions that have been framed in
terms of the conditional distribution of linguistic phenomena in a linguistic
corpus.
We discussed the fact that this definition covers cases of hypotheses phrased
in absolute terms, i.e. cases where the distribution of a phenomenon across dif-
ferent conditions is a matter of all or nothing (as in “All speakers of American
English refer to the front window of a car as windshield; all speakers of British
English refer to it as windscreen”) as well as cases where the distribution is a
matter of more-or-less (as in “British English speakers prefer the word railway
over railroad when referring to train tracks; American English speakers prefer
railroad over railway” or “More British speakers refer to networks of train tracks
as railway instead of railroad; more American English speakers refer to them as
railroad instead of railway”).
In the case of hypotheses stated in terms of more-or-less, predictions must be stated in quantitative terms, which in turn means that our data have to be quantified in some way so that we can compare them to our predictions. In this chapter,
we will discuss in more detail how this is done when dealing with different types
of data.
Specifically, we will discuss three types of data (or levels of measurement) that we might encounter in the process of quantifying the (annotated) results of a corpus query (Section 5.1): nominal data (discussed in more detail in Section 5.2), ordinal (or rank) data (discussed in more detail in Section 5.3), and cardinal data (discussed in more detail in Section 5.4). These discussions, summarized in Section 5.5, will lay the groundwork for the introduction to statistical hypothesis testing presented in the next chapter.
(b) Animacy Since animate referents tend to be more topical than inanimate ones and more topical elements tend to precede less topical ones, if the modifier is animate, the s-possessive will be preferred; if it is inanimate, the construction with of will be preferred (cf. Quirk et al. 1972: 192–203; Deane 1987):
(c) Length Since short constituents generally precede long constituents, if the modifier is short, the s-possessive will be preferred; if it is long, the construction with of will be preferred (Altenberg 1980):
In the case of all three factors, we are dealing with hypotheses concerning preferences rather than absolute differences. None of the examples with question marks are ungrammatical and all of them could conceivably occur; they just sound a little bit odd. Thus, the predictions we can derive from each hypothesis must be stated and tested in terms of relative rather than absolute differences – they all involve predictions stated in terms of more-or-less rather than all-or-nothing. Relative quantitative differences are expressed and dealt with in different ways depending on the type of data they involve.
on the basis of any intrinsic property of German and French. They are simply two
different manifestations of the phenomenon Language, part of an unordered set
including all human languages.
That we cannot rank them based on intrinsic criteria does not mean that we
cannot rank them at all. For example, we could rank them by number of speak-
ers worldwide (in which case, as the numbers cited above show, German ranks
above French). We could also rank them by the number of countries in which
they are an official language (in which case French, which has official status in
29 countries, ranks above German, with an official status in only 6 countries). But
the number of native speakers or the number of countries where a language has
an official status is not an intrinsic property of that language – German would
still be German if its number of speakers was reduced by half by an asteroid
strike, and French would still be French if it lost its official status in all 29 countries. In other words, we are not really ranking french and german as values of
Language at all; instead, we are ranking values of the variables Size of Native
Speech Community and Number of Countries with Official Language X
respectively.
We also cannot calculate mean values (“averages”) between the values of nom-
inal variables. We cannot claim, for example, that Javanese is the mean of Ger-
man and French because the number of Javanese native speakers falls (roughly)
halfway between that of German and French native speakers. Again, what we
would be calculating a mean of is the values of the variable Size of Native
Speech Community, and while it makes a sort of sense to say that the mean
of the values number of french native speakers and number of german na-
tive speakers was 83.5 in 2005, it does not make sense to refer to this mean as
number of javanese speakers.
With respect to the three hypotheses concerning the distribution of the s-
possessive and the of -possessive, it is obvious that they all involve at least one
nominal variable – the constructions themselves. These are essentially values of
a variable we could call Type of Possessive Construction. We could catego-
rize all grammatical expressions of possession in a corpus in terms of the values
s-possessive and of-possessive, count them and express the result in terms of
absolute or relative frequencies. For example, the s-possessive occurs 22 193 times
in the BROWN corpus (excluding proper names and instances of the double s-
possessive), and the of-possessive occurs 17 800 times.¹
¹ This is an estimate; it would take too long to go through all 36 406 occurrences of of and
identify those that occur in the structure relevant here, so I categorized a random subsample
of 500 hits of of and generalized the proportion of of -possessives vs. other uses of of to the
total number of hits for of.
As with the example of the variable Native Language above, we can rank the
constructions (i.e., the values of the variable Type of Possessive Construction)
in terms of their frequency (the s-possessive is more frequent), but again we are
not ranking these values based on an intrinsic criterion but on an extrinsic one:
their corpus frequency in one particular corpus. We can also calculate their mean
frequency (19 996.50), but again, this is not a mean of the two constructions, but
of their frequencies in one particular corpus.
than a BA, the three degrees also differ in terms of specialization (from a rela-
tively broad BA to a very narrow PhD), and the PhD degree differs from the two
other degrees qualitatively: a BA and an MA primarily show that one has acquired knowledge and (more or less practical) skills, but a PhD primarily shows that one has acquired research skills.
With respect to the three hypotheses concerning the distribution of the s-pos-
sessive and the of -possessive, clearly Animacy is an ordinal variable, at least if
we think of it in terms of a scale, as we did in Chapter 3, Section 3.2. Recall that
a simple animacy scale might look like this:

animate > inanimate > abstract

On this scale, animate ranks higher than inanimate, which ranks higher than abstract in terms of the property we are calling Animacy, and this ranking is
determined by the scale itself, not by any extrinsic criteria.
This means that we could categorize and rank all nouns in a corpus according
to their animacy. But again, we cannot calculate a mean. If we have 50 human
nouns and 50 abstract nouns, we cannot say that we have 100 nouns with a
mean value of inanimate. Again, this is because we have no way of knowing
whether, in terms of animacy, the difference between animate and inanimate
is the same quantitatively as that between inanimate and abstract, but also,
because we are, again, dealing with qualitative as well as quantitative differences:
the difference between animate and inanimate on the one hand and abstract on
the other is that the first two have physical existence; and the difference between
animate on the one hand and inanimate and abstract on the other is that animates
are potentially alive and the other two are not. In other words, our scale is really
a combination of at least two dimensions.
Again, we could ignore the intrinsic order of the values on our Animacy scale
and simply treat them as nominal data, i.e., count them and report the frequency
with which each value occurs in our data. Potentially ordinal data are actually
frequently treated like nominal data in corpus linguistics (cf. Section 5.3.2), and
with complex scales combining a range of different dimensions, this is proba-
bly a good idea; but ordinal data also have a useful place in quantitative corpus
linguistics.
but because of their nature as numbers. Also, the distance between any two mea-
surements is precisely known and can directly be expressed as a number itself.
This means that we can perform any arithmetic operation on cardinal data – cru-
cially, we can calculate means. Of course, we can also treat cardinal data like
rank data by ignoring all of their mathematical properties other than their order,
and we can also treat them as nominal data.
Typical cases of cardinal variables are demographic variables like Age or In-
come. For example, we can categorize a sample of speakers by their age and then
calculate the mean age of our sample. If our sample contains five 50-year-olds and
five 30-year-olds, it makes perfect sense to say that the mean age in our sample
is 40; we might need additional information to distinguish between this sample
and another sample that consists of 5 41-year-olds and 5 39-year-olds, that would
also have a mean age of 40 (cf. Chapter 6), but the mean itself is meaningful, be-
cause the distance between 30 and 40 is the same as that between 40 and 50 and
all measurements involve just a single dimension (age).
With respect to the two possessives, the variables Length and Givenness are
cardinal variables. It should be obvious that we can calculate the mean length of
words or other constituents in a corpus, a particular sample, a particular position
in a grammatical construction, etc.
As mentioned above, we can also treat cardinal data like ordinal data. This
may sometimes actually be necessary for mathematical reasons (see Chapter 6
below); in other cases, we may want to transform cardinal data to ordinal data
based on theoretical considerations.
For example, the measure of Referential Distance discussed in Chapter 3, Section 3.2 yields cardinal data ranging from 0 to whatever maximum distance we decide on, and it would be possible, and reasonable, to calculate the mean referential distance of a particular type of referring expression. However, Givón (1992: 20f) argues that we should actually think of referential distance as ordinal data:
as most referring expressions consistently have a referential distance of either
0–1, or 2–3, or larger than 3, he suggests converting measures of Referential
Distance into just three categories: minimal gap (0–1), small gap (2–3) and
long gap (> 3). Once we have done this, we can no longer calculate a mean, be-
cause the categories are no longer equivalent in size or distance, but we can still
rank them. Of course, we can also treat them as nominal data, simply counting
the number of referring expressions in the categories minimal gap, small gap
and long gap.
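Such a conversion is trivial to implement. A minimal sketch in Python (the function name and the sample measurements are invented for illustration):

def givon_category(distance):
    # convert a cardinal Referential Distance measurement into
    # one of Givón's three ordinal categories
    if distance <= 1:
        return "minimal gap"   # 0-1
    elif distance <= 3:
        return "small gap"     # 2-3
    else:
        return "long gap"      # > 3

# hypothetical referential distance measurements
distances = [0, 1, 1, 2, 5, 0, 3, 12]
print([givon_category(d) for d in distances])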
5.2 Descriptive statistics for nominal data
Note that the terms s-possessive and of-possessive are typeset in small caps
in these hypotheses. This is done in order to show that they are values of a vari-
able in a particular research design, based on a particular theoretical construct.
As such, these values must, of course, be given operational definitions (also, the
construct upon which the variable is based should be explicated with reference
to a particular model of language, but this would lead us too far from the pur-
pose of this chapter and so I will assume that the phenomenon “English nominal
possession” is self-explanatory).
The definitions I used were the following:
(11) Prediction: There will be more cases of the s-possessive with discourse-
old modifiers than with discourse-new modifiers, and more cases of the
of-possessive with discourse-new modifiers than with discourse-old
modifiers.
Table 5.1 shows the absolute frequencies of the parts of speech of the modifier
in both constructions (examples with proper names were discarded, as the given-
ness of proper names in discourse is less predictable than that of pronouns and
common nouns).
Such a table, examples of which we have already seen in previous chapters,
is referred to as a contingency table. In this case, the contingency table consists
Table 5.1: Part of speech of the modifier in the s-possessive and the of-possessive

                        Possessive
                s-possessive  of-possessive  Total
Givenness  old      180             3         183
           new       20           153         173
Total               200           156         356
of four cells showing the frequencies of the four intersections of the variables Givenness (with the values old, i.e. “pronoun”, and new, i.e. “common noun”) and Possessive (with the values s and of); in other words, it is a two-by-two table. Possessive is presented as the dependent variable here, since logically the
hypothesis is that the information status of the modifier influences the choice of
construction, but mathematically it does not matter in contingency tables what
we treat as the dependent or independent variable.
In addition, there are two cells showing the row totals (the sum of all cells in
a given row) and the column totals (the sum of all cells in a given column), and
one cell showing the table total (the sum of all four intersections). The row and
column totals for a given cell are referred to as the marginal frequencies for that
cell.
5.2.1 Percentages
The frequencies in Table 5.1 are fairly easy to interpret in this case, because the
differences in frequency are very clear. However, we should be wary of basing
our assessment of corpus data directly on raw frequencies in a contingency table.
These can be very misleading, especially if the marginal frequencies of the vari-
ables differ substantially, which in this case, they do: the s-possessive is more fre-
quent overall than the of -possessive and discourse-old modifiers (i.e., pronouns)
are slightly more frequent overall than discourse-new ones (i.e., common nouns).
Thus, it is generally useful to convert the absolute frequencies to relative fre-
quencies, abstracting away from the differences in marginal frequencies. In order
to convert an absolute frequency n into a relative one, we simply divide it by the
total number of cases N of which it is a part. This gives us a decimal fraction
expressing the frequency as a proportion of 1. If we want a percentage instead,
we multiply this decimal fraction by 100, thus expressing our frequency as a pro-
portion of 100.
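This conversion is easy to automate. The following Python sketch derives the column-, row- and table-wise relative frequencies shown in Table 5.2 below from the absolute frequencies in Table 5.1:

# absolute frequencies from Table 5.1
table = {("old", "s"): 180, ("old", "of"): 3,
         ("new", "s"): 20, ("new", "of"): 153}

n_total = sum(table.values())
for (givenness, possessive), n in table.items():
    row_total = sum(v for (g, _), v in table.items() if g == givenness)
    col_total = sum(v for (_, p), v in table.items() if p == possessive)
    print(f"{givenness:3} {possessive:2}",
          f"col: {n / col_total:.4f}",
          f"row: {n / row_total:.4f}",
          f"tab: {n / n_total:.4f}")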
Table 5.2: Absolute and relative frequencies of old and new modifiers in the s- and the of-possessive

                                   Possessive
                          s-possessive  of-possessive   Total
Discourse  old  Abs.            180            3          183
Status          Rel. (Col.)  0.9000       0.0192            –
                Rel. (Row)   0.9836       0.0164       1.0000
                Rel. (Tab.)  0.5056       0.0084       0.5140
           new  Abs.             20          153          173
                Rel. (Col.)  0.1000       0.9808            –
                Rel. (Row)   0.1156       0.8844       1.0000
                Rel. (Tab.)  0.0562       0.4298       0.4860
Total           Abs.            200          156          356
                Rel. (Col.)  1.0000       1.0000       1.0000
                Rel. (Row)        –            –       1.0000
                Rel. (Tab.)  0.5618       0.4382       1.0000
We could imagine a situation, for example, where 90 percent of the cases fell into the intersection s-possessive ∩ discourse-old and 10 percent into the intersection of-possessive ∩ discourse-new – this would still be a corroboration of our hypothesis.
While relative frequencies (whether expressed as decimal fractions or as percentages) are, with due care, more easily interpretable than absolute frequencies, they have two disadvantages. First, by abstracting away from the absolute frequencies, we lose valuable information: we would interpret a distribution such as that in Table 5.2 differently if we knew that it was based on a sample of just 35 instead of 356 corpus hits. Second, they provide no sense of how different our observed distribution is from the distribution that we would expect if there was no relation between our two variables, i.e., if the values were distributed randomly. Thus, instead of (or in addition to) using relative frequencies, we should
compare the observed absolute frequencies of the intersections of our variables
with the expected absolute frequencies, i.e., the absolute frequencies we would
expect if there was a random relationship between the variables. This compari-
son between observed and expected frequencies also provides a foundation for
inferential statistics, discussed in Chapter 6.
were distributed randomly, each intersection of values should have about the
same frequency (just like, when tossing a coin, each side should come up roughly
the same number of times). However, this would only be the case if all marginal
frequencies were the same, for example, if our sample contained fifty s-posses-
sives and fifty of-possessives and fifty of the modifiers were discourse old (i.e.
pronouns) and fifty of them were discourse-new (i.e. common nouns). But this
is not the case: there are more discourse-old modifiers than discourse-new ones
(183 vs. 173) and there are more s-possessives than of -possessives (200 vs. 156).
These marginal frequencies of our variables and their values are a fact about
our data that must be taken as a given when calculating the expected frequencies:
our hypothesis says nothing about the overall frequency of the two construc-
tions or the overall frequency of discourse-old and discourse-new modifiers, but
only about the frequencies with which these values should co-occur. In other
words, the question we must answer is the following: Given that the s- and the of-possessive occur 200 and 156 times respectively, and given that there are 183 discourse-old modifiers and 173 discourse-new modifiers, how frequently would each combination of these values occur by chance?
Put like this, the answer is conceptually quite simple: the marginal frequencies
should be distributed across the intersections of our variables such that the rela-
tive frequencies in each row should be the same as those of the row total and the
relative frequencies in each column should be the same as those of the column
total.
For example, 56.18 percent of all possessive constructions in our sample are
s-possessives and 43.82 percent are of -possessives; if there were a random rela-
tionship between type of construction and givenness of the modifier, we should
find the same proportions for the 183 constructions with old modifiers, i.e. 183 ×
0.5618 = 102.81 s-possessives and 183 × 0.4382 = 80.19 of -possessives. Like-
wise, there are 173 constructions with new modifiers, so 173 × 0.5618 = 97.19
of them should be s-possessives and 173 × 0.4382 = 75.81 of them should be of -
possessives. The same goes for the columns: 51.4 percent of all constructions
have old modifiers and 48.6 percent have new modifiers. If there were a ran-
dom relationship between type of construction and givenness of the modifier,
we should find the same proportions for both types of possessive construction:
there should be 200 × 0.514 = 102.8 s-possessives with old modifiers and 97.2
with new modifiers, as well as 156 × 0.514 = 80.18 of -possessives with old mod-
ifiers and 156 × 0.486 = 75.82 of -possessives with new modifiers. Note that the
expected frequencies for each intersection are the same whether we use the total
row percentages or the total column percentages: the small differences are due
to rounding errors.
To avoid rounding errors, we should not actually convert the row and column
totals to percentages at all, but use the following much simpler way of calcu-
lating the expected frequencies: for each cell, we simply multiply its marginal
frequencies and divide the result by the table total as shown in Table 5.3; note
that we are using the standard convention of using O to refer to observed fre-
quencies, E to refer to expected frequencies, and subscripts to refer to rows and
columns. The convention for these subscripts is as follows: use 1 for the first row
or column, 2 for the second row or column, and T for the row or column total,
and give the index for the row before that of the column. For example, E21 refers
to the expected frequency of the cell in the second row and the first column, O1T refers to the total of the first row, and so on.
Table 5.3: Calculating expected frequencies from observed frequencies

                               Dependent Variable
                      value 1                    value 2                    Total
Independent  value 1  𝐸11 = (𝑂T1 × 𝑂1T) / 𝑂TT    𝐸12 = (𝑂T2 × 𝑂1T) / 𝑂TT    𝑂1T
Variable     value 2  𝐸21 = (𝑂T1 × 𝑂2T) / 𝑂TT    𝐸22 = (𝑂T2 × 𝑂2T) / 𝑂TT    𝑂2T
Total                 𝑂T1                        𝑂T2                        𝑂TT
Applying this procedure to our observed frequencies yields the results shown
in Table 5.4. One should always report nominal data in this way, i.e., giving both
the observed and the expected frequencies in the form of a contingency table.
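In code, the procedure amounts to a single nested loop (or list comprehension). The following Python sketch derives the expected frequencies of Table 5.4 from the observed frequencies in Table 5.1:

# observed frequencies from Table 5.1, one list per row (old, new)
observed = [[180, 3],
            [20, 153]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# E = (row total x column total) / table total, for each cell
expected = [[r * c / table_total for c in col_totals] for r in row_totals]
print([[round(e, 2) for e in row] for row in expected])
# [[102.81, 80.19], [97.19, 75.81]]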
We can now compare the observed and expected frequencies of each inter-
section to see whether the difference conforms to our quantitative prediction.
This is clearly the case: for the intersections s-possessive ∩ discourse-old and
of-possessive ∩ discourse-new, the observed frequencies are higher than the
expected ones, for the intersections s-possessive ∩ discourse-new and of-pos-
sessive ∩ discourse-old, the observed frequencies are lower than the expected
ones.
This conditional distribution seems to corroborate our hypothesis. However,
note that it does not yet prove or disprove anything, since, as mentioned above,
we would never expect a real-world distribution of events to match the expected
distribution perfectly. We will return to this issue in Chapter 6.
Table 5.4: Observed and expected frequencies of old and new modifiers in the s- and the of-possessive

                               Possessive
                       s-possessive  of-possessive  Total
Discourse  old   Obs.      180              3        183
Status           Exp.      102.81          80.19
           new   Obs.       20            153        173
                 Exp.       97.19          75.81
Total            Obs.      200            156        356
5.3 Descriptive statistics for ordinal data

The constructions are operationalized as before. The data used are based on
the same data set, except that cases with proper names are now included. For
expository reasons, we are going to look at a ten-percent subsample of the full
sample, giving us 23 s-possessives and 18 of-possessives.
Animacy was operationally defined in terms of the annotation scheme shown
in Table 5.5 (based on Zaenen et al. 2004).
As pointed out above, Animacy hierarchies are a classic example of ordinal
data, as the categories can be ordered (although there may be some disagreement
about the exact order), but we cannot say anything about the distance between
one category and the next, and there is more than one conceptual dimension involved (I ordered them according to dimensions like “potential for life”, “touchability” and “conceptual independence”).
(13) Prediction: The modifiers of the s-possessive will tend to occur high on
the Animacy scale, the modifiers of the of-possessive will tend to occur
low on the Animacy scale.
Note that phrased like this, it is not yet a quantitative prediction, since “tend
to” is not a mathematical concept. While frequency for nominal data and mean (or
“average”) for cardinal data are used in everyday language with something close
to their mathematical meaning, we do not have an everyday word for dealing
with differences in ordinal data. We will return to this point presently, but first,
let us look at the data impressionistically. Table 5.6 shows the annotated sample
(cases are listed in the order in which they occurred in the corpus).
A simple way of finding out whether the data conform to our prediction would
be to sort the entire data set by the rank assigned to the examples and check
whether the s-possessives cluster near the top of the list and the of -possessives
near the bottom. Table 5.7 shows this ranking.
Table 5.7 shows that the data conform to our hypothesis: among the cases
whose modifiers have an animacy of rank 1 to 3, s-possessives dominate, among
those with a modifier of rank 4 to 10, of -possessives make up an overwhelming
majority.
However, we need a less impressionistic way of summarizing data sets coded
as ordinal variables, since not all data sets will be as straightforwardly inter-
pretable as this one. So let us turn to the question of an appropriate descriptive
statistic for ordinal data.
Table 5.7: The annotated sample from Table 5.6 ordered by animacy rank (the right-hand column continues the left-hand one)

Anim.  Type  No.       Anim.  Type  No.
1      s     (a 2)     4      of    (b 13)
1      s     (a 3)     5      s     (a 11)
1      s     (a 7)     5      of    (b 5)
1      s     (a 8)     5      of    (b 11)
1      s     (a 9)     5      of    (b 16)
1      s     (a 10)    5      of    (b 17)
1      s     (a 12)    5      of    (b 18)
1      s     (a 15)    6      s     (a 4)
1      s     (a 17)    6      of    (b 9)
1      s     (a 18)    7      of    (b 3)
1      s     (a 19)    7      of    (b 14)
1      s     (a 20)    8      of    (b 1)
1      s     (a 21)    9      of    (b 12)
1      s     (a 23)    10     of    (b 6)
1      of    (b 4)     10     of    (b 8)
1      of    (b 7)     10     of    (b 10)
2      s     (a 1)     10     of    (b 15)
2      s     (a 5)
2      s     (a 6)
2      s     (a 13)
2      s     (a 14)
2      of    (b 2)
3      s     (a 16)
3      s     (a 22)
5.3.1 Medians
As explained above, we cannot calculate a mean for a set of ordinal values, but we
can do something similar. The idea behind calculating a mean value is, essentially,
to provide a kind of mid-point around which a set of values is distributed – it is
a so-called measure of central tendency. Thus, if we cannot calculate a mean, the
next best thing is to simply list our data ordered from highest to lowest and find
the value in the middle of that list. This value is known as the median – a value
that splits a sample or population into a higher and a lower portion of equal sizes.
For example, the rank values for the Animacy of our sample of s-possessives are shown in Figure 5.1a. There are 23 values, thus the median is the twelfth value in the series (marked by a dot labeled M) – there are eleven values above it and eleven below it. The twelfth value in the series is a 1, so the median value of s-possessive modifiers in our sample is 1 (or human).
(a) 1 1 1 1 1 1 1 1 1 1 1 [1] 1 1 2 2 2 2 2 3 3 5 6

(b) 1 1 2 4 5 5 5 5 [5 6] 7 7 8 9 10 10 10 10

Figure 5.1: Medians for (a) the s-possessives and (b) the of-possessives in Table 5.7 (the median position M is marked by square brackets)

For the of-possessives in Figure 5.1b, there is an even number of values (18), so the median falls between the ninth and the tenth value; by convention, we take the point halfway between the two values (5 and 6), giving us a median of 5.5. We can now state our prediction:
(14) Prediction: The modifiers of the s-possessive will have a higher median
on the Animacy scale than the modifiers of the of-possessive.
Our data conform to this prediction, as 1 is higher on the scale than 5.5. As
before, this does not prove or disprove anything, as, again, we would expect
some random variation. Again, we will return to this issue in Chapter 6.
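If the animacy ranks are coded as numbers, as in Figure 5.1, the medians are easily checked with the statistics module from Python’s standard library (keeping in mind that the resulting numbers stand for positions on the Animacy scale, not for quantities):

import statistics

# animacy ranks from Figure 5.1
s_poss = [1] * 14 + [2] * 5 + [3] * 2 + [5, 6]
of_poss = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6, 7, 7, 8, 9, 10, 10, 10, 10]

print(statistics.median(s_poss))   # 1   (the twelfth of 23 values)
print(statistics.median(of_poss))  # 5.5 (between the 9th and 10th of 18 values)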
This table also nicely shows the preference of the s-possessive for animate
modifiers (human, organization, other animate) and the preference of the of -
possessive for the categories lower on the hierarchy. The table also shows, how-
ever, that the modifiers of the of -possessive are much more evenly distributed
across the entire Animacy scale than those of the s-possessive.
For completeness’ sake, let me point out that there is a third measure of central tendency that is especially suited to nominal data (but can also be applied to ordinal and cardinal data): the mode. The mode is simply the most frequent
value in a sample, so the modifiers of the of -possessive have a mode of 5 (or
concrete touchable) and those of the s-possessive have a mode of 1 (or hu-
man) with respect to animacy (similarly, we could have said that the mode of
s-possessive modifiers is discourse-old and the mode of of -possessive modi-
fiers is discourse-new). There may be more than one mode in a given sample.
For example, if we had found just a single additional modifier of the type ab-
stract in the sample above (which could easily have happened), its frequency
would also be five; in this case, the of-possessive modifiers would have two modes
(concrete touchable and abstract).
The concept of mode may seem useful in cases where we are looking for a sin-
gle value by which to characterize a set of nominal data, but on closer inspection
it turns out that it does not actually tell us very much: it tells us what the most
frequent value is, but it does not tell us how much more frequent that value is
than the next most frequent one, how many other values occur in the data at all,
etc. Thus, it is always preferable to report the frequencies of all values, and, in
fact, I have never come across a corpus-linguistic study reporting modes.
5.4 Descriptive statistics for cardinal data

(15) Assumption: Short items tend to occur toward the beginning of a constituent, long items tend to occur at the end.
Hypothesis: The s-possessive will be used with short modifiers, the of-possessive will be used with long modifiers.
The constructions are operationalized as before. The data used are based on
the same data set as before, except that cases with proper names and pronouns
are excluded. The reason for this is that we already know from the first case
study that pronouns, which we used as an operational definition of old information, prefer the s-possessive. Since all pronouns are very short (regardless of
whether we measure their length in terms of words, syllables or letters), includ-
ing them would bias our data in favor of the hypothesis. This left 20 cases of the
s-possessive and 154 cases of the of -possessive. To get samples of roughly equal
size for expository clarity, let us select every sixth case of the of -possessive, giv-
ing us 25 cases (note that in a real study, there would be no good reason to create
such roughly equal sample sizes – we would simply use all the data we have).
The variable Length was defined operationally as “number of orthographic
words”. We can now state the following prediction:
Table 5.9 shows the length of head and modifier for all cases in our sample.
5.4.1 Means
How to calculate a mean (more precisely, an arithmetic mean) should be common
knowledge, but for completeness’ sake, the formula is given in (17):
(17) 𝑥𝑎𝑟𝑖𝑡ℎ𝑚 = (1/𝑛) ∑𝑖=1…𝑛 𝑥𝑖 = (𝑥1 + 𝑥2 + … + 𝑥𝑛) / 𝑛
5.5 Summary
We have looked at three case studies, one involving nominal, one ordinal and
one cardinal data. In each case, we were able to state a hypothesis and derive
a quantitative prediction from it. Using appropriate descriptive statistics (per-
centages, observed and expected frequencies, modes, medians and means), we
were able to determine that the data conform to these predictions – i.e., that the
quantitative distribution of the values of the variables Givenness (measured by
Part of Speech), Animacy and Length across the conditions s-possessive and of-possessive fits the predictions formulated.
However, these distributions by themselves do not prove (or, more precisely,
fail to disprove) the hypotheses for two related reasons. First, the predictions are
stated in relative terms, i.e. in terms of more-or-less, but they do not tell us how
much more or less we should expect to observe. Second, we do not know, and
currently have no way of determining, whether the more-or-less that we observe
reflects real differences in distribution, or whether it falls within the range of
random variation that we always expect when observing tendencies. More gen-
erally, we do not know how to apply the Popperian all-or-nothing research logic
to quantitative predictions. All this will be the topic of the next chapter.
6 Significance testing
As discussed extensively in Chapter 3, scientific hypotheses that are stated in
terms of universal statements can only be falsified (proven to be false), but never
verified (proven to be true). This insight is the basis for the Popperian idea of a
research cycle where the researcher formulates a hypothesis and then attempts
to falsify it. If they manage to do so, the hypothesis has to be rejected and re-
placed by a new hypothesis. As long as they do not manage to do so, they may
continue to treat it as a useful working hypothesis. They may even take the re-
peated failure to falsify a hypothesis as corroborating evidence for its correctness.
If the hypothesis can be formulated in such a way that it could be falsified by a
counterexample (and if it is clear what would count as a counterexample), this
procedure seems fairly straightforward.
However, as also discussed in Chapter 3, many if not most hypotheses in cor-
pus linguistics have to be formulated in relative terms – like those introduced
in Chapter 5. As discussed in Section 3.1.2, individual counterexamples are irrel-
evant in this case: if my hypothesis is that most swans are white, this does not
preclude the existence of differently-colored swans, so the hypothesis is not fal-
sified if we come across a black swan in the course of our investigation. In this
chapter, we will discuss how relative statements can be investigated within the
scientific framework introduced in Chapter 3.
6.1 Statistical hypothesis testing
Once we have formulated our research hypothesis and the corresponding null
hypothesis in this way (and once we have operationalized the constructs used
in formulating them), we collect, annotate and quantify the relevant data, as dis-
cussed in the preceding chapter.
The crucial step in terms of statistical significance testing then consists in de-
termining whether the observed distribution differs from the distribution we
would expect if the null hypothesis were true – if the values of our variables
were distributed randomly in the data. Of course, it is not enough to observe a
difference – a certain amount of variation is to be expected even if there is no
relationship between our variables. As will be discussed in detail in the next sec-
tion, we must determine whether the difference is large enough to assume that
it does not fall within the range of variation that could occur randomly. If we are
satisfied that this is the case, we can (provisionally) reject the null hypothesis. If
not, we must (provisionally) reject our research hypothesis.
In a third step (or in parallel with the second step), we must determine whether
the data conform to our research hypothesis, or, more precisely, whether they
differ from the prediction of H0 in the direction predicted by H1 . If they do (for
example, if there are more white swans than black swans), we can (provisionally)
accept our research hypothesis, i.e., we can continue to use it as a working hy-
pothesis in the same way that we would continue to use an absolute hypothesis
in this way as long as we do not find a counterexample. If the data differ from
the prediction of H0 in the opposite direction to that predicted by our research
hypothesis – for example, if there are more black than white swans – we must,
of course, also reject our research hypothesis, and treat the unexpected result as
a new problem to be investigated further.
Let us now turn to a more detailed discussion of probabilities, random varia-
tion and how statistics can be used to (potentially) reject null hypotheses.
6.2 Probabilities and significance testing
Assume we flip a coin ten times. We would not be surprised if the coin came down heads six times and tails four times,
or even heads seven times and tails three times, but we might already be slightly
surprised if it came down heads eight times and tails only twice, and we would
certainly be surprised to get a series of ten heads and no tails.
Let us look at the reasons for this surprise, beginning with a much shorter
series of just two coin flips. There are four possible outcomes of such a series:

(2) a. heads – heads
    b. heads – tails
    c. tails – heads
    d. tails – tails

Obviously, none of these outcomes is more or less probable than the others: since there are four possible outcomes, they each have a probability of 1/4 = 0.25 (i.e., 25 percent; we will be using the decimal notation for percentages from here on). Alternatively, we can calculate the probability of each series by multiplying the probability of the individual events in each series, i.e. 0.5 × 0.5 = 0.25.
Crucially, however, there are differences in the probability of getting a partic-
ular set of results (i.e., a particular number of heads and tails, regardless of the order
they occur in): There is only one possibility of getting two heads (2a) and one
of getting two tails (2d), but there are two possibilities of getting one head and
one tail (2b, c). We calculate the probability of a particular set by adding up the
probabilities of all possible series that will lead to this set. Thus, the probabilities
for the sets {heads, heads} and {tails, tails} are 0.25 each, while the probability for
the set {heads, tails}, corresponding to the series heads–tails and tails–heads, is
0.25 + 0.25 = 0.5.
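The same logic can be spelled out in a few lines of Python, enumerating all possible series of a given length and deriving the probability of each set:

from itertools import product
from collections import Counter

n = 2  # length of the series of coin flips
series = list(product("HT", repeat=n))

# count how many series lead to each set (i.e., each number of heads)
sets = Counter(s.count("H") for s in series)
for heads, ways in sorted(sets.items()):
    print(f"{heads} heads: {ways} of {len(series)} series,",
          f"probability {ways / len(series)}")

Setting n to larger values reproduces the figures discussed below (though for large n, counting via binomial coefficients is far more efficient than enumerating every series).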
This kind of coin-flip logic (also known as probability theory) can be utilized in evaluating hypotheses that have been stated in quantitative terms.
Take the larger set of ten coin flips mentioned at the beginning of this section:
now, there are eleven potential outcomes, shown in Table 6.1.
Again, these outcomes differ with respect to their probability. The third column of Table 6.1 gives us the number of different series corresponding to each set.² For example, there is only one way to get a set consisting of heads only: the
coin must come down showing heads every single time. There are ten different
ways of getting one heads and nine tails: The coin must come down heads the
² You may remember having heard of Pascal’s triangle, which, among more sophisticated things,
lets us calculate the number of different ways in which we can get a particular combination of
heads and tails for a given number of coin flips: the third column of Table 6.1 corresponds to
line 11 of this triangle. If you don’t remember, no worries, we will not need it.
extreme result of 0:10. So the probability that we are wrong in rejecting the null
hypothesis is 0.000977 + 0.009766 = 0.010743. In other words: the probability
that we are wrong in rejecting the null hypothesis is always the probability of
the observed result plus the probabilities of all results that deviate from the null
hypothesis even further in the direction of the observed frequency. This is called
the probability of error (or simply p-value) in statistics.
It must be mentioned at this point that some researchers (especially oppo-
nents of null-hypothesis statistical significance testing) disagree that p can be
interpreted as the probability that we are wrong in rejecting the null hypothesis,
raising enough of a controversy to force the American Statistical Association to
take an official stand on the meaning of p:
lead us to wrongly reject the null hypothesis, but the more studies we conduct
that allow us to reject the null hypothesis, the more justified we are in treating
them as corroborating our research hypothesis.
By convention, a probability of error of 0.05 (five percent) is considered to be the limit as far as acceptable risks are concerned in statistics – if 𝑝 < 0.05 (i.e., if p is smaller than five percent), the result is said to be statistically significant (i.e., not due to chance); if it is larger, the result is said to be non-significant (i.e., likely due to chance). Table 6.2 shows additional levels of significance that are conventionally recognized.
Table 6.2: Interpretation of 𝑝-values

𝑝 < 0.001   highly significant
𝑝 < 0.01    very significant
𝑝 < 0.05    significant
𝑝 ≥ 0.05    not significant
Obviously, these cut-off points are largely arbitrary (a point that is often crit-
icized by opponents of null-hypothesis significance testing): it is strange to be
confident in rejecting a null hypothesis if the probability of being wrong in do-
ing so is five percent, but to refuse to reject it if the probability of being wrong
is six percent (or, as two psychologists put it: “Surely, God loves the .06 nearly
as much as the .05” (Rosnow & Rosenthal 1989: 1277)).
In real life, of course, researchers do not treat these cut-off points as absolute.
Nobody would simply throw away a set of carefully collected data as soon as
their calculations yielded a 𝑝-value of 0.06 or even 0.1. Some researchers actually
report such results, calling 𝑝-values between 0.05 and 0.10 “marginally signifi-
cant”, and although this is often frowned upon, there is nothing logically wrong
with it. Even the majority of researchers who are unwilling to report such re-
sults would take them as an indicator that additional research might be in order
(especially if there is a reasonable effect size, see further below).
They might re-check their operational definitions and the way they were ap-
plied, they might collect additional data in order to see whether a larger data
set yields a lower probability of error, or they might replicate the study with a
different data set. Note that this is perfectly legitimate, and completely in line
with the research cycle sketched out in Section 3.3 – provided we retain all of
our data. What we must not do, of course, is test different data sets until we find
one that gives us a significant result, and then report just that result, ignoring all
attempts that did not yield significant results. What we must also not do is col-
lect an extremely large data set and then keep drawing samples from it until we
happen to draw one that gives us a significant result. These practices are some-
times referred to as p-hacking, and they constitute scientific fraud (imagine a researcher who wants to corroborate their hypothesis that all swans are white and does so by simply ignoring all black swans they find).
Clearly, what probability of error one is willing to accept for any given study
also depends on the nature of the study, the nature of the research design, and a
general disposition to take or avoid risk. If mistakenly rejecting the null hypoth-
esis were to endanger lives (for example, in a study of potential side-effects of a
medical treatment), we might not be willing to accept a 𝑝-value of 0.05 or even
0.01.
Why would collecting additional data be a useful strategy, or, more generally
speaking, why are corpus-linguists (and other scientists) often intent on making
their samples as large as possible and/or feasible? Note that the probability of
error depends not just on the proportion of the deviation, but also on the overall
size of the sample. For example, if we observe a series of two heads and eight
tails (i.e., twenty percent heads), the probability of error in rejecting the null hy-
pothesis is 0.000977 + 0.009766 + 0.043945 = 0.054688. However, if we observe
a series of four heads and sixteen tails (again, twenty percent heads), the prob-
ability of error would be roughly ten times lower, namely 0.005909. The reason
is the following: There are 1 048 576 possible series of twenty coin flips. There is still only one way of getting no heads and twenty tails, so the probability of getting no heads and twenty tails is 1/1048576 = 0.0000009536743; however, there are already 20 ways of getting one head and nineteen tails (so the probability is 20/1048576 = 0.000019), 190 ways of getting two heads and eighteen tails (𝑝 = 190/1048576 = 0.000181), 1140 ways of getting three heads and seventeen tails (𝑝 = 1140/1048576 = 0.001087) and 4845 ways of getting four heads and sixteen tails (𝑝 = 4845/1048576 = 0.004621). Adding up these probabilities gives us 0.005909.
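Sums like these are tedious to work out by hand but trivial to automate; a short Python sketch, using the binomial coefficient to count the series corresponding to each set:

from math import comb

def p_value(heads, flips):
    # probability of the observed number of heads plus all results
    # that deviate even further in the same direction (fewer heads)
    return sum(comb(flips, k) for k in range(heads + 1)) / 2 ** flips

print(round(p_value(2, 10), 6))  # 0.054688
print(round(p_value(4, 20), 6))  # 0.005909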
Most research designs in any discipline are more complicated than coin flip-
ping, which involves just a single variable with two values. However, it is theo-
retically possible to generalize the coin-flipping logic to any research design, i.e.,
calculate the probabilities of all possible outcomes and add up the probabilities
of the observed outcome and all outcomes that deviate from the expected out-
come even further in the same direction. Most of the time, however, this is only
a theoretical possibility, as the computations quickly become too complex to be
performed in a reasonable time frame even by supercomputers, let alone by a
standard-issue home computer or manually.
6.3 Nominal data: The chi-square test
coefficient or the Kendall tau rank correlation coefficient (if one or both of our
variables are ordinal).
I will not, in other words, do much more than scratch the surface of the vast
discipline of statistics. In the Study Notes to this chapter, there are a number of
suggestions for further reading that are useful for anyone interested in a deeper
understanding of the issues introduced here, and obligatory for anyone serious
about using statistical methods in their own research. While I will not be making
reference to any statistical software applications, such applications are necessary
for serious quantitative research; again, the Study Notes contain useful sugges-
tions where to look.
We already reported the observed and expected frequencies in Table 5.4, but
let us repeat them here as Table 6.3 for convenience in a slightly simplified form
that we will be using from now on, with the expected frequencies shown in paren-
theses below the observed ones.
Table 6.3: Observed and expected frequencies of old and new modifiers
in the s- and the of -possessive (= Table 5.4)
Possessive
s-possessive of-possessive Total
Discourse old 180 3 183
Status (102.81) (80.19)
new 20 153 173
(97.19) (75.81)
Total 200 156 356
In order to test our research hypothesis, we must show that the observed fre-
quencies differ from the null hypothesis in the direction of our prediction. We
already saw in Chapter 5 that this is the case: The null hypothesis predicts the
expected frequencies, but there are more cases of s-possessives with old modi-
fiers and of -possessives with new modifiers than expected. Next, we must apply
the coin-flip logic and ask the question: “Given the sample size, how surprising
is the difference between the expected frequencies (i.e., a perfectly random dis-
tribution) and the observed frequencies (i.e., the distribution we actually find in
our data)?”
of 𝜒 2 (or simply 𝜒 2 components); the formulas for calculating the cell components
in this way are shown in Table 6.4.
Table 6.4: Calculating 𝜒 2 components for individual cells

                               Dependent Variable
                      value 1               value 2
Independent  value 1  (𝑂11 − 𝐸11)² / 𝐸11    (𝑂12 − 𝐸12)² / 𝐸12
Variable     value 2  (𝑂21 − 𝐸21)² / 𝐸21    (𝑂22 − 𝐸22)² / 𝐸22
If we apply this procedure to Table 6.3, we get the components shown in Ta-
ble 6.5.
Table 6.5: 𝜒 2 components for Table 6.3

                   Possessive
                   s-possessive                       of-possessive
Discourse  old     (180 − 102.81)² / 102.81 = 57.96   (3 − 80.19)² / 80.19 = 74.30
Status     new     (20 − 97.19)² / 97.19 = 61.31      (153 − 75.81)² / 75.81 = 78.60
The degree of deviance from the expected frequencies for the entire table can then be calculated by adding up the 𝜒 2 components. For Table 6.3, this gives us a 𝜒 2 value of 272.16. This value can now be used to determine the probability of error by checking it against a table like that in Section 14.1 in the Statistical Tables at the end of this book.
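The whole calculation can be sketched in a few lines of Python. Note that the conversion of 𝜒 2 into a probability of error shown in the last step relies on the fact, explained below, that a two-by-two table has one degree of freedom; for larger tables, one would consult a 𝜒 2 table or a statistics package:

from math import erfc, sqrt

# observed frequencies from Table 6.3
observed = [[180, 3],
            [20, 153]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# add up the components (O - E)^2 / E for all four cells
chisq = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / n
        chisq += (observed[i][j] - e) ** 2 / e

# with one degree of freedom, the p-value follows from the
# normal distribution: p = erfc(sqrt(chisq / 2))
print(round(chisq, 2), erfc(sqrt(chisq / 2)))  # 272.16 and p far below 0.001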
Before we can do so, there is a final technical point to make. Note that the
degree of variation in a given table that is expected to occur by chance depends
quite heavily on the size of the table. The bigger the table, the higher the number
of cells that can vary independently of other cells without changing the marginal
sums (i.e., without changing the overall distribution). The number of such cells
that a table contains is referred to as the number of degrees of freedom of the table.
In the case of a two-by-two table, there is just one such cell: if we change any
single cell, we must automatically adjust the other three cells in order to keep the
marginal sums constant. Thus, a two-by-two table has one degree of freedom.
The general formula for determining the degrees of freedom of a table is the following, where 𝑁𝑟𝑜𝑤𝑠 is the number of rows and 𝑁𝑐𝑜𝑙𝑢𝑚𝑛𝑠 is the number of columns:

df = (𝑁𝑟𝑜𝑤𝑠 − 1) × (𝑁𝑐𝑜𝑙𝑢𝑚𝑛𝑠 − 1)
In the present case, the analysis might be summarized along the following
lines: “This study has shown that s-possessives are preferred when the modi-
fier is discourse-old while of -possessives are preferred when the modifier is dis-
course-new. The differences between the constructions are highly significant
(𝜒 2 = 272.16, df = 1, 𝑝 < 0.001)”.
A potential danger of this way of formulating the results is the meaning of the word significant. In statistical terminology, this word simply means that the
results obtained in a study based on one particular sample are unlikely to be due
to chance and can therefore be generalized, with some degree of certainty, to the
entire population. In contrast, in every-day usage the word means something
along the lines of ‘having an important effect or influence’ (LDCE, s.v. signifi-
cant). Because of this every-day use, it is easy to equate statistical significance
with theoretical importance. However, there are at least three reasons why this
equation must be avoided.
First, and perhaps most obviously, statistical significance has nothing to do
with the validity of the operational definitions used in our research design. In
our case, this validity is reasonably high, provided that we limit our conclusions
to written English. As a related point, statistical significance has nothing to do
with the quality of our data. If we have chosen unrepresentative data or if we
have extracted or annotated our data sloppily, the statistical significance of the
results is meaningless.
𝜙 = √(272.16 / 356) = 0.8744
³ This problem cannot be dismissed as lightly as this example may suggest: it points to a fun-
damental difficulty in doing science. Note that if we did find that the font has an influence
on the choice of possessive, we would most likely dismiss this finding as a random fluke de-
spite its statistical significance. And we may well be right, since even a level of significance
of 𝑝 < 0.001 does not preclude the possibility that the observed frequencies are due to chance.
In contrast, an influence of the discourse status of the modifier makes sense because discourse
status has been shown to have effects in many areas of grammar, and thus we are unlikely to
question such an influence. In other words, our judgment of what is and is not plausible will
influence our interpretation of our empirical results even if they are statistically significant.
Alternatively, we could take every result seriously and look for a possible explanation, which
will then typically require further investigation. For example, we might hypothesize that there
is a relationship between font and level of formality, and the latter has been shown to have an
influence on the choice of possessive constructions (Jucker 1993).
⁴ This statement must be qualified to a certain degree: given the right research design, statis-
tical significance may actually be a very reasonable indicator of association strength (cf. e.g.
Stefanowitsch & Gries 2003, Gries & Stefanowitsch 2004 for discussion). However, in most con-
texts we are well advised to keep statistical significance and association strength conceptually
separate.
Our 𝜙-value of 0.8744 falls into the very strong category, which is unusual in
uncontrolled observational research, and which suggests that Discourse Status
is indeed a very important factor in the choice of Possessive constructions in
English.
Exactly how much of the variance in the use of the two possessives is ac-
counted for by the discourse status of the modifier can be determined by looking
at the square of the 𝜙 coefficient: the square of a correlation coefficient gener-
ally tells us what proportion of the distribution of the dependent variable we
can account for on the basis of the independent variable (or, more generally,
what proportion of the variance our design has captured). In our case, 𝜙 2 =
(0.8744 × 0.8744) = 0.7645. In other words, the variable Discourse Status ex-
plains roughly three quarters of the variance in the use of the Possessive con-
structions – if, that is, our operational definition actually captures the discourse
status of the modifier, and nothing else. A more precise way of reporting the
results from our study would be something like the following: “This study has
shown a strong and statistically highly significant influence of Discourse Sta-
tus on the choice of possessive construction: s-possessives are preferred when
the modifier is discourse-old (defined in this study as being realized by a pro-
noun) while of -possessives are preferred when the modifier is discourse-new
(defined in this study as being realized by a lexical NP) (𝜒 2 = 272.16, df = 1, 𝑝 <
0.001, 𝜙 2 = 0.7645)”.
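To make the arithmetic reproducible, here is a minimal sketch in Python (the language is chosen purely for illustration; the two input values are the 𝜒 2 value and the sample size from the example above):

```python
import math

# chi-square value and sample size from the possessives example above
chi_square = 272.16
n = 356

phi = math.sqrt(chi_square / n)   # effect size: the phi coefficient
print(round(phi, 4))              # 0.8744
print(round(phi ** 2, 4))         # 0.7645, i.e. roughly three quarters
                                  # of the variance accounted for
```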
             Observed   Expected                  𝜒 2
Sex  female  1317       1814                      (1317 − 1814)² / 1814 = 136.17
     male    2311       1814                      (2311 − 1814)² / 1814 = 136.17
Total        3628                                 272.34

             Observed   Expected                  𝜒 2
Sex  female  1317       3628 × 0.514 = 1864.79    (1317 − 1864.79)² / 1864.79 = 160.92
     male    2311       3628 × 0.486 = 1763.21    (2311 − 1763.21)² / 1763.21 = 170.19
Total        3628                                 331.1
Clearly, the empirical distribution in this case closely resembles our hypoth-
esized equal distribution, and thus the results are very similar – since there are
slightly more women than men in the population, their underrepresentation in
the corpus is even more significant.
Incidentally, the BNC not only contains speech by more male speakers than
female speakers, it also includes more speech by male than by female speakers:
men contribute 5 654 348 words, women contribute 3 825 804. I will leave it as an
exercise to the reader to determine whether and in what direction these frequen-
cies differ from what would be expected either under an assumption of equal
proportions or given the proportion of female and male speakers in the corpus.
In the case of speaker sex it does not make much of a difference how we derive
the expected frequencies, as men and women make up roughly half of the pop-
ulation each. For variables where such an even distribution of values does not
exist, the differences between these two procedures can be quite drastic. As an
example, consider Table 6.9, which lists the observed distribution of the speak-
ers in the spoken part of the BNC across age groups (excluding speakers whose
age is not recorded), together with the expected frequencies on the assumption
of equal proportions, and the expected frequencies based on the distribution of
speakers across age groups in the real world. The distribution of age groups in
the population of the UK between 1991 and 1994 is taken from the website of the
Office for National Statistics, averaged across the four years and cumulated to
correspond to the age groups recorded in the BNC.
Table 6.9: Observed and expected frequencies of Speaker Age in the
BNC
the differences between observed and expected are highly significant. However,
the distribution of age groups in the corpus is much closer to the assumption of
equal proportions than to the actual proportions in the population; also, the con-
clusions we will draw concerning the over- or underrepresentation of individual
categories will be very different. In the first case, for example, we might be led to
believe that the age group 35–44 is fairly represented while the age group 15–24
is underrepresented. In the second case, we see that in fact both age groups are
overrepresented. In this case, there is a clear argument for using empirically de-
rived expected frequencies: the categories differ in terms of the age span each of
them covers, so even if we thought that the distribution of ages in the population
is homogeneous, we would not expect all categories to have the same size.
The exact alternative to the univariate 𝜒 2 test with a two-level variable is the
binomial test, which we used (without calling it that) in our coin-flip example in
Section 6.2 above and which is included as a predefined function in many major
spreadsheet applications and in R; for one-by-n tables, there is a multinomial test
also available in R and other statistics packages.
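As an illustration, here is a minimal Python sketch of the exact binomial test, applied to the speaker-sex counts discussed above; it assumes scipy 1.7 or later (older versions provide the same functionality as scipy.stats.binom_test):

```python
from scipy.stats import binomtest

# 1317 female speakers out of 3628; tested against the assumed
# population proportion of 0.514 (use p=0.5 for equal proportions)
result = binomtest(k=1317, n=3628, p=0.514, alternative="two-sided")
print(result.pvalue)  # vastly smaller than 0.001
```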
6.4 Ordinal data: The Mann-Whitney U-test
also be 2. Recall that the observed median animacy in our sample was 1 for the s-
possessive and 5 for the of -possessive, which deviates from the prediction of the
H0 in the direction of our H1 . However, as in the case of nominal data, a certain
amount of deviation from the null hypothesis will occur due to chance, so we
need a test statistic that will tell us how likely our observed result is. For ordinal
data, this test statistic is the U value, which is calculated as follows.
In a first step, we have to determine the rank order of the data points in our
sample. For expository reasons, let us distinguish between the rank value and
the rank position of a data point: the rank value is the ordinal value it received
during annotation (in our case, its value on the Animacy scale), its rank position
is the position it occupies in an ordered list of all data points. If every rank value
occurred only once in our sample, rank value and rank position would be the
same. However, there are 41 data points in our sample, so the rank positions will
range from 1 to 41, and there are only 10 rank values in our annotation scheme for
Animacy. This means that at least some rank values will occur more than once,
which is a typical situation for corpus-linguistic research involving ordinal data.
Table 6.10 shows all data points in our sample together with their rank position.
Every rank value except 4, 8 and 9 occurs more than once; for example, there
are sixteen cases that have an Animacy rank value of 1 and six cases that have
a rank value of 2, two cases that have a rank value of 3, and so on. This means
we cannot simply assign rank positions from 1 to 41 to our examples, as there is
no way of deciding which of the sixteen examples with the rank value 1 should
receive the rank position 1, 2, 3, etc. Instead, these 16 examples as a group share
the range of ranks from 1 to 16, so each example gets the mean rank position of
this range. There are sixteen cases with rank value 1, so their mean rank is
$\frac{1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+16}{16} = \frac{136}{16} = 8.5$
The first example with the rank value 2 occurs in line 17 of the table, so it
would receive the rank position 17. However, there are five more examples with
the same rank value, so again we calculate the mean rank position of the range
from rank 17 to 22, which is
$\frac{17+18+19+20+21+22}{6} = \frac{117}{6} = 19.5$
Repeating this process for all examples yields the rank positions shown in the
second column in Table 6.10.
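The averaging of tied ranks does not have to be done by hand; here is a sketch using scipy's rankdata function, which assigns mean rank positions to ties by default:

```python
from scipy.stats import rankdata

# the 41 animacy rank values from Table 6.10, in ascending order
animacy = [1]*16 + [2]*6 + [3]*2 + [4] + [5]*6 + [6]*2 + [7]*2 + [8] + [9] + [10]*4
positions = rankdata(animacy)       # method="average" is the default
print(positions[0], positions[16])  # 8.5 and 19.5, as calculated above
```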
Table 6.10: Annotated sample from Table 5.6 with animacy rank and position (cf. Table 5.7)
Anim. Pos. Type No. Anim. Pos. Type No.
1 8.5 s (a 2) 4 25 of (b 13)
1 8.5 s (a 3) 5 28.5 s (a 11)
1 8.5 s (a 7) 5 28.5 of (b 5)
1 8.5 s (a 8) 5 28.5 of (b 11)
1 8.5 s (a 9) 5 28.5 of (b 16)
1 8.5 s (a 10) 5 28.5 of (b 17)
1 8.5 s (a 12) 5 28.5 of (b 18)
1 8.5 s (a 15) 6 32.5 s (a 4)
1 8.5 s (a 17) 6 32.5 of (b 9)
1 8.5 s (a 18) 7 34.5 of (b 3)
1 8.5 s (a 19) 7 34.5 of (b 14)
1 8.5 s (a 20) 8 36.0 of (b 1)
1 8.5 s (a 21) 9 37.0 of (b 12)
1 8.5 s (a 23) 10 39.5 of (b 6)
1 8.5 of (b 4) 10 39.5 of (b 8)
1 8.5 of (b 7) 10 39.5 of (b 10)
2 19.5 s (a 1) 10 39.5 of (b 15)
2 19.5 s (a 5)
2 19.5 s (a 6)
2 19.5 s (a 13)
2 19.5 s (a 14)
2 19.5 of (b 2)
3 23.5 s (a 16)
3 23.5 s (a 22)
Once we have determined the rank position of each data point, we separate
them into two subsamples corresponding to the values of the nominal variable
Type of Possessive again, as in Table 6.11. We then calculate the rank sum R for each
group, which is simply the sum of their rank positions, and we count the number
of data points N in each group.
Table 6.11: Animacy ranks and positions and rank sums for the sample
of possessives
s-possessive of-possessive
Anim. Pos. Type Example Anim. Pos. Type Example
2 19.5 s (a 1) 8 36.0 of (b 1)
1 8.5 s (a 2) 2 19.5 of (b 2)
1 8.5 s (a 3) 7 34.5 of (b 3)
6 32.5 s (a 4) 1 8.5 of (b 4)
2 19.5 s (a 5) 5 28.5 of (b 5)
2 19.5 s (a 6) 10 39.5 of (b 6)
1 8.5 s (a 7) 1 8.5 of (b 7)
1 8.5 s (a 8) 10 39.5 of (b 8)
1 8.5 s (a 9) 6 32.5 of (b 9)
1 8.5 s (a 10) 10 39.5 of (b 10)
5 28.5 s (a 11) 5 28.5 of (b 11)
1 8.5 s (a 12) 9 37.0 of (b 12)
2 19.5 s (a 13) 4 25.0 of (b 13)
2 19.5 s (a 14) 7 34.5 of (b 14)
1 8.5 s (a 15) 10 39.5 of (b 15)
3 23.5 s (a 16) 5 28.5 of (b 16)
1 8.5 s (a 17) 5 28.5 of (b 17)
1 8.5 s (a 18) 5 28.5 of (b 18)
1 8.5 s (a 19)
1 8.5 s (a 20)
1 8.5 s (a 21)
3 23.5 s (a 22)
1 8.5 s (a 23)
R 324.5 R 536.5
N 23 N 18
The rank sum and the number of data points for each sample allow us to calculate the U values for both groups using the following simple formulas:
(10) a. $U_1 = (N_1 \times N_2) + \frac{N_1 \times (N_1 + 1)}{2} - R_1$
     b. $U_2 = (N_1 \times N_2) + \frac{N_2 \times (N_2 + 1)}{2} - R_2$
Applying these formulas to the measures for the s-possessive (10a) and of -
possessive (10b) respectively, we get the U values
$U_1 = (23 \times 18) + \frac{23 \times (23 + 1)}{2} - 324.5 = 365.5$
and
$U_2 = (23 \times 18) + \frac{18 \times (18 + 1)}{2} - 536.5 = 48.5$
The U value for the entire data set is always the smaller of the two U values.
In our case this is 𝑈2 , so our U value is 48.5. This value can now be compared
against its known distribution in the same way as the 𝜒 2 value for nominal data.
In our case, this means looking it up in the table in Section 14.3 in the Statistical
Tables at the end of this book, which tells us that the 𝑝-value for this U value is
smaller than 0.001 – the difference between the s- and the of -possessive is, again,
highly significant. The Mann-Whitney U test may be reported as follows:
(11) Format for reporting the results of a Mann-Whitney test
(𝑈 = [𝑈 value], 𝑁1 = [𝑁1 ], 𝑁2 = [𝑁2 ], 𝑝 < (or >) [sig. level]).
Thus, we could report the results of this case study as follows: “This study
has shown that s-possessives are preferred when the modifier is high in animacy,
while of -possessives are preferred when the modifier is low in animacy. A Mann-
Whitney test shows that the differences between the constructions are highly
significant (𝑈 = 48.5, 𝑁1 = 18, 𝑁2 = 23, 𝑝 < 0.001)”.
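The same result can be obtained with scipy's implementation of the test; here is a sketch (note that scipy reports the U value of whichever sample is passed first, so the smaller of the two U values has to be taken explicitly):

```python
from scipy.stats import mannwhitneyu

# animacy values from Table 6.11
s_poss = [2, 1, 1, 6, 2, 2, 1, 1, 1, 1, 5, 1, 2, 2, 1, 3, 1, 1, 1, 1, 1, 3, 1]
of_poss = [8, 2, 7, 1, 5, 10, 1, 10, 6, 10, 5, 9, 4, 7, 10, 5, 5, 5]

u, p = mannwhitneyu(s_poss, of_poss, alternative="two-sided")
u = min(u, len(s_poss) * len(of_poss) - u)  # smaller of U1 and U2
print(u, p)                                 # 48.5, p < 0.001
```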
6.5 Inferential statistics for cardinal data
(12) H1 : The s-possessive will be used with short modifiers, the of-possessive will be used with long modifiers.
Prediction: The mean Length (in “number of words”) of modifiers of the s-possessive should be smaller than that of the modifiers of the of-possessive.
Table 6.12 shows the length in number of words for the modifiers of the s-
and of -possessives (as already reported in Table 5.9), together with a number of
additional pieces of information that we will turn to next.
First, note that one case that was still included in Table 5.9 is missing: Example (b 19) from that table, which had a modifier of length 20. This is treated here as a so-called outlier, i.e., a value that is so far away from the mean that it can be considered an exception. There are differing opinions, which we will not discuss here, on whether and when outliers should be removed, but for expository reasons alone it is reasonable to remove this one (and it would not have made a difference to our results if we had kept it).
In order to calculate Welch’s t-test, we determine three values on the basis of
our measurements of Length: the number of measurements N, the mean length
for each group (𝑥̄), and a value called “sample variance” (𝑠 2 ). The number of
measurements is easy to determine – we just count the cases in each group: 20
s-possessives and 24 of -possessives. We already calculated the mean lengths in
Chapter 5: for the s-possessive, the mean length is 1.9 words, for the of -possessive
it is 3.83 words. As we already discussed in Chapter 5, this difference conforms
to our hypothesis: s-possessives are, on average, shorter than of -possessives.
The question is, again, how likely it is that this difference is due to chance.
When comparing group means, the crucial question we must ask in order to
determine this is how large the variation is within each group of measurements:
put simply, the more widely the measurements within each group vary, the more
likely it is that the differences across groups have come about by chance.
Table 6.12: Modifier lengths for the s- and of-possessives (cf. Table 5.9)
s-possessive of-possessive
No. Length (𝑥 − 𝑥̄) (𝑥 − 𝑥̄)² No. Length (𝑥 − 𝑥̄) (𝑥 − 𝑥̄)²
(a 1) 2 0.10 0.01 (b 1) 3 −0.83 0.69
(a 2) 2 0.10 0.01 (b 2) 5 1.17 1.36
(a 3) 2 0.10 0.01 (b 3) 4 0.17 0.03
(a 4) 2 0.10 0.01 (b 4) 8 4.17 17.36
(a 5) 3 1.10 1.21 (b 5) 7 3.17 10.03
(a 6) 1 −0.90 0.81 (b 6) 1 −2.83 8.03
(a 7) 2 0.10 0.01 (b 7) 9 5.17 26.69
(a 8) 2 0.10 0.01 (b 8) 2 −1.83 3.36
(a 9) 2 0.10 0.01 (b 9) 5 1.17 1.36
(a 10) 2 0.10 0.01 (b 10) 6 2.17 4.69
(a 11) 1 −0.90 0.81 (b 11) 2 −1.83 3.36
(a 12) 2 0.10 0.01 (b 12) 2 −1.83 3.36
(a 13) 1 −0.90 0.81 (b 13) 1 −2.83 8.03
(a 14) 3 1.10 1.21 (b 14) 8 4.17 17.36
(a 15) 2 0.10 0.01 (b 15) 5 1.17 1.36
(a 16) 1 −0.90 0.81 (b 16) 2 −1.83 3.36
(a 17) 2 0.10 0.01 (b 17) 2 −1.83 3.36
(a 18) 2 0.10 0.01 (b 19) 2 −1.83 3.36
(a 19) 2 0.10 0.01 (b 20) 1 −2.83 8.03
(a 20) 2 0.10 0.01 (b 21) 2 −1.83 3.36
(b 22) 8 4.17 17.36
(b 23) 3 −0.83 0.69
(b 24) 2 −1.83 3.36
(b 25) 2 −1.83 3.36
N 20 N 24
Total 58 0 Total 92 0
𝑥̄ 1.9 𝑥̄ 3.8333
𝑠2 0.3053 𝑠2 6.67
The first step in assessing the variation consists in determining, for each measurement, how far away it is from its group mean. Thus, we simply subtract the group mean of 1.9 from each measurement for the s-possessive, and the group mean of 3.83 from each measurement for the of-possessive. The results are shown
in the third column of each sub-table in Table 6.12. However, we do not want
to know how much each single measurement deviates from the mean, but how
far the group s-possessive or of-possessive as a whole varies around the mean.
Obviously, adding up all individual values is not going to be helpful: as in the
case of observed and expected frequencies of nominal data, the result would al-
ways be zero. So we use the same trick we used there, and calculate the square
of each value – making them all positive and weighting larger deviations more
heavily. The results of this are shown in the fourth column of each sub-table. We
then calculate the mean of these values for each group, but instead of adding up
all values and dividing them by the number of cases, we add them up and divide
them by the total number of cases minus one. This is referred to as the sample
variance:
(14) $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
The sample variances themselves cannot be very easily interpreted (see further below), but we can use them to calculate our test statistic, the t-value, using the following formula (𝑥̄ stands for the group mean, 𝑠 2 stands for the sample variance, and 𝑁 stands for the number of cases; the subscripts 1 and 2 indicate the two sub-samples):
(15) $t_\mathrm{Welch} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}}$
Note that this formula assumes that the measures with the subscript 1 are from
the larger of the two samples (if we don’t pay attention to this, however, all that
happens is that we get a negative t-value, whose negative sign we can simply
ignore). In our case, the sample of of -possessives is the larger one, giving us:
$t_\mathrm{Welch} = \frac{3.8333 - 1.9}{\sqrt{\frac{6.6667}{24} + \frac{0.3053}{20}}} = 3.5714$
As should be familiar by now, we compare this t-value against its distribution
to determine the probability of error (i.e., we look it up in the table in Section 14.4
in the Statistical Tables at the end of this book). Before we can do so, however, we
need to determine the degrees of freedom of our sample. This is done using the
following formula:
(16) $\mathrm{df} \approx \frac{\left(\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}\right)^2}{\frac{s_1^4}{N_1^2\,\mathrm{df}_1} + \frac{s_2^4}{N_2^2\,\mathrm{df}_2}}$
Again, the subscripts indicate the sub-samples, 𝑠 2 is the sample variance, and 𝑁 is the number of items; the degrees of freedom for the two groups (df1 and df2 ) are defined as 𝑁 − 1. If we apply the formula to our data, we get the following:
$\mathrm{df} \approx \frac{\left(\frac{6.6667}{24} + \frac{0.3053}{20}\right)^2}{\frac{6.6667^2}{24^2 \times (24-1)} + \frac{0.3053^2}{20^2 \times (20-1)}} = 25.5038$
As we can see in the table of critical values, our 𝑡-value exceeds the critical value at 25 degrees of freedom for a significance level of 0.01; in other words, 𝑝 is smaller than 0.01. A t-test should be reported in the following format:
(17) Format for reporting the results of a t test
(𝑡([deg. freedom]) = [𝑡 value], 𝑝 < (or >) [sig. level]).
Thus, a straightforward way of reporting our results would be something like
this: “This study has shown that for modifiers that are realized by lexical NPs,
s-possessives are preferred when the modifier is short, while of -possessives are
preferred when the modifier is long. The difference between the constructions is
very significant (𝑡(25.50) = 3.5714, 𝑝 < 0.01)”.
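Here is a sketch of the same calculation with scipy; passing equal_var=False selects Welch's variant of the t-test:

```python
from scipy.stats import ttest_ind

# modifier lengths from Table 6.12
s_poss = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
of_poss = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1, 8, 5, 2, 2, 2, 1, 2, 8, 3, 2, 2]

t, p = ttest_ind(of_poss, s_poss, equal_var=False)  # Welch's t-test
print(t, p)  # t = 3.5714, p < 0.01 (two-sided)
```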
As pointed out above, the value for the sample variance does not, in itself, tell
us very much. We can convert it into something called the sample standard devia-
tion, however, by taking its square root. The standard deviation is an indicator of
the amount of variation in a sample (or sub-sample) that is frequently reported;
it is good practice to report standard deviations whenever we report means.
Finally, note that, again, the significance level does not tell us anything about
the size of the effect, so we should calculate an effect size separately. The most
widely-used effect size for data analyzed with a t-test is Cohen’s d, also referred
to as the standardized mean difference. There are several ways to calculate it;
the simplest one is the following, where 𝜎 is the standard deviation of the entire
sample:
(18) $d = \frac{\bar{x}_1 - \bar{x}_2}{\sigma}$
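A minimal sketch of this calculation, taking 𝜎 to be the standard deviation of all 44 measurements pooled together, as the formula in (18) suggests:

```python
import statistics

s_poss = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
of_poss = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1, 8, 5, 2, 2, 2, 1, 2, 8, 3, 2, 2]

sigma = statistics.stdev(s_poss + of_poss)  # sd of the entire sample
d = (statistics.mean(of_poss) - statistics.mean(s_poss)) / sigma
print(d)  # roughly 0.9, a large effect by the usual conventions
```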
[Figure 6.1: three frequency-distribution plots (panels a–c)]
more closely. However, this does not work in all cases (it would not, for example, bring the distribution in Figure 6.1c much closer to a normal distribution), and anyway, transforming data carries its own set of problems.
Thus, third, and most recommendably, we could try to find a way around hav-
ing to use a t-test in the first place. One way of avoiding a t-test is to treat our
non-normally distributed cardinal data as ordinal data, as described in Chapter 5.
We can then use the Mann-Whitney U -test, which does not require a normal dis-
tribution of the data. I leave it as an exercise to the reader to apply this test to
the data in Table 6.12 (you know you have succeeded if your result for U is 137,
𝑝 < 0.01).
Another way of avoiding the t-test is to find an operationalization of the phe-
nomenon under investigation that yields rank data, or, even better, nominal data
in the first place. We could, for example, code the data in Table 6.12 in terms of
a very simple nominal variable: Longer Constituent (with the values head and modifier). For each case, we simply determine whether the head is longer than the modifier (in which case we assign the value head) or whether the modifier is longer than the head (in which case we assign the value modifier); we discard all cases where the two have the same length. This gives us Table 6.13.
Table 6.13: The influence of length on the choice between the two pos-
sessives
Possessive
s-possessive of-possessive Total
Longer head 8 5 13
Constituent (6.3) (6.7)
modifier 8 12 20
(9.7) (10.3)
Total 16 17 33
The 𝜒 2 value for this table is 0.7281, which at one degree of freedom means
that the p value is larger than 0.05, so we would have to conclude that there
is no influence of length on the choice of possessive construction. However, the
deviations of the observed from the expected frequencies go in the right direction,
so this may simply be due to the fact that our sample is too small (obviously, a
serious corpus-linguistic study would not be based on just 33 cases).
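Here is a sketch of this test in Python; note that scipy's chi2_contingency applies Yates' continuity correction to two-by-two tables by default, which matches the 𝜒 2 value reported above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed frequencies from Table 6.13
observed = np.array([
    [8, 5],    # head longer:     s-possessive / of-possessive
    [8, 12],   # modifier longer: s-possessive / of-possessive
])
chi2_value, p, df, expected = chi2_contingency(observed)  # Yates-corrected
print(chi2_value, p)  # 0.7281, p > 0.05
```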
The normal-distribution requirement is only one of several requirements that
our data set must meet in order for particular statistical methods to be applicable.
For example, many procedures for comparing group means – including the more
widely-used Student 𝑡-test – can only be applied if the two groups have the same
variance (roughly, if the measurements in both groups are spread out from the
group means to the same extent), and there are tests that can tell us whether this is the case (for example,
the F test). Also, it makes a difference whether the two groups that we are com-
paring are independent of each other (as in the case studies presented here), or
if they are dependent in that there is a correspondence between measures in the
two groups. For example, if we wanted to compare the length of heads and mod-
ifiers in the s-possessive, we would have two groups that are dependent in that
for any data point in one of the groups there is a corresponding data point in the
other group that comes from the same corpus example. In this case, we would
use a paired test (for example, the matched-pairs Wilcoxon test for ordinal data
and Student’s paired 𝑡-test for cardinal data).
6.6 Complex research designs
a proper name or a noun with a possessive clitic. Given that the proportion of
pronouns and nouns in general varies across language varieties (Biber et al. 1999),
we might be interested to see whether the same is true for these three variants of
the s-possessive. Our dependent variable Modifier of s-Possessive would then
have three values. The independent variable Variety, being heavily dependent
on the model of language varieties we adopt – or rather, on the nature of the text
categories included in our corpus – has an indefinite number of values. To keep
things simple, let us distinguish just four broad text categories recognized in the
British National Corpus (and many other corpora): spoken, fiction, newspaper
and academic. This gives us a four-by-three design.
Searching the BNC Baby for words tagged as possessive pronouns and for
words tagged unambiguously as proper names or common nouns yields the ob-
served frequencies shown in the first line of each cell in Table 6.14.
Table 6.14: Types of modifiers in the s-possessive in different text cate-
gories
Possessive Modifier
pronoun proper name noun Total
Variety spoken Obs. 9593 768 604 10 965
Exp. 8378.38 1361.04 1225.58
𝜒 2 -Comp. 176.08 258.40 315.25
fiction Obs. 23 755 2681 1998 28 434
Exp. 21 726.49 3529.39 3178.12
𝜒 2 -Comp. 189.39 203.94 438.21
news Obs. 12 857 4070 3585 20 512
Exp. 15 673.27 2546.07 2292.66
𝜒 2 -Comp. 506.04 912.14 728.47
academic Obs. 8533 1373 1820 11 726
Exp. 8959.86 1455.50 1310.64
𝜒 2 -Comp. 20.34 4.68 197.96
Total 54 738 8892 8007 71 637
The expected frequencies and the 𝜒 2 components are arrived at in the same way as for the two-by-two tables earlier in this chapter. First, for each cell, the
sum of the column in which the cell is located is multiplied by the sum of the row
in which it is located, and the result is divided by the table sum. For example, for
the top left cell, we get the expected frequency
$\frac{54738 \times 10965}{71637} = 8378.38,$
the expected frequencies are shown in the second line of each cell. Next, for
each cell, we calculate the 𝜒 2 component. For example, for the top left cell, we
get
$\frac{(9593 - 8378.38)^2}{8378.38} = 176.08,$
the corresponding values are shown in the third line of each cell. Adding up
the individual 𝜒 2 components gives us a 𝜒 2 value of 3950.89.
Using the formula given in (5) above, Table 6.14 has (4 − 1) × (3 − 1) = 6 degrees of freedom. As the 𝜒 2 table in Section 14.1 shows, the required value for a significance level of 0.001 at 6 degrees of freedom is 22.46; the 𝜒 2 value for Table 6.14 is much higher than this; thus, our results are highly significant. We could summarize our findings as follows: “The frequency of pronouns, proper names and nouns as modifiers of the s-possessive differs highly significantly across text categories (𝜒 2 = 3950.89, df = 6, 𝑝 < 0.001)”.
Recall that the mere fact of a significant association does not tell us anything
about the strength of that association – we need a measure of effect size. Earlier in this chapter, 𝜙 was introduced as an effect size for two-by-two tables (see (7) above). For larger tables, there is a generalized version of 𝜙, referred to as Cramer’s
V (or, occasionally, as Cramer’s 𝜙 or 𝜙 ′ ), which is calculated as follows (N is the
table sum, k is the number of rows or columns, whichever is smaller):
(20) $\text{Cramer's } V = \sqrt{\frac{\chi^2}{N \times (k - 1)}}$
For our table, this gives us:
$\sqrt{\frac{3950.89}{71637 \times (3 - 1)}} = 0.1661$
Recall that the square of a correlation coefficient tells us the proportion of the variance captured by our design, which, in this case, is 0.0276. In other words, Variety explains less than three percent of the distribution of s-possessor modifier types across language varieties; or: “This study has shown a very weak but highly significant influence of language variety on the realization of s-possessor modifiers as pronouns, proper names or common nouns (𝜒 2 = 3950.89, df = 6, 𝑝 < 0.001, Cramer’s 𝑉 = 0.1661).”
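The entire analysis can be reproduced in a few lines; here is a sketch using numpy and scipy with the observed frequencies of Table 6.14:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [9593,   768,  604],   # spoken
    [23755, 2681, 1998],   # fiction
    [12857, 4070, 3585],   # news
    [8533,  1373, 1820],   # academic
])
chi2_value, p, df, expected = chi2_contingency(observed)
v = np.sqrt(chi2_value / (observed.sum() * (min(observed.shape) - 1)))
print(chi2_value, df, p)  # 3950.89, 6, p < 0.001
print(v, v ** 2)          # 0.1661 and 0.0276
```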
Despite the weakness of the effect, this result confirms our expectation that general preferences for pronominal vs. nominal reference across language varieties are also reflected in preferences for types of modifiers in the s-possessive.
However, with the increased size of the contingency table, it becomes more dif-
ficult to determine exactly where the effect is coming from. More precisely, it is
no longer obvious at a glance which of the intersections of our two variables con-
tribute to the overall significance of the result in what way and to what extent.
To determine in what way a particular intersection contributes to the overall
result, we need to compare the observed and expected frequencies in each cell.
For example, there are 9593 cases of s-possessives with pronominal modifiers in
spoken language, where 8378.38 are expected, showing that pronominal modi-
fiers are more frequent in spoken language than expected by chance. In contrast,
there are 8533 such modifiers in academic language, where 8959.86 are expected,
showing that they are less frequent in academic language than expected by chance.
This comparison is no different from that which we make for two-by-two tables,
but with increasing degrees of freedom, the pattern becomes less predictable. It
would be useful to visualize the relation between observed and expected frequen-
cies for the entire table in a way that would allow us to take them in at a glance.
To determine to what extent a particular intersection contributes to the overall
result, we need to look at the size of the 𝜒 2 components – the larger the compo-
nent, the greater its contribution to the overall 𝜒 2 value. In fact, we can do more
than simply compare the 𝜒 2 components to each other – we can determine for
each component, whether it, in itself, is statistically significant. In order to do so,
we first imagine that the large contingency table (in our case, the 4-by-3 table)
consists of a series of tables with a single cell each, each containing the result for
a single intersection of our variables.
We now treat the 𝜒 2 component as a 𝜒 2 value in its own right, checking it for
statistical significance in the same way as the overall 𝜒 2 value. In order to do so,
we first need to determine the degrees of freedom for our one-cell tables – obvi-
ously, this can only be 1. Checking the table of critical 𝜒 2 values in Section 14.1,
we find, for example, that the 𝜒 2 component for the intersection pronoun ∩ spo-
ken, which is 176.08, is higher than the critical value 10.83, suggesting that this
intersection’s contribution is significant at 𝑝 < 0.001.
However, matters are slightly more complex: by looking at each intersection
separately, we are essentially treating each cell as an independent result – in our
case, it is as if we had performed twelve tests instead of just one. Now, recall
that levels of significance are based on probabilities of error – for example, 𝑝 =
0.05 means, roughly, that there is a five percent likelihood that a result is due to
chance. Obviously, the more tests we perform, the more likely it becomes that one
of the results will, indeed, be due to chance – for example, if we had performed
twenty tests, we would expect one of them to yield a significant result at the
5-percent level, because 20 × 0.05 = 1.00.
To avoid this situation, we have to correct the levels of significance when per-
forming multiple tests on the same set of data. The simplest way of doing so is
the so-called Bonferroni correction, which consists in dividing the convention-
ally agreed-upon significance levels by the number of tests we are performing. In
the case of Table 6.14, this means dividing them by twelve, giving us significance
levels of 0.05/12 = 0.004167 (significant), 0.01/12 = 0.000833 (very significant), and
0.001/12 = 0.000083 (highly significant).6 Our table does not give the critical 𝜒 2 -
values for these levels, but the value for the intersection pronoun ∩ spoken,
176.08, is larger than the value required for the next smaller level (0.00001, with
a critical value of 24.28), so we can be certain that the contribution of this inter-
section is, indeed, highly significant. Again, it would be useful to summarize the
degrees of significance in such a way that they can be assessed at a glance.
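Here is a sketch of this procedure in Python: the upper-tail probability of a single 𝜒 2 component at one degree of freedom is computed and compared against the Bonferroni-corrected levels:

```python
from scipy.stats import chi2

component = 176.08            # pronoun ∩ spoken, from Table 6.14
p = chi2.sf(component, df=1)  # survival function = upper-tail p-value
levels = [alpha / 12 for alpha in (0.05, 0.01, 0.001)]  # twelve tests
print(p, [p < level for level in levels])  # True at all three levels
```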
There is no standard way of representing the way in, and degree to, which each
cell of a complex table contributes to the overall result, but the representation in
Table 6.15 seems reasonable: in each cell, the first line contains either a plus (for
“more frequent than expected”) or a minus (for “less frequent than expected”);
the second line contains the 𝜒 2 -component, and the third line contains the (cor-
rected) level of significance (using the standard convention of representing them
by asterisks – one for each level of significance).
This table presents the complex results at a single glance; they can now be
interpreted. Some patterns now become obvious: For example, spoken language
and fiction are most similar to each other – they both favor pronominal modifiers,
while proper names and common nouns are disfavored, and the 𝜒 2 -components
for these preferences are very similar. Also, if we posit a kind of gradient of
referent familiarity from pronouns over proper names to nouns, we can place
spoken language and fiction at one end, academic language at the other, and
newspaper language somewhere in the middle.
Table 6.15: Direction, 𝜒 2 -components and corrected significance levels of the intersections in Table 6.14
Possessive Modifier
pronoun proper name noun
Variety spoken + − −
176.08 258.40 315.25
***** ***** *****
fiction + − −
189.39 203.94 438.21
***** ***** *****
news − + +
506.04 912.14 728.47
***** ***** *****
academic − − +
20.34 4.68 197.96
**** n.s. *****
Domain (in our design with the values news (recounting of actual events), fiction (recounting of imaginary events) and academic (recounting of scientific ideas, procedures and results)). These two variables are indepen-
dent in that there is both written and spoken language to be found in each of
these discourse domains. They are conflated in our variable Variety in that one
of the four values is spoken and the other three are written language, and in that
the text category spoken is not differentiated by topic. There may be reasons to
ignore this conflation a priori, as we have done – for example, our model may ex-
plicitly assume that differentiation by topic happens only in the written domain.
But even then, it would be useful to treat Medium and Discourse Domain as
independent variables, just in case our model is wrong in assuming this.
In contrast to all examples of research designs we have discussed so far, which
involved just two variables and were thus bivariate, this design would be multi-
variate: there is more than one independent variable whose influence on the
dependent variable we wish to assess. Such multivariate research designs are of-
ten useful (or even necessary) even in cases where the variables in our design
are not conflations of more basic variables.
In the study of language use, we will often – perhaps even typically – be con-
fronted with a fragment of reality that is too complex to model in terms of just
two variables.
In some cases, this may be obvious from the outset: we may suspect from
previous research that a particular linguistic phenomenon depends on a range of
factors, as in the case of the choice between the s- and the of -possessive, which
we saw in the preceding chapters had long been hypothesized to be influenced
by the animacy, the length and/or the givenness of the modifier.
In other cases, the multivariate nature of the phenomenon under investigation
may emerge in the course of pursuing an initially bivariate design. For example,
we may find that the independent variable under investigation has a statistically
significant influence on our dependent variable, but that the effect size is very
small, suggesting that the distribution of the phenomenon in our sample is con-
ditioned by more than one influencing factor.
Even if we are pursuing a well-motivated bivariate research design and find
a significant influence with a strong effect size, it may be useful to take addi-
tional potential influencing factors into account: since corpus data are typically
unbalanced, there may be hidden correlations between the variable under inves-
tigation and other variables that distort the distribution of the phenomenon in a
way that suggests a significant influence where no such influence actually exists.
The next subsection will use the latter case to demonstrate the potential short-
comings of bivariate designs and the subsection following it will present a so-
lution. Note that this solution is considerably more complex than the statistical
procedures we have looked at so far and while it will be presented in sufficient
detail to enable the reader in principle to apply it themselves, some additional
reading will be highly advisable.
Let us take a look at the influence of Sex and Age on the choice between the
two possessives in the spoken part of the BNC. Since it is known that women tend
to use pronouns more than men do (see Case Study 10.2.3.1 in Chapter 10), let us
exclude possessive pronouns and operationalize the s-possessive as “all tokens
tagged POS in the BNC”, which will capture the possessive clitic ’s and zero pos-
sessives (on common nouns ending in alveolar fricatives). Since the spoken part
of the BNC is too large to identify of -possessives manually, let us operationalize
them somewhat crudely as “all uses of the preposition of ”; this encompasses not
just of -possessives, but also the quantifying and partitive of -constructions that
we manually excluded in the preceding chapters, the complementation of adjec-
tives like aware and afraid, verbs like consist and dispose, etc. On the one hand,
this makes our case study less precise, on the other hand, any preference for of -
constructions may just be a reflex of a general preference for the preposition of, in
which case we would be excluding relevant data by focusing on of -constructions.
Anyway, our main point will be one concerning statistical methodology, so it
does not matter too much either way.
So, let us query all tokens tagged as possessives (POS) or the preposition of
(PRF) in the spoken part of the BNC, discarding all hits for which the information
about speaker sex or speaker age is missing. Let us further exclude the age range
0–14, as it may include children who have not fully acquired the grammar of the
language, and the age range 60+ as too unspecific. To keep the design simple, let
us recode all age classes between 15 and 44 years of age as young and the age
range 45–59 as old (I fall into the latter, just in case someone thinks this category label discriminates against people in their prime). Let us further accept the categorization
of speakers into male and female that the makers of the BNC provide.
Table 6.16 shows the intersections of Construction and Sex in the results of
this query.
Unlike the studies mentioned above, we find a clear influence of Sex on Con-
struction, with female speakers preferring the s-possessive and male speakers
preferring the of -construction(s). The difference is highly significant, although
the effect size is rather weak (𝜒 2 = 773.55, df = 1, 𝑝 < 0.001, 𝜙 = 0.1061).
Next, let us look at the intersections of Construction and Age in the results of our query, which are shown in Table 6.17.
Like previous studies, we find a significant effect of age, with younger speakers
preferring the s-possessive and older speakers preferring the of -construction(s).
Again, the difference is highly significant, but the effect is extremely weak (𝜒 2 =
58.73, df = 1, 𝑝 < 0.001, 𝜙 = 0.02922).
We might now be satisfied that both speaker age and speaker sex have an influ-
ence on the choice between the two constructions. However, there is a potential
problem that we need to take into account: the values of the variables Sex and
Table 6.16: The influence of Sex on the choice between the two posses-
sives
Construction
pos of Total
Sex female 3483 20 419 23 902
(2432.89) (21 469.11)
male 3515 41 335 44 850
(4565.11) (40 284.89)
Total 6998 61 754 68 752
Table 6.17: The influence of Age on the choice between the two posses-
sives
Construction
pos of Total
Age old 2450 24 535 26 985
(2746.70) (24 238.30)
young 4548 37 219 41 767
(4251.30) (37 515.70)
Total 6998 61 754 68 752
Age and their intersections are not necessarily distributed evenly in the subpart
of the BNC used here; although the makers of the corpus were careful to include
a broad range of speakers of all ages, sexes (and class memberships, ignored in
our study), they did not attempt to balance all these demographic variables, let
alone their intersections. So let us look at the intersection of Sex and Age in the
results of our query. These are shown in Table 6.18.
There are significantly fewer hits produced by old women and significantly
more produced by young women in our sample, and, conversely, significantly
fewer hits produced by young men and significantly more produced by old men.
This overrepresentation of young women and old men is not limited to our sam-
ple, but characterizes the spoken part of the BNC in general, which should in-
trigue feminists and psychoanalysts; for us, it suffices to know that the asymme-
tries in our sample are highly significant, with an effect size larger than those in the preceding two tables (𝜒 2 = 2142.72, df = 1, 𝑝 < 0.001, 𝜙 = 0.1765).
Table 6.18: The intersections of Sex and Age in the sample
Age
old young Total
Sex female 6559 17 343 23 902
(9381.48) (14 520.52)
male 20 426 24 424 44 850
(17 603.52) (27 246.48)
Total 26 985 41 767 68 752
This correlation in the corpus of old and male on the one hand and young
and female on the other may well be enough to distort the results such that a lin-
guistic behavior typical for female speakers may be wrongly attributed to young
speakers (or vice versa), and, correspondingly, a linguistic behavior typical for
male speakers may be wrongly attributed to old speakers (or vice versa). More
generally, the danger of bivariate designs is that a variable we have chosen for
investigation is correlated with one or more variables ignored in our research
design, whose influence thus remains hidden. A very general precaution against
this possibility is to make sure that the corpus (or our sample) is balanced with re-
spect to all potentially confounding variables. In reality, this is difficult to achieve
and may in fact be undesirable, since we might, for example, want our corpus (or
sample) to reflect the real-world correlation of speaker variables.
Therefore, we need a way of including multiple independent variables in our
research designs even if we are just interested in a single independent variable,
but all the more so if we are interested in the influence of several independent
variables. It may be the case, for example, that both Sex and Age influence the
choice between ’s and of, either in that the two effects add up, or in that they
interact in more complex ways.
One method that allows us to do this is configural frequency analysis (CFA), a procedure for evaluating contingency tables of more than two nominal variables. This method has been used in psychology and psy-
chiatry since the 1970s, and while it has never become very wide-spread, it has, in
my opinion, a number of didactic advantages over other methods, when it comes
to understanding multivariate research designs. Most importantly, it is conceptu-
ally very simple (if you understand the 𝜒 2 test, you should be able to understand
CFA), and the results are very transparent (they are presented as observed and
expected frequencies of intersections of variables).
This does not mean that CFA is useful only as a didactic tool – it has been
applied fruitfully to linguistic research issues, for example, in the study of lan-
guage disorders (Lautsch et al. 1988), educational linguistics (Fujioka & Kennedy
1997), psycholinguistics (Hsu et al. 2000) and social psychology (Christmann et al.
2000). An early suggestion to apply it to corpus data is found in Schmilz (1983),
but the first actual such applications that I am aware of are Gries (2002; 2004).
Since Gries introduced the method to corpus linguistics, it has become a mi-
nor but nevertheless well-established corpus-linguistic research tool for a range
of linguistic phenomena (see, e.g., Stefanowitsch & Gries 2005; 2008; Liu 2010;
Goschler et al. 2013; Hoffmann 2014; Hilpert 2015; and others).
As hinted at above, in its simplest variant, configural frequency analysis is
simply a 𝜒 2 test on a contingency table with more than two dimensions. There is
no logical limit to the number of dimensions, but if we insist on calculating this
statistic manually (rather than, more realistically, letting a specialized software
package do it for us), then a three-dimensional table is already quite complex
to deal with. Thus, we will not go beyond three dimensions here or in the case
studies in the second part of this book.
A three-dimensional contingency table would have the form of a cube, as
shown in Figure 6.2 (the smaller cube represents the cells on the far side of the big cube seen from the same perspective, and the smallest cube represents the cell in the middle of the whole cube). As before, cells are labeled by subscripts:
the first subscript stands for the values and totals of the dependent variable, the
second for those of the first independent variable, and the third for those of the
second independent variable.
While this kind of visualization is quite useful in grasping the notion of a
three-dimensional contingency table, it would be awkward to use it as a basis
for recording observed frequencies or calculating the expected frequencies. Thus,
a possible two-dimensional representation is shown in Table 6.19. In this table,
the first independent variable is shown in the rows, and the second independent
variable is shown in the three blocks of three columns (these may be thought of
as three “slices” of the cube in Figure 6.2), and the dependent variable is shown
in the columns themselves.
[Figure 6.2: a three-dimensional contingency table visualized as a cube; the cells are labeled O111 through O222 , with the subscript T marking marginal totals, along the dimensions Dependent Variable, Independent Variable A and Independent Variable B]
Given this representation, the expected frequencies for each intersection of the
three variables can now be calculated in a way that is similar (but not identical)
to that used for two-dimensional tables. Table 6.20 shows the formulas for each
cell as well as those marginal sums needed for the calculation.
Table 6.20: Calculating expected frequencies in a three-dimensional
contingency table
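Since the body of Table 6.20 is not reproduced here, the following Python sketch states the principle for a single cell: under the assumption of complete independence, the expected frequency of an intersection is the product of its three marginal totals divided by the square of the table total. The marginal totals are those of the possessives example discussed below:

```python
# marginal totals from the possessives example below
n_pos, n_female, n_young, n = 6998, 23902, 41767, 68752

expected = n_pos * n_female * n_young / n**2  # E = (Ni.. * N.j. * N..k) / N^2
observed = 2548                               # pos ∩ female ∩ young
print(expected)                               # roughly 1477.99
print((observed - expected)**2 / expected)    # chi-square component: 774.65
```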
Table 6.21: Observed frequencies, expected frequencies, 𝜒 2 -components and types of the intersections of Construction, Sex and Age
female male
Constr. young old Total young old Total Total
pos Obs.: 2548 Obs.: 935 3483 Obs.: 2000 Obs.: 1515 3515 6998
Exp.: 1477.99 Exp.: 954.90 Exp.: 2773.31 Exp.: 1791.79
𝜒 2: 774.65 𝜒 2: 0.41 𝜒 2: 215.63 𝜒 2: 42.76
p: *** p: n.s. p: *** p: ***
Type: + Type: – Type: – Type: –
of Obs.: 14 795 Obs.: 5624 20419 Obs.: 22 424 Obs.: 18 911 41335 61754
Exp.: 13 042.53 Exp.: 8426.57 Exp.: 24 473.17 Exp.: 15 811.73
𝜒 2: 235.47 𝜒 2: 932.10 𝜒 2: 171.58 𝜒 2: 607.49
p: *** p: *** p: *** p: ***
Type: + Type: – Type: – Type: +
Total 17 343 6559 23 902 24 424 20 426 44 850 68752
Table 6.22: The influence of Sex and Age on the choice of construction in four sub-samples

(a) old speakers
                  Construction
                  pos    of       Total
Sex  female       935    5624     6559
                  (595.50) (5963.50)
     male         1515   18 911   20 426
                  (1854.50) (18 571.50)
Total             2450   24 535   26 985

(b) young speakers
                  Construction
                  pos    of       Total
Sex  female       2548   14 795   17 343
                  (1888.48) (15 454.52)
     male         2000   22 424   24 424
                  (2659.52) (21 764.48)
Total             4548   37 219   41 767

(c) female speakers
                  Construction
                  pos    of       Total
Age  old          935    5624     6559
                  (955.78) (5603.22)
     young        2548   14 795   17 343
                  (2527.22) (14 815.78)
Total             3483   20 419   23 902

(d) male speakers
                  Construction
                  pos    of       Total
Age  old          1515   18 911   20 426
                  (1600.83) (18 825.17)
     young        2000   22 424   24 424
                  (1914.17) (22 509.83)
Total             3515   41 335   44 850
Tables 6.22a and 6.22b show that the effect of Sex on the choice between the
two constructions is highly significant both in the group of old speakers and in
the group of young speakers, with effect sizes similar to those we found for the
bivariate analysis for Speaker Sex in Table 6.16 in the preceding section. This
effect seems to be genuine, or at least, it is not influenced by the hidden variable
Age (it may be influenced by Class or some other variable we have not included
in our design).
In contrast, Tables 6.22c and 6.22d show that the effect of Age that we saw
in Table 6.17 in the preceding section disappears completely for women, with a
𝑝-value not even significant at uncorrected levels of significance. For men, it is
still discernible, but only barely, with a 𝑝-value that indicates a very significant
relationship at corrected levels of significance, but with an effect size that is close
to zero. In other words, it may not be a genuine effect at all, but simply a conse-
quence of the fact that the intersections of Age and Sex are unevenly distributed
in the corpus.
This section is intended to impress on the reader one thing: that looking at
one potential variable influencing some phenomenon that we are interested in
may not be enough. Multivariate research designs are becoming the norm rather
than the exception, and rightly so. Excluding the danger of hidden variables is
just one advantage of such designs – in many cases, it is sensible to include
several independent variables simply because all of them potentially have an
interesting influence on the phenomenon under investigation, or because there
is just one particular combination of values of our variables that has an effect. In
the second part of this volume, there are several case studies that use CFA and
that illustrate these possibilities. One word of warning, however: the ability to
include a large number of variables in our research designs should not lead us to
do so for the sake of it. We should be able to justify, for each independent variable we include, why we are including it and in what way we expect it to influence our dependent variable.
7 Collocation
The (orthographic) word plays a central role in corpus linguistics. As suggested
in Chapter 4, this is in no small part due to the fact that all corpora, whatever
additional annotations may have been added, consist of orthographically repre-
sented language. This makes it easy to retrieve word forms. Every concordancing
program offers the possibility to search for a string of characters – in fact, some
are limited to this kind of query.
However, the focus on words is also due to the fact that the results of corpus
linguistic research quickly showed that words (individually and in groups) are
more interesting and show a more complex behavior than traditional, grammar-
focused theories of language assumed. An area in which this is very obvious, and
which has therefore become one of the most heavily researched areas in corpus
linguistics, is the way in which words combine to form so-called collocations.
This chapter is dedicated entirely to the discussion of collocation. At first, this
will seem like a somewhat abrupt shift from the topics and phenomena we have
discussed so far – it may not even be immediately obvious how they fit into the
definition of corpus linguistics as “the investigation of linguistic research ques-
tions that have been framed in terms of the conditional distribution of linguistic
phenomena in a linguistic corpus”, which was presented at the end of Chapter 2.
However, a closer look will show that studying the co-occurrence of words and/
or word forms is simply a special case of precisely this kind of research program.
7.1 Collocates
Trivially, texts are not random sequences of words. There are several factors in-
fluencing the likelihood of two (or more) words occurring next to each other.
First, the co-occurrence of words in a sequence is restricted by grammatical
considerations. For example, a definite article cannot be followed by another def-
inite article or a verb, but only by a noun, by an adjective modifying a noun, by
an adverb modifying such an adjective or by a post-determiner. Likewise, a tran-
sitive verb requires a direct object in the form of a noun phrase, so – barring cases
where the direct object is pre- or post-posed – it will be followed by a word that
can occur at the beginning of a noun phrase (such as a pronoun, a determiner,
an adjective or a noun).
Note that Firth, although writing well before the advent of corpus linguistics,
refers explicitly to frequency as a characteristic of collocations. The possibility
of using frequency as part of the definition of collocates, and thus as a way of
identifying them, was quickly taken up. Halliday (1961) provides what is probably
the first strictly quantitative definition (cf. also Church & Hanks (1990) for a more
recent comprehensive quantitative discussion):
Table 7.1: Identifying collocates
Second Position
word b other words Total
First Position word a a&b a & other a
other words other & b other & other other
Total b other corpus size
On the basis of such a table, we can determine the collocation status of a given
word pair. For example, we can ask whether Firth was right with respect to the
claim that silly ass is a collocation. The necessary data are shown in Table 7.2: As
discussed above, the dependent variable is the First Position in the sequence,
with the values silly and ¬silly (i.e., all words that are not silly); the independent
variable is the Second Position in the sequence, with the values ass and ¬ass.
¹ Note that we are using the corpus size as the table total – strictly speaking, we should be using
the total number of two-word sequences (bigrams) in the corpus, which will be lower: The
last word in each file of our corpus will not have a word following it, so we would have to
subtract the last word of each file – i.e., the number of files in our corpus – from the total. This
is unlikely to make much of a difference in most cases, but the shorter the texts in our corpus
are, the larger the difference will be. For example, in a corpus of tweets, which, at the time of
writing, are limited to 280 characters, it might be better to correct the total number of bigrams
in the way described.
Table 7.2: The co-occurrence of silly and ass in the BNC
Second Position
ass ¬ass Total
First Position silly 7 2632 2639
(0.01) (2638.99)
¬silly 295 98 360 849 98 361 144
(301.99) (98 360 842.01)
Total 302 98 363 481 98 363 783
The combination silly ass is very rare in English, occurring just seven times
in the 98 363 783 word BNC, but the expected frequencies in Table 7.2 show that
this is vastly more frequent than should be the case if the words co-occurred ran-
domly – in the latter case, the combination should have occurred just 0.01 times
(i.e., not at all). The difference between the observed and the expected frequen-
cies is highly significant (𝜒 2 = 6033.8, df = 1, 𝑝 < 0.001). Note that we are using
the 𝜒 2 test here because we are already familiar with it. However, this is not the
most useful test for the purpose of identifying collocations, so we will discuss
better options below.
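For concreteness, here is a Python sketch that builds the contingency table from the pair frequency, the two word frequencies and the corpus size, and computes the same 𝜒 2 value (Yates' continuity correction is switched off to match the calculation above):

```python
import numpy as np
from scipy.stats import chi2_contingency

f_pair, f_silly, f_ass, n = 7, 2639, 302, 98_363_783  # BNC frequencies

observed = np.array([
    [f_pair,         f_silly - f_pair],
    [f_ass - f_pair, n - f_silly - f_ass + f_pair],
])
chi2_value, p, df, expected = chi2_contingency(observed, correction=False)
print(expected[0, 0])  # 0.0081: silly ass "should" not occur at all
print(chi2_value, p)   # about 6034, p < 0.001
```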
Generally speaking, the goal of a quantitative collocation analysis is to iden-
tify, for a given word, those other words that are characteristic for its context of
usage. Tables 7.1 and 7.2 present the most straightforward way of doing so: we
simply compare the frequency with which two words co-occur to the frequencies
with which they occur in the corpus in general. In other words, the two condi-
tions across which we are investigating the distribution of a word are “next to a
given other word” and “everywhere else”. This means that the corpus itself func-
tions as a kind of neutral control condition, albeit a somewhat indiscriminate one:
comparing the frequency of a word next to some other word to its frequency in
the entire rest of the corpus is a bit like comparing an experimental group of sub-
jects that have been given a particular treatment to a control group consisting of
all other people who happen to live in the same city.
Often, we will be interested in the distribution of a word across two specific
conditions – in the case of collocation, the distribution across the immediate
contexts of two semantically related words. It may be more insightful to compare
adjectives occurring next to ass with those occurring next to the rough synonym
donkey or the superordinate term animal. Obviously, the fact that silly occurs
more frequently with ass than with donkey or animal is more interesting than
the fact that silly occurs more frequently with ass than with stone or democracy.
Likewise, the fact that silly occurs with ass more frequently than childish is more
interesting than the fact that silly occurs with ass more frequently than precious
or parliamentary.
In such cases, we can modify Table 7.1 as shown in Table 7.3 to identify the
collocates that differ significantly between two words. There is no established
term for such collocates, so we we will call them differential collocates here2 (the
method is based on Church et al. 1991).
Table 7.3: Identifying differential collocates
Second Position
word b word c Total
First Position word a a&b a&c a
other other & b other & c other
Total b c sample size
Since the collocation silly ass and the word ass in general are so infrequent in
the BNC, let us use a different noun to demonstrate the usefulness of this method,
the word game. We can speak of silly game(s) or childish game(s), but we may feel
that the latter is more typical than the former. The relevant lemma frequencies
to put this feeling to the test are shown in Table 7.4.
Table 7.4: Childish game vs. silly game (lemmas) in the BNC
First Position
childish silly Total
Second Position game 12 31 43
(6.18) (36.82)
¬game 431 2608 3039
(436.82) (2602.18)
Total 443 2639 3082
² Gries (2003b) and Gries & Stefanowitsch (2004) use the term distinctive collocate, which has
been taken up by some authors; however, many other authors use the term distinctive collocate
much more broadly to refer to characteristic collocates of a word.
The sequences childish game(s) and silly game(s) both occur in the BNC. Both
combinations taken individually are significantly more frequent than expected
(you may check this yourself using the frequencies from Table 7.4, the total
lemma frequency of game in the BNC (20 627), and the total number of words
in the BNC given in Table 7.2 above). The lemma sequence silly game is more
frequent, which might lead us to assume that it is the stronger collocation. How-
ever, the direct comparison shows that this is due to the fact that silly is more
frequent in general than childish, making the combination silly game more proba-
ble than the combination childish game even if the three words were distributed
randomly. The difference between the observed and the expected frequencies
suggests that childish is more strongly associated with game(s) than silly. The
difference is significant (𝜒 2 = 6.49, df = 1, 𝑝 < 0.05).
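The same kind of calculation for the differential collocates in Table 7.4, again sketched with scipy and without continuity correction:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [12,  31],     # game after childish / after silly
    [431, 2608],   # other words after childish / after silly
])
chi2_value, p, df, expected = chi2_contingency(observed, correction=False)
print(chi2_value, p)  # 6.49, p < 0.05
```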
Researchers differ with respect to what types of co-occurrence they focus on
when identifying collocations. Some treat co-occurrence as a purely sequential
phenomenon, defining collocates as words that co-occur more frequently than
expected within a given span. Some researchers require a span of 1 (i.e., the words
must occur directly next to each other), but many allow larger spans (five words
being a relatively typical span size).
Other researchers treat co-occurrence as a structural phenomenon, i.e., they
define collocates as words that co-occur more frequently than expected in two
related positions in a particular grammatical structure, for example, the adjec-
tive and noun positions in noun phrases of the form [Det Adj N] or the verb
and noun position in transitive verb phrases of the form [V [NP (Det) (Adj) N]].3
However, instead of limiting the definition to one of these possibilities, it seems
more plausible to define the term appropriately in the context of a specific re-
search question. In the examples above, we used a purely sequential definition
that simply required words to occur next to each other, paying no attention to
their word-class or structural relationship; given that we were looking at adjec-
tive-noun combinations, it would certainly have been reasonable to restrict our
search parameters to adjectives modifying the noun ass, regardless of whether
other adjectives intervened, for example in expressions like silly old ass, which
our query would have missed if they occurred in the BNC (they do not).
It should have become clear that the designs in Tables 7.1 and 7.3 are essen-
tially variants of the general research design introduced in previous chapters
and used as the foundation of defining corpus linguistics: it has two variables,
3
Note that such word-class specific collocations are sometimes referred to as colligations, al-
though the term colligation usually refers to the co-occurrence of a word in the context of
particular word classes, which is not the same.
Position 1 and Position 2, both of which have two values, namely word x vs.
other words (or, in the case of differential collocates, word x vs. word y). The
aim is to determine whether the value word a is more frequent for Position 1
under the condition that word b occurs in Position 2 than under the condition
that other words (or a particular other word) occur in Position 2.
the last thirty years. In fact, it is sometimes difficult to imagine a plausible hy-
pothesis for collocational research projects. What hypothesis would we formu-
late before identifying all collocations in the LOB or some specialized corpus (e.g.,
a corpus of business correspondence, a corpus of flight-control communication
or a corpus of learner language)?4 Despite this, it is clear that the results of such
a collocation analysis yield interesting data, both for practical purposes (build-
ing dictionaries or teaching materials for business English or aviation English,
extracting terminology for the purpose of standardization, training natural-lan-
guage processing systems) and for theoretical purposes (insights into the nature
of situational language variation or even the nature of language in general).
But there is a danger, too: Most statistical procedures will produce some statis-
tically significant result if we apply them to a large enough data set, and colloca-
tional methods certainly will. Unless we are interested exclusively in description,
the crucial question is whether these results are meaningful. If we start with a hy-
pothesis, we are restricted in our interpretation of the data by the need to relate
our data to this hypothesis. If we do not start with a hypothesis, we can interpret
our results without any restrictions, which, given the human propensity to see
patterns everywhere, may lead to somewhat arbitrary post-hoc interpretations
that could easily be changed, even reversed, if the results had been different and
that therefore tell us very little about the phenomenon under investigation or
language in general. Thus, it is probably a good idea to formulate at least some
general expectations before doing a large-scale collocation analysis.
Even if we do start out with general expectations or even with a specific hy-
pothesis, we will often discover additional facts about our phenomenon that go
beyond what is relevant in the context of our original research question. For ex-
ample, checking in the BNC Firth’s claim that the most frequent collocates of
ass are silly, obstinate, stupid, awful and egregious and that young is “much more
frequent” than old, we find that silly is indeed the most frequent adjectival collo-
cate, but that obstinate, stupid and egregious do not occur at all, that awful occurs
only once, and that young and old both occur twice. Instead, frequent adjectival
collocates (ignoring second-placed wild, which exclusively refers to actual donkeys)
are pompous and bad. Pompous does not really fit with the semantics that
Firth’s adjectives suggest and could indicate that a semantic shift from ‘stupidity’
to ‘self-importance’ may have taken place between 1957 and 1991 (when the BNC
was assembled).
4
Of course we are making the implicit assumption that there will be collocates – in a sense, this
is a hypothesis, since we could conceive of models of language that would not predict their
existence (we might argue, for example, that at least some versions of generative grammar
constitute such models). However, even if we accept this as a hypothesis, it is typically not the
one we are interested in when conducting this kind of study.
This is, of course, a new hypothesis that can (and must) be investigated by
comparing data from the 1950s and the 1990s. It has some initial plausibility in
that the adjectives blithering, hypocritical, monocled and opinionated also co-oc-
cur with ass in the BNC but are not mentioned by Firth. However, it is crucial to
treat this as a hypothesis rather than a result. The same goes for bad ass, which
suggests that the American sense of ass (‘bottom’) and/or the American adjective
badass (which is often spelled as two separate words) may have begun to enter
British English. In order to be tested, these ideas – and any ideas derived from an
exploratory data analysis – have to be turned into testable hypotheses and the
constructs involved have to be operationalized. Crucially, they must be tested on
a new data set – if we were to circularly test them on the same data that they
were derived from, we would obviously find them confirmed.
                                     Second Position
                              word b       other/word c     Total
First Position   word a       O11          O12              R1
                 other words  O21          O22              R2
                 Total        C1           C2               N
Table 7.6: Co-occurrence of good and example in the BNC

                                     Second Position
                           example             ¬example                     Total
First Position   good        9   (0.2044)          836 (844.7956)             845
                 ¬good     236 (244.7956)    1 011 904 (1 011 895.2044)  1 012 140
                 Total     245               1 012 740                   1 012 985
Measures of collocation strength differ with respect to the data needed to calculate
them, their computational intensiveness and, crucially, the quality of their
results. In particular, many measures, notably the ones easy to calculate, have a
problem with rare collocations, especially if the individual words of which they
consist are also rare. After we have introduced the measures, we will therefore
compare their performance with a particular focus on the way in which they deal
(or fail to deal) with such rare events.
7.1.3.1 Chi-square
The first association measure is an old acquaintance: the chi-square statistic,
which we used extensively in Chapter 6 and in Section 7.1.1 above. I will not
demonstrate it again, but the chi-square value for Table 7.6 would be 378.95 (at
1 degree of freedom this means that 𝑝 < 0.001, but we are not concerned with
𝑝-values here).
Recall that the chi-square test statistic is not an effect size, but that it needs
to be divided by the table total to turn it into one. As long as we are deriving
all our collocation data from the same corpus, this will not make a difference,
since the table total will always be the same. However, this is not always the
case. Where table sizes differ, we might consider using the phi value instead. I
am not aware of any research using phi as an association measure, and in fact
the chi-square statistic itself is not used widely either. This is because it has a
serious problem: recall that it cannot be applied if more than 20 percent of the
cells of the contingency table contain expected frequencies smaller than 5 (in
the case of collocates, this means not even one out of the four cells of the 2-by-2
table). One reason for this is that it dramatically overestimates the effect size and
significance of such events, and of rare events in general. Since collocations are
often relatively rare events, this makes the chi-square statistic a bad choice as an
association measure.
5
A logarithm with base 𝑏 of a given number 𝑥 is the power to which 𝑏 must be raised to
produce 𝑥, so, for example, log10(2) = 0.30103, because 10^0.30103 = 2. Most calculators offer at
the very least a choice between the natural logarithm, where the base is the number 𝑒 (approx.
2.7183) and the common logarithm, where the base is the number 10; many calculators and all
major spreadsheet programs offer logarithms with any base. In the formula in (1), we need the
logarithm with base 2; if this is not available, we can use the natural logarithm and divide the
result by the natural logarithm of 2:
MI = log𝑒(𝑂11 / 𝐸11) / log𝑒(2)
The mutual information measure suffers from the same problem as the 𝜒 2
statistic: it overestimates the importance of rare events. Since it is still fairly wide-
spread in collocational research, we may nevertheless need it in situations where
we want to compare our own data to the results of published studies. However,
note that there are versions of the MI measure that will give different results,
so we need to make sure we are using the same version as the study we are
comparing our results to. But unless there is a pressing reason, we should not
use mutual information at all.
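Should we nevertheless need MI for comparison with published research, the version defined in the footnote above is easily computed from the observed and expected frequencies. The following minimal sketch in R uses the figures from Table 7.6 (remember that other versions of MI will give different results):

    # Mutual information MI = log2(O11/E11) for Table 7.6
    O11 <- 9
    E11 <- 845 * 245 / 1012985   # row total x column total / table total
    log(O11 / E11, base = 2)     # approx. 5.46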
In order to calculate the 𝐺 measure, we calculate for each cell the natural loga-
rithm of the observed frequency divided by the expected frequency and multiply
it by the observed frequency. We then add up the results for all four cells and
multiply the result by two. Note that if the observed frequency of a given cell
is zero, the expression 𝑂𝑖/𝐸𝑖 will, of course, also be zero. Since the logarithm of
zero is undefined, this would result in an error in the calculation. Thus, log(0) is
simply defined as zero when applying the formula in (3).
Applying the formula in (3) to the data in Table 7.6, we get the following:
𝐺 = 2 × (9 × log𝑒(9/0.2044) + 836 × log𝑒(836/844.7956)
       + 236 × log𝑒(236/244.7956) + 1 011 904 × log𝑒(1 011 904/1 011 895.2044))
  = 2 × (34.0641 + (−8.7497) + (−8.6357) + 8.7956) = 50.9489
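This calculation is easy to automate. The following minimal sketch in R wraps it in a function (the name g_score is mine, not an established one), with the zero-handling convention just described built in:

    # Log-likelihood ratio statistic G for a 2-by-2 contingency table
    g_score <- function(m) {
      e <- outer(rowSums(m), colSums(m)) / sum(m)  # expected frequencies
      2 * sum(ifelse(m == 0, 0, m * log(m / e)))   # 0 * log(0) counts as 0
    }
    g_score(matrix(c(9, 836, 236, 1011904), nrow = 2, byrow = TRUE))
    # approx. 50.95, the value calculated above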
The G value has long been known to be more reliable than the 𝜒 2 test when
dealing with small samples and small expected frequencies (Read & Cressie 1988:
134ff). This led Dunning (1993) to propose it as an association measure specifi-
cally to avoid the overestimation of rare events that plagues the 𝜒 2 test, mutual
information and other measures.
alone 1 011 904. But if we could, we would find that the 𝑝-value for Table 7.6 is
0.000000000001188.
Spreadsheet applications do not usually offer Fisher’s exact test, but all major
statistics applications do. However, typically, the exact 𝑝-value is not reported
beyond the limit of a certain number of decimal places. This means that there is
often no way of ranking the most strongly associated collocates, because their 𝑝-
values are smaller than this limit. For example, there are more than 100 collocates
in the LOB corpus with a Fisher’s exact 𝑝-value that is smaller than the smallest
value that a standard-issue computer chip is capable of calculating, and more
than 5000 collocates that have 𝑝-values that are smaller than what the standard
implementation of Fisher’s exact test in the statistical software package R will
deliver. Since in research on collocations we often need to rank collocations in
terms of their strength, this may become a problem.
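To illustrate (a minimal sketch in R, using the frequencies from Table 7.6): the value reported above can be reproduced with the built-in fisher.test(), here in its one-tailed version testing for attraction:

    # Fisher's exact test for Table 7.6
    m <- matrix(c(9, 836, 236, 1011904), nrow = 2, byrow = TRUE)
    fisher.test(m, alternative = "greater")$p.value  # approx. 1.188e-12

Since the function returns the 𝑝-value as an ordinary double-precision number, collocations whose 𝑝-values underflow to zero cannot be ranked this way – which is precisely the limitation just described.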
are a number of fully compositional combinations that make sense but do not
have any special status (caparisoned mule, new horse, old donkey, young zebra,
large mule, female hinny, extinct quagga).
In addition, I have selected them to represent different types of frequency re-
lations: some of them are (relatively) frequent, some of them very rare; for some
of them either the adjective or the noun is generally quite frequent, and for
some of them neither of the two is frequent.
Table 7.8 shows the ranking of these twenty collocations by the five association
measures discussed above. Simplifying somewhat, a good association measure
should rank the conventionalized combinations highest (rocking horse, Trojan
horse, silly ass, pompous ass, prancing horse, braying donkey, galloping horse), the
distinctive-sounding but non-conventionalized combinations somewhere in the
middle (jumped-up jackass, dumb-fuck donkey, old ass, monocled ass) and the
compositional combinations lowest (common zebra, caparisoned mule, new horse,
old donkey, young zebra, large mule, female hinny, extinct quagga). Common zebra
is difficult to predict – it is a conventionalized expression, but not in the general
language.
Table 7.8: Comparison of selected association measures for collocates
of the form [ADJ Nequine ] (BNC)
All association measures fare quite well, generally speaking, with respect to
the compositional expressions – these tend to occur in the lower third of all lists.
Where there are exceptions, the 𝜒 2 statistic, mutual information and minimum
sensitivity rank rare cases higher than they should (e.g. caparisoned mule, extinct
quagga), while the G and the 𝑝-value of Fisher’s exact test rank frequent cases
higher (e.g. galloping horse).
With respect to the non-compositional cases, 𝜒 2 and mutual information are
quite bad, overestimating rare combinations like jumped-up jackass, dumb-fuck
donkey and monocled ass, while listing some of the clear cases of collocations
much further down the list (silly ass, and, in the case of MI, rocking horse). Min-
imum sensitivity is much better, ranking most of the conventionalized cases in
the top half of the list and the non-conventionalized ones further down (with
the exception of jumped-up jackass, where both the individual words and their
combination are very rare). The G and the Fisher 𝑝-value fare best (with no dif-
ferences in their ranking of the expressions), listing the conventionalized cases
at the top and the distinctive but non-conventionalized cases in the middle.
To demonstrate the problems that very rare events can cause (especially those
where both the combination and each of the two words in isolation are very
rare), imagine someone had used the phrase tomfool onager once in the BNC.
Since neither the adjective tomfool (a synonym of silly) nor the noun onager (the
name of the donkey sub-genus Equus hemionus, also known as Asiatic or Asian
wild ass) occurs in the BNC anywhere else, this would give us the distribution in
Table 7.9.
Table 7.9: Fictive occurrence of tomfool onager in the BNC

                                      Second Position
                            onager        ¬onager                       Total
First Position   tomfool    1 (0.00)               0 (1.00)                      1
                 ¬tomfool   0 (1.00)      98 363 782 (98 363 781.00)   98 363 782
                 Total      1             98 363 782                   98 363 783
wide margin. Again, the log-likelihood ratio test and Fisher’s exact test are much
better, putting it in eighth place on both lists (𝐺 = 36.81, 𝑝exact = 1.02 × 10−8 ).
Although the example is hypothetical, the problem is not. It uncovers a math-
ematical weakness of many commonly used association measures. From an em-
pirical perspective, this would not necessarily be a problem, if cases like that in
Table 7.9 were rare in linguistic corpora. However, they are not. The LOB corpus,
for example, contains almost one thousand such cases, including some legitimate
collocation candidates (like herbal brews, casus belli or sub-tropical climates), but
mostly compositional combinations (ungraceful typography, turbaned headdress,
songs-of-Britain medley), snippets of foreign languages (freie Blicke, l’arbre rouge,
palomita blanca) and other things that are quite clearly not what we are looking
for in collocation research. All of these will occur at the top of any collocate list
created using statistics like 𝜒 2 , mutual information and minimum sensitivity. In
large corpora, which are impossible to check for orthographical errors and/or
errors introduced by tokenization, this list will also include hundreds of such
errors (whose frequency of occurrence is low precisely because they are errors).
To sum up, when doing collocational research, we should use the best associ-
ation measures available. For the time being, this is the 𝑝-value of Fisher's exact
test (if we have the means to calculate it), or G (if we don’t, or if we prefer us-
ing a widely-accepted association measure). We will use G through much of the
remainder of this book whenever dealing with collocations or collocation-like
phenomena.
233
7 Collocation
234
7.2 Case studies
ally and particularly). However, most degree adverbs are clearly associated with
semantically restricted sets of adjectives. The restrictions are of three broad types.
First, there are connotational restrictions: some adverbs are associated primarily
with positive words (e.g. perfectly) or negative words (e.g. utterly, totally; on
connotation cf. also Section 7.2.3). Second, there are specific semantic restrictions
(for example, incredibly, which is associated with subjective judgments), some-
times relating transparently to the meaning of the adverb (for example, badly,
which is associated with words denoting damage or clearly, which is associated
with words denoting sensory perception). Finally, there are morphological re-
strictions (some adverbs are used frequently with words derived by particular
suffixes, for example, perfectly, which is frequently found with words derived by
-able/-ible, or totally, whose collocates often contain the prefix un-). Table 7.10 il-
lustrates these findings for 5 of the 24 degree adverbs and their top 15 collocates.
Unlike Kennedy, I have used the G statistic of the log-likelihood ratio test,6
and so the specific collocates differ from the ones he finds (generally, his lists
include more low-frequency combinations, as expected given that he uses mutual
information), but his observations concerning the semantic and morphological
sets are generally confirmed.
This case study illustrates the exploratory design typical of collocational re-
search as well as the kind of result that such studies yield and the observations
possible on the basis of these results. By comparing the results reported here to
Kennedy’s, you may also gain a better understanding as to how different associ-
ation measures may lead to different results.
Table 7.10: Selected degree adverbs and their collocates
it is often difficult to tell what the difference in meaning is, especially since they
are often interchangeable at least in some contexts. Obviously, the distribution
of such pairs or sets with respect to other words in a corpus can provide insights
into their similarities and differences.
One example of such a study is Taylor (2003), which investigates the synonym
pair high and tall by identifying all instances of the two words in their subsense
‘large vertical extent’ in the LOB corpus and categorizing the words they mod-
ify into eleven semantic categories. These categories are based on semantic dis-
tinctions such as human vs. inanimate, buildings vs. other artifacts vs. natural
entities, etc., which are expected a priori to play a role.
The study, while not strictly hypothesis-testing, is thus somewhat deductive.
It involves two nominal variables; the independent variable Type of Entity with
eleven values shown in Table 7.11 and the dependent variable Vertical Extent
Adjective with the values high and tall (assuming that people first choose
something to talk about and then choose the appropriate adjective to describe it).
Table 7.11 shows Taylor’s results (he reports absolute and relative frequencies,
which I have used to calculate expected frequencies and 𝜒 2 components).
As we can see, there is little we can learn from this table, since the frequencies
in the individual cells are simply too small to apply the 𝜒 2 test to the table as
a whole. The only 𝜒 2 components that reach significance individually are those
for the category human, which show that tall is preferred and high avoided with
human referents. The sparsity of the data in the table is due to the fact that the
analyzed sample is very small, and this problem is exacerbated by the fact that the
little data available is spread across too many categories. The category labels are
not well chosen either: they overlap substantially in several places (e.g., towers
and walls are buildings, pieces of clothing are artifacts, etc.) and not all of them
seem relevant to any expectation we might have about the words high and tall.
Taylor later cites earlier psycholinguistic research indicating that tall is used
when the vertical dimension is prominent, is an acquired property and is a prop-
erty of an individuated entity. It would thus have been better to categorize the
corpus data according to these properties – in other words, a more strictly de-
ductive approach would have been more promising given the small data set.
Alternatively, we can take a truly exploratory approach and look for differ-
ential collocates as described in Section 7.1.1 above – in this case, for differential
noun collocates of the adjectives high and tall. This allows us to base our analysis
on a much larger data set, as the nouns do not have to be categorized in advance.
Table 7.12 shows the top 15 differential collocates of the two words in the BNC.
Table 7.11: Objects described as tall or high in the LOB corpus (adapted
from Taylor 2003)
Adjective
Noun Category tall high Total
humans Obs.: 45 Obs.: 2 47
Exp.: 22.91 Exp.: 24.09
𝜒 2: 21.31 𝜒 2: 20.26
animals Obs.: 0 Obs.: 1 1
Exp.: 0.49 Exp.: 0.51
𝜒 2: 0.49 𝜒 2: 0.46
plants, trees Obs.: 7 Obs.: 3 10
Exp.: 4.87 Exp.: 5.13
𝜒 2: 0.93 𝜒 2: 0.88
buildings Obs.: 3 Obs.: 10 13
Exp.: 6.34 Exp.: 6.66
𝜒 2: 1.76 𝜒 2: 1.67
walls, fences, Obs.: 0 Obs.: 5 5
etc Exp.: 2.44 Exp.: 2.56
𝜒 2: 2.44 𝜒 2: 2.32
towers, statues, Obs.: 0 Obs.: 7 7
pillars, sticks Exp.: 3.41 Exp.: 3.59
𝜒 2: 3.41 𝜒 2: 3.24
articles of Obs.: 0 Obs.: 7 7
clothing Exp.: 3.41 Exp.: 3.59
𝜒 2: 3.41 𝜒 2: 3.24
miscellaneous Obs.: 2 Obs.: 13 15
artifacts Exp.: 7.31 Exp.: 7.69
𝜒 2: 3.86 𝜒 2: 3.67
topographical Obs.: 0 Obs.: 5 5
features Exp.: 2.44 Exp.: 2.56
𝜒 2: 2.44 𝜒 2: 2.32
other natural Obs.: 0 Obs.: 5 5
phenomena Exp.: 2.44 Exp.: 2.56
𝜒 2: 2.44 𝜒 2: 2.32
uncertain Obs.: 1 Obs.: 3 4
reference Exp.: 1.95 Exp.: 2.05
𝜒 2: 0.46 𝜒 2: 0.44
Total 58 61 119
Table 7.12: Differential collocates for tall and high in the BNC
The results for tall clearly support Taylor’s ideas about the salience of the
vertical dimension. The results for high show something Taylor could not have
found, since he restricted his analysis to the subsense ‘vertical dimension’: when
compared with tall, high is most strongly associated with quantities or positions
in hierarchies and rankings. There are no spatial uses at all among its top differ-
ential collocates. This does not answer the question why we can use it spatially
and in competition with tall, but it shows what general sense we would have to
assume: one concerned not with the vertical extent as such, but with the magni-
tude of that extent (which, incidentally, Taylor notes in his conclusion).
This case study shows how the same question can be approached by a deduc-
tive or an inductive (exploratory) approach. The deductive approach can be more
precise, but this depends on the appropriateness of the categories chosen a pri-
ori for annotating the data; it is also time consuming and therefore limited to
relatively small data sets. In contrast, the inductive approach can be applied to a
large data set because it requires no a priori annotation. It also does not require
any choices concerning annotation categories; however, there is a danger of
projecting patterns into the data post hoc.
                                    Bad
                        occurs           ¬occurs               Total
Good    occurs           16   (1.57)        687    (701.43)       703
        ¬occurs         110 (124.43)     55 769 (55 754.57)    55 879
        Total           126              56 456                56 582
within this pattern as shown in Table 7.14 for the adjectives good and bad in the
BNC, and then categorizing the most strongly associated collocates in terms of
the lexical relationships between them.
Table 7.14: Co-occurrence of good and bad in the first and second slot
of [ADJ1 and ADJ2 ]

                                    Second Slot
                        bad               ¬bad                    Total
First Slot    good      158   (0.89)          476    (633.11)        634
              ¬good      35 (192.11)      136 893 (136 735.89)   136 928
              Total     193               137 369                137 562
Note that this is a slightly different procedure from what we have seen before:
instead of comparing the frequency of co-occurrence of two words with their
individual occurrence in the rest of the corpus, we are comparing it to their indi-
vidual occurrence in a given position of a given structure – in this case [ADJ and
ADJ] (Stefanowitsch & Gries (2005) call this kind of design covarying collexeme
analysis).
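Computationally, nothing changes: we fill in the table accordingly and apply an association measure. A minimal sketch in R for Table 7.14, reusing the g_score() function sketched in Section 7.1.3 above:

    # G for good and bad in [ADJ1 and ADJ2] (frequencies from Table 7.14)
    m <- matrix(c(158, 476, 35, 136893), nrow = 2, byrow = TRUE)
    g_score(m)  # roughly 1560: good and bad attract each other strongly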
Table 7.15 shows the thirty most strongly associated adjective pairs coordi-
nated with and in the BNC.
Clearly, antonymy is the dominant relation among these word pairs, which are
mostly opposites (black/white, male/female, public/private, etc.), and sometimes
relational antonyms (primary/secondary, economic/social, economic/political, so-
cial/political, lesbian/gay, etc.). The only cases of non-antonymic pairs are eco-
nomic/monetary, which is more like a synonym than an antonym and the fixed
expressions deaf /dumb and hon(ourable)/learned (as in honourable and learned
gentleman/member/friend). The pattern does not just hold for the top 30 collo-
cates but continues as we go down the list. There are additional cases of rela-
tional antonyms, like British/American and Czech/Slovak and additional exam-
ples of fixed expressions (alive and well, far and wide, true and fair, null and void,
noble and learned), but most cases are clear antonyms (for example, syntactic/
semantic, spoken/written, mental/physical, right/left, rich/poor, young/old, good/
evil, etc.). The only systematic exceptions are cases like worse and worse (a special
construction with comparatives indicating incremental change, cf. Stefanowitsch
2007b).
This case study shows how deductive and inductive approaches may complement
each other: while the deductive studies cited show that antonyms tend to co-
occur syntagmatically, the inductive study presented here shows that words
that co-occur syntagmatically (at least in certain syntactic contexts) tend to be
antonyms. These two findings are not equivalent; the second finding shows that
the first finding may indeed be typical for antonymy as opposed to other lexical
relations.
The exploratory study was limited to a particular syntactic/semantic context,
chosen because it seems semantically and pragmatically neutral enough to allow
all kinds of lexical relations to occur in it. There are contexts which might be ex-
pected to be particularly suitable to particular kinds of lexical relations and which
could be used, given a large enough corpus, to identify word pairs in such rela-
tions. For example, the pattern [ADJ rather than ADJ] seems semantically predis-
posed for identifying antonyms, and indeed, it yields pairs like implicit/explicit,
worse/better, negative/positive, qualitative/quantitative, active/passive, real/appar-
ent, local/national, political/economical, etc. Other patterns are semantically more
complex, identifying pairs in more context-dependent oppositions; for example,
[ADJ but not ADJ] identifies pairs like desirable/essential, necessary/sufficient,
similar/identical, small/insignificant, useful/essential, difficult/impossible. The re-
lation between the adjectives in these pairs is best described as pragmatic – the
first one conventionally implies the second.
the term is widely-used in (at least) these two different ways, and since “posi-
tive” and “negative” connotations are very general kinds of attitudinal meaning,
it seems more realistic to accept a certain vagueness of the term. If necessary, we
could differentiate between the general semantic prosody of a word (its “positive”
or “negative” connotation as reflected in its collocates) and its specific semantic
prosody (the word-specific attitudinal meaning reflected in its collocates).
1 f unless you 're absolutely sure of your [true feelings] . I had a similar experience several ye
2 nces may well not reflect my employer 's [true feelings] on the matter , but once having sustain
3 and realize it is all right to show our [true feelings] and that it is all right to be rejected
4 wing right action : acting only from our [true feelings] , not governed by the distortions of em
5 der , but the problem of ` reading ' the [true feelings] of the individual can be made easier by
6 other . Having declared to Roderigo his [true feelings] about Othello , Iago later explains why
7 ell studied in the art of disguising his [true feelings] . Let him not be frightened of me ; let
8 rised that the TV presenter revealed her [true feelings] towards Nicola so quickly : most people
9 embers are helpful to show each side the [true feelings] of the other , the need to accept and w
10 good husband , but you like to hide your [true feelings] . ' ` Oh , do n't be so serious , B
11 er , he has n't actually dealt with the [true feelings] that he had towards his father , and wh
12 g as ` friends ' , without revealing her [true feelings] for him . It was still light when he pi
13 t the parents will often not admit their [true feelings] about the child and the incident , acti
14 t a matter of time before she showed her [true feelings] , I was sure of that . Females -- hone
15 m for so long at last gave vent to their [true feelings] . The match had been billed in the Amer
16 eople . And got him plenty sex . Rory 's [true feelings] about the matter were complex but red-b
17 t had finally forced her to confront her [true feelings] for Arnie . Or rather , her lack of fee
18 rage in both hands , and told him of her [true feelings] , they might have had a chance to work
19 andmother finds it difficult to show her [true feelings] . ' said David . ` I think it 's a
20 er heart did more to convince her of her [true feelings] than any rational thinking . She wanted
1 by the rest of the board ? Re-programme [your feelings] , in that case . The annual BW accounts
2 the Asian women I spoke to told me about [their feelings] and situations . Here I shall try to d
3 ractive , but I think you might consider [my feelings] as well as your own. , Another pause .
4 o trust her more , dared to feel more of [my feelings] , instead of eating them away . It woul
5 all was in order . It is hard to explain [my feelings] once I did finally set off . For the fi
6 e family and the old person work through [their feelings] about any restrictions . This contract
7 say . ` Nothing is ever going to change [their feelings] towards me . ` I 've tried everything
8 han rights . It is about men reconciling [their feelings] towards their fathers and learning how
9 l family . It is as if to let people see [your feelings] takes away some of your power . But at
10 eyelids defensively lowered to disguise [her feelings] . Crossing her legs discreetly , she du
11 nxiety ? Should n't she just accept that [her feelings] about her mother 's lifestyle were irra
12 o stop things before they went too far . [His feelings] had gone no deeper than the surface . N
13 resentment , because you do n't care for [my feelings] at all . You always think the worst of
14 etence , could n't face having to stifle [her feelings] , her crazy and immature hopes -- hope
15 Remember ? ' ` I thought I could control [my feelings] , have an exciting affair with you and
16 her and kissing her softly , she voiced [her feelings] by saying , ` I love you , Gran . '
17 our lack of understanding with regard to [his feelings] as a father . ' ` Oh , Great-gran ,
18 right , then , the doubts you had about [your feelings] . ' ` You mean my feelings towards
19 y North-West 's Billy Anderson who vents [his feelings] about the lack of North-West representa
20 that is by giving them a copy . That 's [my feelings] erm . I move . Thanks very much indeed
                                        Prosody
                               reluctance     ¬reluctance     Total
Expression   true feelings     11 (7.50)       9 (12.50)        20
             [poss feelings]    4 (7.50)      16 (12.50)        20
             Total             15             25                40
1 r-head wolf-whistles . Real situations , [real feelings] , real people , real love . The album s
2 onal Checklist : I do my best to hide my [real feelings] from others I always try to please othe
3 , how to manipulate , how to hide their [real feelings] and how to convince those that love the
4 f the death of a cousin . Disguising his [real feelings] he wrote cheerfully , telling them that
5 her words , the counsellor must seek the [real feelings] of the counsellee through careful liste
6 tant issues are fully discussed and that [real feelings] are expressed rather than avoided . An
7 at prevented him from ever revealing his [real feelings] to any woman . How she regretted those
8 ing process of mystification that denies [real feelings] and experiences is a necessary prop to
9 the play to whom he reveals some of his [real feelings] is Roderigo , but only while using him
10 sked her much sooner if he had known her [real feelings] towards him , but she had been so forma
11 of situation neither can say what their [real feelings] are . A true conversation might be ,
12 clerks are not allowed to express their [real feelings] at work , it is not surprising that the
13 k foolish in public in order to hide his [real feelings] . Men were strange creatures at times .
14 t she could smother the awakening of her [real feelings] for him ? He 'd been important enough t
15 but she hoped she managed to conceal her [real feelings] . Guessing what might greet her in the
16 ight of their honeymoon ? If Ace had any [real feelings] for her he would have taken her prohibi
17 used deliberately as a mask to hide his [real feelings] , she could only guess . ` Let me tak
18 had left him -- but his control over his [real feelings] had remained even then . But what had c
19 ' Relieved that she had not betrayed her [real feelings] , Sophie concentrated on the morning su
20 der has an insight into the Mr. Darcy 's [real feelings] during particular parts of the book . E
is the exact proportion also observed with true feelings, so even if you disagree
with one or two of my categorization decisions, there is no significant difference
between the two expressions.
It seems, then, that the semantic prosody Sinclair observes is not attached to
the expression true feelings in particular, but that it is an epiphenomenon of the
fact that we typically distinguish between “genuine” (true, real, etc.) emotions
and other emotions in a particular context, namely one where someone is reluctant
or unable to express their genuine emotions. Of course, studies of additional
expressions with adjectives meaning “genuine” modifying nouns meaning “emo-
tion” might give us a more detailed and differentiated picture, as might studies
of other nouns modified by adjectives like true (such as true nature, true beliefs,
true intentions, etc.). Such studies are left as an exercise to the reader – this case
study was mainly meant to demonstrate how informal analyses based on the in-
spection of concordances can be integrated into a more rigorous research design
involving quantification and comparison to a set of control data.
this is the strategy we also used in Case Study 7.2.2.1 above in order to determine
semantic differences between high and tall.
We will not follow Stubbs’ discussion in detail here – his focus is on method-
ological issues regarding the best way to identify collocates. Since we decided in
Section 7.1.3 above to stick with the G statistic, this discussion is not central for
us. Stubbs does not present the results of his procedure in detail and the corpus
he uses is not accessible anyway, so let us use the BNC again and extract our
own data.
Table 7.17 shows the result of an attempt to extract direct objects of the verb
cause from the BNC. I searched for the lemma cause where it is tagged as a verb,
followed by zero to three words that are not nouns (to take into account the
occurrence of determiners, adjectives, etc.) and that are not the word by (in order
to exclude passives like caused by negligence, fire, exposure, etc.), followed by a
noun or sequence of nouns, not followed by to (in order to exclude causative
constructions of the form caused the glass to break). This noun, or the last noun
in this sequence, is assumed to be the direct object of cause. The twenty most
frequent nouns are shown in Table 7.17, Column (a).
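For readers who want to experiment with similar extractions, the logic of this query can be approximated by a regular expression over a part-of-speech-tagged text. The following R sketch is only a rough approximation of the query just described: the tags are CLAWS5-style (as in the BNC), the example string is invented, and a real query would have to handle many additional tokenization details:

    # cause (verb) + 0-3 non-nouns (excluding "by") + noun(s), not followed by "to"
    s  <- "this_DT0 caused_VVD a_AT0 serious_AJ0 problem_NN1 for_PRP us_PNP"
    rx <- paste0("caus[a-z]*_VV[A-Z0-9]*",                # lemma cause, tagged as verb
                 "( (?!by_)\\w+_(?!NN)[A-Z0-9]+){0,3}",   # 0-3 non-nouns, no "by"
                 "( \\w+_NN[A-Z0-9]*)+",                  # a noun or noun sequence
                 "(?! to_)")                              # not followed by "to"
    regmatches(s, regexpr(rx, s, perl = TRUE))
    # [1] "caused_VVD a_AT0 serious_AJ0 problem_NN1"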
These collocates clearly corroborate Stubbs’ observation about the negative
semantic prosody of cause. We could now calculate the association strength be-
tween the verb and each of these nouns to get a better idea of which of them are
significant collocates and which just happen to be frequent in the corpus overall.
It should be obvious, however, that the nouns in Table 7.17, Column (a) are not
generally frequent in the English language, so we can assume here that they are,
for the most part, significant collocates.
But even so, what does this tell us about the semantic prosody of the verb
cause? It has variously been pointed out (for example, by Louw & Chateau 2010)
that other verbs of causation also tend to have a negative semantic prosody –
the direct object nouns of bring about in Table 7.17, Column (b) and lead to in
Table 7.17, Column (c) corroborate this. The real question is, again, whether it is
the specific expression [cause NP] that has the semantic prosody in question, or
whether this prosody is found in an entire semantic domain – perhaps speakers
of English have a generally negative view of causation.
In order to determine this, it might be useful to compare different expressions
of causation to each other rather than to the corpus as a whole – to perform a
differential collocate analysis: just by inspecting the frequencies in Table 7.17,
it seems that the negative prosody is much weaker for bring about and lead to
than for cause, so, individually or taken together, they could serve as a baseline
against which to compare cause.
Table 7.18 shows the results of a differential collocate analysis between cause
on the one hand and the combined collocates of bring about and lead to on the
other.
Table 7.18: Differential collocates for cause compared to bring
about/lead to in the BNC
The negative prosody of the verb cause is even more pronounced than in the
frequency list in Table 7.17: Even the two neutral words change and increase have
disappeared. In contrast, the combined differential collocates of bring about and
lead to as compared to cause, shown in Table 7.19 are neutral or even positive.
We can thus conclude, first, that all three verbal expressions of causation are
likely to be used to some extent with direct object nouns with a negative conno-
tation. However, it is only the verb cause that has a negative semantic prosody.
Even the raw frequencies of nouns occurring in the object position of the three
expressions suggest this: while cause occurs almost exclusively with negatively
connoted nouns, bring about and lead to are much more varied. The differential
collocate analysis then confirms that within the domain of causation, the verb
cause specializes in encoding negative caused events, while the other two ex-
pressions encode neutral or positive events. Previous research (Louw & Chateau
2010) misses this difference as it is based exclusively on the qualitative inspection
of concordances.
Thus, the case study shows, once again, the need for strict quantification and
for research designs comparing the occurrence of a linguistic feature under dif-
ferent conditions. There is one caveat to the procedure presented here, however:
while it is a very effective strategy to identify collocates first and categorize them
according to their connotation afterwards, this categorization is then limited to
an assessment of the lexically encoded meaning of the collocates. For example,
problem and damage will be categorized as negative, but a problem does not have
to be negative – it can be interesting if it is the right problem and you are in the
right mood (e.g. [O]ne of these excercises caused an interesting problem for sev-
eral members of the class [Aiden Thompson, Who’s afraid of the Old Testament
God?]). Even damage can be a good thing in particular contexts from particular
perspectives (e.g. [A] high yield of intact PTX [...] caused damage to cancer cells in
addition to the immediate effects of PDT [10.1021/acs.jmedchem.5b01971]). Even
more likely, neutral words like change will have positive or negative connota-
tions in particular contexts, which are lost in the process of identifying collocates
quantitatively.
Keeping this caveat in mind, however, the method presented in this case study
can be applied fruitfully in more complex designs than the one presented here.
For example, we have treated the direct object position as a simple category here,
but Stefanowitsch & Gries (2003) present data for nominal collocates of the verb
cause in the object position of different subcategorization patterns. While their re-
sults corroborate the negative connotation of cause also found by Stubbs (1995a),
their results add an interesting dimension: while objects of cause in the transitive
construction (cause a problem) and the prepositional dative (cause a problem to
someone) refer to negatively perceived external and objective states, the objects
of cause in the ditransitive refer to negatively experienced internal and/or subjec-
tive states. Studies on semantic prosody can also take into account dimensions
beyond the immediate structural context – for example, Louw & Chateau (2010)
observe that the semantic prosody of cause is to some extent specific to particu-
lar language varieties, and present interesting data suggesting that in scientific
writing it is generally used with a neutral connotation.
                                   Second Position
                          boy                girl               Total
First Position   little   791 (927.53)       1148 (1011.47)     1939
                 small    336 (199.47)         81  (217.53)      417
                 Total   1127                1229               2356
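The skew in this table is easily checked (a minimal sketch in R, not Stubbs' original calculation): the observed and expected frequencies show that little is attracted to girl and small to boy, and the chi-square test confirms that this association is highly significant:

    # little/small vs. boy/girl (frequencies from the table above)
    m <- matrix(c(791, 1148, 336, 81), nrow = 2, byrow = TRUE,
                dimnames = list(adj  = c("little", "small"),
                                noun = c("boy", "girl")))
    chisq.test(m, correct = FALSE)  # X-squared = 217.67, df = 1, p < 2.2e-16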
This part of the study is more inductive. Stubbs may have expectations about
what he will find, but he essentially identifies collocates exploratively and then
interprets the findings. The nominal collocates show, according to Stubbs, that
small tends to mean ‘small in physical size’ or ‘low in quantity’, while little is
more clearly restricted to quantities, including informal quantifying phrases like
little bit. This is generally true for the BNC data, too (note, however, the one
exception among the top ten collocates – girl).
The connotational difference between the two adjectives becomes clear when
we look at the adjectives they combine with. The word little has strong associa-
tions to evaluative adjectives that may be positive or negative, and that are often
patronizing. Small, in contrast, does not collocate with evaluative adjectives.
Stubbs sums up his analysis by pointing out that small is a neutral word for
describing size, while little is sometimes used neutrally, but is more often “non-
literal and convey[s] connotative and attitudinal meanings, which are often pa-
tronizing, critical, or both.” (Stubbs 1995b: 386). The differences in distribution
relative to the words boy and girl are evidence for him that “[c]ulture is encoded
not just in words which are obviously ideologically loaded, but also in combina-
tions of very common words” (Stubbs 1995b: 387).
Stubbs remains unspecific as to what that ideology is – presumably, one that
treats boys as neutral human beings and girls as targets for patronizing eval-
uation. In order to be more specific, it would be necessary to turn around the
perspective and study all adjectival collocates of boy and girl. Stubbs does not
do this, but Caldas-Coulthard & Moon (2010) look at adjectives collocating with
man, woman, boy and girl in broadsheet and yellow-press newspapers. In order
to keep the results comparable with those reported above, let us stick with the
BNC instead. Table 7.23 shows the top ten adjectival collocates of boy and girl.
The results are broadly similar in kind to those in Caldas-Coulthard & Moon
(2010): boy collocates mainly with neutral descriptive terms (small, lost, big, new),
or with terms with which it forms a fixed expression (old, dear, toy, whipping).
There are the evaluative adjectives rude (which in Caldas-Coulthard and Moon’s
data is often applied to young men of Jamaican descent) and its positively con-
noted equivalent naughty. The collocates of girl are overwhelmingly evaluative,
related to physical appearance. There are just two neutral adjectives (other and
dead, the latter tying in with a general observation that women are more often
spoken of as victims of crimes and other activities than men). Finally, there is
one adjective signaling marital status. These results also generally reflect Cal-
das-Coulthard and Moon’s findings (in the yellow-press, the evaluations are of-
ten heavily sexualized in addition).
This case study shows how collocation research may uncover facts that go well
beyond lexical semantics or semantic prosody. In this case, the collocates of boy
and girl have uncovered a general attitude that sees the latter as up for constant
evaluation while the former are mainly seen as a neutral default. That the adjec-
tives dead and unmarried are among the top ten collocates in a representative,
relatively balanced corpus, hints at something darker – a patriarchal world view
that sees girls as victims and sexual partners and not much else (other studies
investigating gender stereotypes on the basis of collocates of man and woman
are Gesuato (2003) and Pearce (2008)).
8 Grammar
The fact that corpora are most easily accessed via words (or word forms) is also
reflected in many corpus studies focusing on various aspects of grammatical
structure. Many such studies either take (sets of) words as a starting point for
studying various aspects of grammatical structure, or they take easily identifiable
aspects of grammatical structure as a starting point for studying the distribution
of words. However, as the case studies of the English possessive constructions in
Chapters 5 and 6 showed, grammatical structures can also be (and are) studied
in their own right, for example with respect to semantic, information-structural
and other restrictions they place on particular slots or sequences of slots, or with
respect to their distribution across texts, language varieties or demographic
groups.
The patterns of a word can be defined as all the words and structures which
are regularly associated with the word and which contribute to its meaning.
A pattern can be identified if a combination of words occurs relatively
Keeping this in mind, let us discuss the difference between frequency and per-
centages in some more detail. Note, first, that the reason the results do not change
perceptibly is because Renouf and Sinclair do not determine the percentage of
occurrences of words inside [a __ of ] for all words in the framework, but only
for the twenty nouns that they have already identified as most frequent – the
columns labeled (b) thus represent a ranking based on a mixed strategy of prese-
lecting words by their raw frequency and then ranking them by their proportions
inside and outside the framework.
If we were to omit the pre-selection stage and calculate the percentages for
all words occurring in the framework – as we should, if these percentages are
relevant – we would find 477 words in the BNC that occur exclusively in the
framework, and thus all have an association strength of 100 percent – among
them words that fit the proposed semantic preferences of the pattern, like bar-
relful, words that do not fit, like bomb-burst, hysterisation or Jesuitism, and many
misspellings, like fct (for fact) and numbe and numbr (for number). The prob-
lem here is that percentages, like some other association measures, massively
overestimate the importance of rare events. In order to increase the quality of
the results, let us remove all words that occur five times or less in the BNC. The
twenty words in Table 8.2 are then the words with the highest percentages of
occurrence in the framework [a __ of ].
Table 8.2: The collocational framework [a __ of ] in the BNC by per-
centage of occurrences
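To make the procedure explicit: for every word, we need its frequency inside the framework and its total corpus frequency; the percentage is the ratio of the two, and the frequency threshold is applied before ranking. The following is a minimal sketch in R; only barrelful and fct are taken from the discussion above, and all counts are invented for illustration:

    # Percentage-of-occurrence ranking with a frequency cut-off
    d <- data.frame(word     = c("lot", "barrelful", "fct"),
                    in_frame = c(15000, 6, 1),   # occurrences inside [a __ of]
                    total    = c(30000, 6, 1))   # all occurrences (invented)
    d$pct <- 100 * d$in_frame / d$total
    d[d$total > 5, ]  # discard words occurring five times or less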
This list is obviously completely different from the one in Renouf & Sinclair
(1991) or our replication. We would not want to call them typical for [a __ of ], in
the sense that it is not very probable that we will encounter them in this collo-
cational framework. However, note that they generally represent the same cate-
gories as the words in Table 8.1, namely ‘quantity’ and ‘part-whole’, indicating
a relevant relation to the framework. This relation is, in fact, the counterpart to
the one shown in Table 8.1: These are words for which the framework [a __ of ]
is typical, in the sense that if we encounter these words, it is very probable that
they will be accompanied by this collocational framework.
There are words that are typical for a particular framework, and there are
frameworks that are typical for particular words; this difference in perspective
may be of interest in particular research designs (cf. Stefanowitsch & Flach 2016
for further discussion). Generally, however, it is best to use an established associ-
ation measure that will not overestimate rare events. Table 8.3 shows the fifteen
most strongly associated words in the framework based on the G statistic.
Table 8.3: Top collocates of the collocational framework [a __ of ]
(BNC)
(1) odd (21), special (20), different (18), familiar (16), strange (13), sinister (8),
disturbing (5), funny (5), wrong (5), absurd (4), appealing (4), attractive (4),
fishy (4), paradoxical (4), sad (4), unusual (4), impressive (3), shocking (3),
spooky (3), touching (3), unique (3), unsatisfactory (3)
The list clearly supports Hunston and Francis’s claim about the meaning of
this grammar pattern – most of the adjectives are inherently evaluative. There
are a few exceptions – different, special, unusual and unique do not have to be
used evaluatively. If they occur in the pattern [there Vlink something ADJ about
NP], however, they are likely to be interpreted evaluatively.
As Hunston & Francis (2000: 105) point out: “Even when potentially neutral
words such as nationality words, or words such as masculine and feminine, are
used in this pattern, they take on an evaluative meaning”. This is, in fact, a crucial
feature of grammar patterns, as it demonstrates that these patterns themselves
are meaningful and are able to impart their meaning on words occurring in them.
The following examples demonstrate this:
The adjective cyclical is neutral, but the adverb horribly shows that it is meant
evaluatively in (2a); dead in its literal sense is purely descriptive, but when ap-
plied to things (like a house in 2b), it becomes an evaluation; finally, metallic is
also neutral, but it is used to evaluate a sound negatively in (2c), as shown by the
phrasing at least ... but.
Instead of listing frequencies, of course, we could calculate the association
strength between the pattern [there Vlink something ADJ about NP] and the ad-
jectives occurring in it. I will discuss in more detail how this is done in the next
subsection; for now, suffice it to say that it would give us the ranking in Table 8.4.
Table 8.4: Most strongly associated adjectives in the pattern [there Vlink
something ADJ about NP]
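As a preview of how such a ranking can be computed, here is a minimal sketch in R, reusing the g_score() function from Section 7.1.3 above; only the pattern frequency of odd (21) is taken from the list in (1), all other totals are invented for illustration:

    # Association between odd and [there V-link something ADJ about NP]
    odd_in_pattern    <- 21                       # from (1) above
    odd_elsewhere     <- 4500 - 21                # invented corpus frequency of odd
    others_in_pattern <- 207 - 21                 # invented total of pattern tokens
    others_elsewhere  <- 5e6 - 4500 - (207 - 21)  # invented adjective total
    g_score(matrix(c(odd_in_pattern, odd_elsewhere,
                     others_in_pattern, others_elsewhere),
                   nrow = 2, byrow = TRUE))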
The ranking does not differ radically from the ranking by frequency in (1)
above, but note that the descriptive adjectives special and different are moved
down the list a few ranks and unique and unusual disappear from the top twenty,
sharpening the semantic profile of the pattern.
This case study is meant to introduce the notion of grammar patterns and to
show that these patterns often have a relatively stable meaning that can be un-
covered by looking at the words that are frequent in (or strongly associated with)
them. Like the preceding case study, it also introduces the idea that the relation-
ship between words and units of grammatical structure can be investigated using
the logic of association measures. The next sections look at this in more detail.
                                  Argument Structure
                       ditransitive          ¬ditransitive             Total
Verb   give             461 (8.63)              687 (1139.37)           1148
       ¬give            574 (1026.37)       135 907 (135 454.63)     136 481
       Total           1035                 136 594                  137 629
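The 𝑝-values in collexeme tables like the one below are derived from exactly this kind of contingency table, typically with a one-tailed Fisher's exact test. A minimal sketch in R for give:

    # give and the ditransitive (frequencies from the table above)
    m <- matrix(c(461, 687, 574, 135907), nrow = 2, byrow = TRUE)
    fisher.test(m, alternative = "greater")$p.value
    # [1] 0 -- the value underflows double precision, hence the 0 reported for give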
Collexeme   Collexeme with Ditransitive   Collexeme with other ASCs   Other verbs with Ditransitive   Other verbs with other ASCs   𝑝exact
give 461 687 574 136 942 0
tell 128 660 907 136 969 1.60 × 10−127
send 64 280 971 137 349 7.26 × 10−68
offer 43 152 992 137 477 3.31 × 10−49
show 49 578 986 137 051 2.23 × 10−33
cost 20 82 1015 137 547 1.12 × 10−22
teach 15 76 1020 137 553 4.32 × 10−16
award 7 9 1028 137 620 1.36 × 10−11
allow 18 313 1017 137 316 1.12 × 10−10
lend 7 24 1028 137 605 2.85 × 10−9
The hypothesis is corroborated: the top ten collexemes (and most of the other
significant collexemes not shown here) refer to literal or metaphorical transfer.
However, note that on the basis of lists like that in Table 8.8 we cannot reject
a null hypothesis along the lines of “There is no relationship between the ditran-
sitive and the encoding of transfer events”, since we did not test this. All we can
say is that we can reject null hypotheses stating that there is no relationship
between the ditransitive and each individual verb on the list. In practice, this
may amount to the same thing, but if we wanted to reject the more general null
hypothesis, we would have to code all verbs in the corpus according to whether
they are transfer verbs or not, and then show that transfer verbs are significantly
more frequent in the ditransitive construction than in the corpus as a whole.
                                  Argument Structure
                       ditransitive        to-dative             Total
Verb   give             461 (212.68)        146 (394.32)            607
       ¬give            574 (822.32)       1773 (1524.68)          2347
       Total           1035                1919                    2954
Give is significantly more frequent than expected in the ditransitive and less
frequent than expected in the to-dative (𝑝exact = 1.8 × 10−120 ). It is therefore said
to be a significant distinctive collexeme of the ditransitive (again, I will use the
term differential instead of distinctive in the following). Table 8.8 shows the top
ten differential collexemes for each construction.
Generally speaking, the list for the ditransitive is very similar to the one we
get if we calculate the simple collexemes of the construction; crucially, many of
the differential collexemes of the to-dative highlight the spatial distance covered
by the transfer, which is in line with what the hypothesis predicts.
Table 8.8: Verbs in the ditransitive and the prepositional dative (ICE-
GB, Gries & Stefanowitsch 2004: 106).
Collexeme   Collexeme with Ditransitive   Collexeme with to-Dative   Other verbs with Ditransitive   Other verbs with to-Dative   𝑝exact
most strongly associated with the ditransitive
give 461 146 574 1773 1.84 × 10−120
tell 128 2 907 1917 8.77 × 10−58
show 49 15 986 1904 8.32 × 10−12
offer 43 15 992 1904 9.95 × 10−10
cost 20 1 1015 1918 9.71 × 10−9
teach 15 1 1020 1918 1.49 × 10−6
wish 9 1 1026 1918 0.00
ask 12 4 1023 1915 0.00
promise 7 1 1028 1918 0.00
deny 8 3 1027 1916 0.01
most strongly associated with the to-dative
bring 7 82 1028 1837 1.47 × 10−9
play 1 37 1034 1882 1.46 × 10−6
take 12 63 1023 1856 0.00
pass 2 29 1033 1890 0.00
make 3 23 1032 1896 0.01
sell 1 14 1034 1905 0.01
do 10 40 1025 1879 0.02
supply 1 12 1034 1907 0.03
read 1 10 1034 1909 0.06
hand 5 21 1030 1898 0.06
corpus linguists occasionally agree with this claim. For example, McEnery & Wil-
son (2001: 11), in their otherwise excellent introduction to corpus-linguistic think-
ing, cite the sentence in (3):
They point out that this sentence will not occur in any given finite corpus, but
that this does not allow us to declare it ungrammatical, since it could simply be
one of infinitely many sentences that “simply haven’t occurred yet”. They then
offer the same solution Chomsky has repeatedly offered:
                                  Argument Structure
                       ditransitive          ¬ditransitive            Total
Verb   say                0 (44.52)           3333 (3288.48)          3333
       ¬say            1824 (1779.48)      131 394 (131 438.52)    133 218
       Total           1824                134 727                 136 551
Fisher’s exact test shows that the observed frequency of zero differs significantly from that expected by chance (𝑝 = 1.96 × 10⁻²⁰; cf. the value for say in Table 8.10), as does a 𝜒² test (𝜒² = 46.25, df = 1, 𝑝 < 0.001). In other words, it is very unlikely that sentences like
Alex said Joe the answer “simply haven’t occurred yet” in the corpus. Instead, we
can be fairly certain that say cannot be used with ditransitive complementation
in English. Of course, the corpus data do not tell us why this is so, but neither
would an acceptability judgment from a native speaker.
Table 8.10 shows the twenty verbs whose non-occurrence in the ditransitive is
statistically most significant in the ICE-GB (see Stefanowitsch 2006b: 67). Since
the frequency of co-occurrence is always zero and the frequency of other words
in the construction is therefore constant, the order of association strength corre-
sponds to the order of the corpus frequency of the words. The point of statistical
testing in this case is to determine whether the absence of a particular word is
significant or not.
Note that since zero is no different from any other frequency of occurrence,
this procedure does not tell us anything about the difference between an inter-
section of variables that did not occur at all and an intersection that occurred
with any other frequency less than the expected one. All the method tells us is
whether an occurrence of zero is significantly less than expected.
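The exact probability of such a zero frequency follows directly from the hypergeometric distribution. A minimal sketch (again assuming Python with scipy, using the marginal frequencies given above):

    from scipy.stats import hypergeom

    # probability that say occurs 0 times in the 1824 ditransitive slots,
    # given 3333 tokens of say among 136 551 verb tokens overall
    p_zero = hypergeom.pmf(0, 136551, 3333, 1824)
    print(p_zero)  # roughly 2e-20, cf. the value for say in Table 8.10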
Table 8.10: The twenty verbs whose non-occurrence in the ditransitive is most significant (ICE-GB, Stefanowitsch 2006b: 67)
Collexeme   Collexeme with Ditransitive   Collexeme elsewhere   Other verbs with Ditransitive   Other verbs elsewhere   𝑝exact
be 0 25 416 1824 109 311 4.29 × 10⁻¹⁶⁵
be|have 0 6261 1824 128 466 3.66 × 10⁻³⁸
have 0 4303 1824 130 424 2.90 × 10⁻²⁶
think 0 3335 1824 131 392 1.90 × 10⁻²⁰
say 0 3333 1824 131 394 1.96 × 10⁻²⁰
know 0 2120 1824 132 607 3.32 × 10⁻¹³
see 0 1971 1824 132 756 2.54 × 10⁻¹²
go 0 1900 1824 132 827 6.69 × 10⁻¹²
want 0 1256 1824 133 471 4.27 × 10⁻⁸
use 0 1222 1824 133 505 6.77 × 10⁻⁸
come 0 1140 1824 133 587 2.06 × 10⁻⁷
look 0 1099 1824 133 628 3.59 × 10⁻⁷
try 0 749 1824 133 978 4.11 × 10⁻⁵
mean 0 669 1824 134 058 1.21 × 10⁻⁴
work 0 646 1824 134 081 1.65 × 10⁻⁴
like 0 600 1824 134 127 3.08 × 10⁻⁴
feel 0 593 1824 134 134 3.38 × 10⁻⁴
become 0 577 1824 134 150 4.20 × 10⁻⁴
happen 0 523 1824 134 204 8.70 × 10⁻⁴
put 0 513 1824 134 214 9.96 × 10⁻⁴
In other words, the method makes no distinction between zero occurrence and
other less-frequent-than-expected occurrences. However, Stefanowitsch (2006b:
70f) argues that this is actually an advantage: if we were to treat an occurrence
of zero as special as opposed to, say, an occurrence of 1, then a single counterexample to an intersection of variables hypothesized to be impossible would appear to disprove our hypothesis. The knowledge that a particular intersection is sig-
nificantly less frequent than expected, in contrast, remains relevant even when
faced with apparent counterexamples. And, as anyone who has ever elicited ac-
ceptability judgments – from someone else or introspectively from themselves –
knows, the same is true of such judgments: We may strongly feel that something
is unacceptable even though we know of counterexamples (or can even think of
such examples ourselves) that seem possible but highly unusual.
Of course, applying significance testing to zero occurrences of some intersec-
tion of variables is not always going to provide a significant result: if one (or
both) of the values of the intersection are rare in general, an occurrence of zero
may not be significantly less than expected. In this case, we still do not know
whether the absence of the intersection is due to chance or to its impossibility
– but with such rare combinations, acceptability judgments are also going to be
variable.
(4) a. It rained almost every day, and she began to feel imprisoned. (BNC
A7H)
b. [T]here was a slightly strained period when he first began working with
the group... (BNC AT1)
c. The baby wakened and started to cry. (BNC CB5)
d. Then an acquaintance started talking to me and diverted my attention.
(BNC ABS)
Schmid’s study is deductive. He starts by deriving from the literature two hy-
potheses concerning the choice between the verbs begin and start on the one
hand and to- and the ing-complements on the other: First, that begin signals
gradual onsets and start signals sudden ones, and second, that ing-clauses are
typical of dynamic situations while to-clauses are typical of stative situations.
His focus is on the second hypothesis, which he tests on the LOB corpus by,
first, identifying all occurrences of both verbs with both complementation pat-
terns and, second, categorizing them according to whether the verb in the com-
plement clause refers to an activity, a process or a state. The study involves three
nominal variables: Matrix Verb (with the values begin and start), Comple-
mentation (with the values ing-clause and to-clause) and Aktionsart (with
the values activity, process and state). Thus, we are dealing with a multivari-
ate design. The prediction with respect to the complementation pattern is clear –
ing-complements should be associated with activities and to-complements with
states, with processes falling somewhere in between. There is no immediate pre-
diction with respect to the choice of verb, but Schmid points out that activities
are more likely to have a sudden onset, while states and processes are more likely
to have a gradual onset, thus the former might be expected to prefer start and
the latter begin.
Schmid does not provide an annotation scheme for the categories activity, pro-
cess and state, discussing these crucial constructs in just one short paragraph:
Essentially, the three possible types of events that must be considered in the
context of a beginning are activities, processes and states. Thus, the speaker
may want to describe the beginning of a human activity like eating, working
or singing; the beginning of a process which is not directly caused by a
human being like raining, improving, ripening; or the beginning of a state.
Since we seem to show little interest in the beginning of concrete, visible
states (cf. e.g. ? The lamp began to stand on the table.) the notion of state is
in the present context largely confined to bodily, intellectual and emotive
states of human beings. Examples of such “private states” (Quirk et al. 1985:
202f) are being ill, understanding, loving.
Quirk et al. (1985: 202–203) mention four types of “private states”: intellectual
states, like know, believe, realize; states of emotion or attitude, like intend, want,
pity; states of perception, like see, smell, taste; and states of bodily sensation, like
hurt, itch, feel cold. Based on this and Schmid’s comments, we might come up
with the following rough annotation scheme which thereby becomes our opera-
tionalization for Aktionsart:
Schmid also does not state how he deals with passive sentences like those in
(6a, b):
(6) a. [T]he need for correct and definite leadership began to be urgently
felt... (LOB G03)
b. Presumably, domestic ritual objects began to be made at much the same
time. (LOB J65)
Annotating all 348 hits according to the annotation scheme sketched out above
yields the data in Table 8.11 (the complete annotated concordance is part of the
Supplementary Online Material). The total frequencies as well as the proportions
among the categories differ slightly from the data Schmid reports, but the results
are overall very similar.
As always in a configural frequency analysis, we have to correct for multiple
testing: there are twelve cells in our table, so our probability of error must be
lower than 0.05/12 = 0.0042. The individual cells (i.e., intersections of variables)
have one degree of freedom, which means that our critical 𝜒² value is 8.20. This
means that there are two types and three antitypes that reach significance: as
predicted, activity verbs are positively associated with the verb start in combi-
nation with ing-clauses and state verbs are positively associated with the verb
begin in combination with to-clauses. Process verbs are also associated with be-
gin with to-clauses, but only marginally significantly so. As for the antitypes,
all three verb classes are negatively associated (i.e., less frequent than expected)
with the verb begin with ing-clauses, which suggests that this complementation
pattern is generally avoided with the verb begin, but this avoidance is particu-
larly pronounced with state and process verbs, where it is statistically significant:
the verb begin and state/process verbs both avoid ing-complementation, and this
avoidance seems to add up when they are combined.
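The correction and the critical value can be sketched as follows (Python with scipy assumed; not part of the original study):

    from scipy.stats import chi2

    alpha = 0.05 / 12                     # Bonferroni correction for 12 cells
    critical = chi2.ppf(1 - alpha, df=1)  # critical chi-square value at df = 1
    print(round(alpha, 4), round(critical, 2))  # 0.0042, approx. 8.2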
Table 8.11: Aktionsart, matrix verb and complementation type (LOB)
begin start
Aktionsart ing to Total ing to Total Total
activity Obs.: 22 Obs.: 93 115 Obs.: 45 Obs.: 23 68 183
Exp.: 29.86 Exp.: 106.86 Exp.: 10.11 Exp.: 36.17
𝜒 2: 2.07 𝜒 2: 1.80 𝜒 2: 120.48 𝜒 2: 4.80
process Obs.: 1 Obs.: 65 66 Obs.: 5 Obs.: 10 15 81
Exp.: 13.22 Exp.: 47.30 Exp.: 4.47 Exp.: 16.01
𝜒 2: 11.29 𝜒 2: 6.62 𝜒 2: 0.06 𝜒 2: 2.26
state Obs.: 1 Obs.: 78 79 Obs.: 2 Obs.: 3 5 84
Exp.: 13.71 Exp.: 49.05 Exp.: 4.64 Exp.: 16.60
𝜒 2: 11.78 𝜒 2: 17.08 𝜒 2: 1.50 𝜒 2: 11.14
Total 24 236 260 52 36 88 348
Faced with these results, we might ask, first, how they relate to three simpler tests of Schmid’s hypothesis – namely three bivariate designs separately testing (a) the relationship between Aktionsart and Complementation, (b) the relationship between Aktionsart and Matrix Verb and (c) the relationship between Matrix Verb and Complementation Type. We have all the data we need to test
this in Table 8.11, we just have to sum up appropriately. Table 8.12 shows that Ak-
tionsart and Complementation Type are related: activity verbs prefer ing,
the other two verb types prefer to (𝜒² = 49.702, df = 2, 𝑝 < 0.001).
Table 8.12: Aktionsart and complementation type (LOB)
Complementation Type
ing to Total
Aktionsart activity 67 116 183
(39.97) (143.03)
process 6 75 81
(17.69) (63.31)
state 3 81 84
(18.34) (65.66)
Total 76 272 348
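A sketch of this bivariate test (Python with scipy assumed; the frequencies are those of Table 8.12):

    from scipy.stats import chi2_contingency

    observed = [[67, 116],  # activity: ing, to
                [6, 75],    # process
                [3, 81]]    # state
    chi2_value, p, dof, expected = chi2_contingency(observed, correction=False)
    print(round(chi2_value, 3), dof)  # approx. 49.7, 2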
Table 8.13 shows that Aktionsart and Matrix Verb are also related: activity verbs prefer start and state verbs prefer begin (𝜒² = 32.236, df = 2, 𝑝 < 0.001).
Table 8.13: Aktionsart and matrix verb (LOB)
Matrix Verb
begin start Total
Aktionsart activity 115 68 183
(136.72) (46.28)
process 66 15 81
(60.52) (20.48)
state 79 5 84
(62.76) (21.24)
Total 260 88 348
Finally, Table 8.14 shows that Matrix Verb and Complementation Type are
also related: begin prefers to and start prefers ing (𝜒² = 95.755, df = 1, 𝑝 <
0.001).
Table 8.14: Matrix verb and complementation type
Complementation Type
ing to Total
Matrix Verb begin 24 236 260
(56.78) (203.22)
start 52 36 88
(19.22) (68.78)
Total 76 272 348
In other words, every variable in the design is related to every other variable
and the multivariate analysis in Table 8.11 shows that the effects observed in the
individual bivariate designs simply add up when all three variables are investi-
gated together. Thus, we cannot tell whether any of the three relations between
the variables is independent of the other two. In order to determine this, we
would have to keep each of the variables constant in turn to see whether the
other two still interact in the predicted way (for example, whether begin prefers
to-clauses and start prefers ing-clauses even if we restrict the analysis to activity
verbs, etc.).
The second question we could ask faced with Schmid’s results is to what extent
his second hypothesis – that begin is used with gradual beginnings and start with
sudden ones – is relevant for the results. As mentioned above, it is not tested
directly, so how could we remedy this? One possibility is to look at each of the
348 cases in the sample and try to determine the gradualness or suddenness of
the beginning they denote. This is sometimes possible, as in (4a) above, where
the context suggests that the referent of the subject began to feel imprisoned
gradually the longer the rain went on, or in (4c), which suggests that the crying
began suddenly as soon as the baby woke up. But in many other cases, it is very
difficult to judge whether a beginning is sudden or gradual – as in (4b, d). To
come up with a reliable annotation scheme for this categorization task would be
quite a feat.
There is an alternative, however: speakers sometimes use adverbs that explic-
itly refer to the type of beginning. A query of the BNC for ⟨ [word=".*ly"%c]
[hw="(begin|start)"] ⟩ yields three relatively frequent adverbs (occurring more
than 10 times) indicating suddenness (immediately, suddenly and quickly), and
three indicating gradualness (slowly, gradually and eventually). There are more,
like promptly, instantly, rapidly, leisurely, reluctantly, etc., which are much less
frequent.
By extracting just these cases, we can directly test the hypothesis that begin
signals gradual and start sudden onsets. The BNC contains 307 cases of [begin/
start to Vinf ] and [begin/start Ving] directly preceded by one of the six adverbs
mentioned above. Table 8.15 shows the result of a configural frequency analysis
of the variables Onset (sudden vs. gradual), Matrix Verb (begin vs. start) and Complementation Type (ing vs. to).
Table 8.15: Onset, matrix verb and complementation type (BNC)
begin start
Onset ing to Total ing to Total Total
sudden Obs.: 25 Obs.: 85 110 Obs.: 58 Obs.: 36 94 204
Exp.: 36.60 Exp.: 89.65 Exp.: 22.54 Exp.: 55.21
𝜒 2: 3.68 𝜒 2: 0.24 𝜒 2: 55.79 𝜒 2: 6.68
gradual Obs.: 1 Obs.: 79 80 Obs.: 5 Obs.: 18 23 103
Exp.: 18.48 Exp.: 45.27 Exp.: 11.38 Exp.: 27.87
𝜒 2: 16.53 𝜒 2: 25.14 𝜒 2: 3.58 𝜒 2: 3.50
Total 26 164 190 63 54 117 307
Since there are eight cells in the table, the corrected 𝑝-value is 0.05/8 = 0.00625;
the individual cells have one degree of freedom, so the critical 𝜒² value is 7.48.
There are two significant types: sudden ∩ start ∩ ing and gradual ∩ begin ∩ to.
There is one significant and one marginally significant antitype: gradual ∩ begin ∩ ing and sudden ∩ start ∩ to, respectively. This corroborates the hypothesis
that begin signals gradual onsets and start signals sudden ones, at least when the
matrix verbs occur with their preferred complementation pattern.
Summing up the results of both studies, we could posit two “prototype” patterns (in the sense of cognitive linguistics): [begin_gradual V_stative-ing] and [start_sudden to-V_activity-inf], and we could hypothesize that speakers will choose the pattern that
matches most closely the situation they are describing (something that could
then be tested, for example, in a controlled production experiment).
This case study demonstrated a complex design involving grammar, lexis and
semantic categories. It also demonstrated that semantic categories can be in-
cluded in a corpus linguistic design in the form of categorization decisions on the
basis of an annotation scheme (in which case, of course, the annotation scheme
must be documented in sufficient detail for the study to be replicable), or in the
form of lexical items signaling a particular meaning explicitly, such as adverbs
of gradualness (in which case we need a corpus large enough to contain a suffi-
cient number of hits including these items). It also demonstrated that such cor-
pus-based studies may result in very specific hypotheses about the function of
lexicogrammatical structures that may become the basis for claims about mental
representation.
Observed Expected 𝜒²
More Frequent ADJ1 34 29.50 0.69
ADJ2 25 29.50 0.69
Total 59 1.37
Table 8.18: Sample of monosyllabic binomials and their sonority
Expression Order A Order B Frozenness More Sonor.   Expression Order A Order B Frozenness More Sonor.   Expression Order A Order B Frozenness More Sonor.
beck and call 40 0 1.00 2nd (contd.) (contd.)
bread and jam 30 0 1.00 2nd iron and steel 127 3 0.98 2nd love and care 33 6 0.85 2nd
bride and groom 94 0 1.00 2nd bread and wine 42 1 0.98 2nd boy and girl 43 8 0.84 1st
cat and mouse 37 0 1.00 2nd wife and son 35 1 0.97 2nd food and wine 74 14 0.84 2nd
day and age 106 0 1.00 1st head and tail 34 1 0.97 2nd age and sex 145 32 0.82 1st
fire and life 31 0 1.00 1st rise and fall 170 5 0.97 2nd meat and fish 30 7 0.81 2nd
life and limb 44 0 1.00 2nd song and dance 68 2 0.97 1st war and peace 71 17 0.81 1st
light and shade 56 0 1.00 2nd sight and sound 33 1 0.97 2nd time and place 187 45 0.81 1st
park and ride 45 0 1.00 2nd start and end 33 1 0.97 2nd road and rail 67 18 0.79 2nd
pay and file 33 0 1.00 1st stock and barrel 30 1 0.97 2nd arm and leg 31 9 0.78 1st
rank and file 159 0 1.00 2nd bread and cheese 81 3 0.96 2nd land and sea 51 15 0.77 2nd
right and wrong 126 0 1.00 2nd wife and child 54 2 0.96 1st north and west 112 36 0.76 1st
rock and roll 106 0 1.00 2nd hand and foot 52 2 0.96 1st day and night 310 101 0.75 1st
tooth and nail 38 0 1.00 2nd church and state 102 4 0.96 1st date and time 74 25 0.75 2nd
touch and go 43 0 1.00 2nd knife and fork 87 4 0.96 1st date and place 28 10 0.74 2nd
track and field 47 0 1.00 2nd oil and gas 392 26 0.94 1st mind and body 138 51 0.73 2nd
life and work 147 1 0.99 1st front and rear 30 2 0.94 2nd south and east 119 44 0.73 1st
flesh and blood 109 1 0.99 1st fruit and veg 36 3 0.92 2nd home and school 43 16 0.73 2nd
fish and chip 103 1 0.99 1st pride and joy 66 6 0.92 2nd north and east 68 27 0.72 1st
mum and dad 490 5 0.99 1st food and fuel 30 4 0.88 2nd snow and ice 53 22 0.71 1st
food and drink 333 4 0.99 1st stress and strain 37 5 0.88 2nd size and shape 116 53 0.69 1st
horse and cart 74 1 0.99 1st face and neck 59 9 0.87 1st south and west 73 37 0.66 1st
ebb and flow 68 1 0.99 2nd wind and rain 94 15 0.86 2nd science and art 30 16 0.65 1st
man and wife 63 1 0.98 1st size and weight 31 5 0.86 1st time and cost 30 21 0.59 1st
heart and soul 57 1 0.98 2nd heart and lung 30 5 0.86 2nd nose and mouth 35 26 0.57 1st
hip and thigh 50 1 0.98 2nd league and cup 41 7 0.85 1st care and skill 36 30 0.55 1st
lock and key 44 1 0.98 2nd head and neck 79 14 0.85 1st
Finally, we need to code the final consonants of all nouns for sonority and de-
termine which of the two final consonants is more sonorant – that of the first
noun, or that of the second. For this, let us use the following (hopefully uncon-
troversial) sonority hierarchy:
(8) [vowels] > [semivowels] > [liquids] > [h] > [nasals] > [voiced fricatives] >
[voiceless fricatives] > [voiced affricates] > [voiceless affricates] > [voiced
stops] > [voiceless stops]
The result of all these steps is shown in Table 8.18. The first column shows the
binomial in its most frequent order, the second column gives the frequency of
the phrase in this order, the third column gives the frequency of the less frequent
order (the reversal of the one shown), the fourth column gives the degree of
frozenness (i.e., the percentage of the more frequent order), and the fifth column
records whether the final consonant of the first or of the second noun is more
sonorant.
Let us first simply look at the number of cases for which the claim is true
or false. There are 42 cases where the second word’s final consonant is more
sonorant than that of the first word (as predicted), and 36 cases where the
second word’s final consonant is less sonorant than that of the first (counter to
the prediction). As Table 8.19 shows, this difference is nowhere near significant.
Table 8.19: Sonority of the final consonant and word order in binomials
(counts)
Observed Expected 𝜒²
More Sonorant First Word 36 39 0.23
Second Word 42 39 0.23
Total 78 0.46
However, note that we are including both cases with a very high degree of
frozenness (like beck and call, flesh and blood, or lock and key) and cases with
a relatively low degree of frozenness (like nose and mouth, day and night, or
snow and ice): this will dilute our results, as the cases with low frozenness are not
predicted to adhere very strongly to the less-before-more-sonorant principle.
We could, of course, limit our analysis to cases with a high degree of frozen-
ness, say, above 90 percent (the data is available, so you might want to try). How-
ever, it would be even better to keep all our data and make use of the rank order
that the frozenness measure provides: the prediction is that cases with a high
frozenness rank will adhere to the sonority constraint with a higher probability
than those with a low frozenness rank. Table 8.18 contains all the data we need
to determine the median of words adhering or not adhering to the constraint, as
well as the rank sums and number of cases, which we need to calculate a U-test.
We will not go through the test step by step (but you can try for yourself if you
want to). Table 8.20 provides the necessary values derived from Table 8.18.
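For readers who do want to try, here is a sketch of the test (Python with scipy assumed; the two lists stand for the frozenness scores of Table 8.18, shortened to placeholders here – the real test uses all 78 values):

    from scipy.stats import mannwhitneyu

    adhering = [1.00, 1.00, 0.99, 0.98, 0.96]   # "2nd" final consonant more sonorant
    violating = [1.00, 0.99, 0.97, 0.86, 0.75]  # "1st" final consonant more sonorant
    u, p = mannwhitneyu(adhering, violating, alternative="greater")
    print(u, p)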
Table 8.20: Sonority of the final consonant and word order in binomials
(ranks)
variation (e.g. Rohdenburg 1995; 2003). The general idea is that in contexts that
are already complex, speakers will try to choose a variant that reduces (or at
least does not contribute further to) this complexity. A particularly striking ex-
ample of this is what Rohdenburg (adapting a term from Karl Brugmann) calls
the horror aequi principle: “the widespread (and presumably universal) tendency
to avoid the repetition of identical and adjacent grammatical elements or struc-
tures” (Rohdenburg 2003: 206).
Take the example of matrix verbs that normally occur alternatively with a
to-clause or an ing-clause (like begin and start in Case Study 8.2.3.1 above). Ro-
hdenburg (1995: 380) shows, on the basis of the text of an 18th century novel, that
where such matrix verbs occur as to-infinitives, they display a strong preference
for ing-complements. For example, in the past tense, start freely occurs either
with a to-complement, as in (9a) or with an ing-complement, as in (9b). However,
as a to-infinitive it would avoid the to-complement, although not completely, as
(9c) shows, and strongly prefer the ing-form, as in (9d):
Table 8.21 shows the observed and expected frequencies of to- and ing-complements for each of these forms, together with the 𝜒² components.
Table 8.21: Complementation of start and the horror aequi principle
Complementation Type
Verb Form ing to Total
∅ start Obs.: 2706 Obs.: 1434 4140
Exp.: 2249.15 Exp.: 1890.85
𝜒 2: 92.80 𝜒 2: 110.38
to start Obs.: 1108 Obs.: 74 1182
Exp.: 642.15 Exp.: 539.85
𝜒 2 : 337.96 𝜒 2 : 402.00
starts Obs.: 496 Obs.: 582 1078
Exp.: 585.65 Exp.: 492.35
𝜒 2: 13.72 𝜒 2: 16.32
started Obs.: 3346 Obs.: 3416 6762
Exp.: 3673.61 Exp.: 3088.39
𝜒 2: 29.22 𝜒 2: 34.75
starting Obs.: 40 Obs.: 964 1004
Exp.: 545.45 Exp.: 458.55
𝜒 2 : 468.38 𝜒 2 : 557.13
The most obvious and, in terms of their 𝜒² components, most significant de-
viations from the expected frequencies are indeed those cases where the matrix
verb start has the same form as the complement clause: there are far fewer cases
of [to start to Vinf ] and [starting Vpres.part. ] and, conversely, far more cases of
[to start Vpres.part. ] and [starting to Vinf ] than expected. Interestingly, if the base
form of start does not occur with an infinitive particle, the to-complement is still
strongly avoided in favor of the ing-complement, though not as strongly as in
the case of the base form with the infinitive particle. It may be that horror aequi
is a graded principle – the stronger the similarity, the stronger the avoidance.
This case study is intended to introduce the notion of horror aequi, which has
been shown to influence a number of grammatical and morphological variation
phenomena (cf. e.g. Rohdenburg 2003; Vosberg 2003; Rudanko 2003; Gries & Hilpert 2010). Methodologically, it is a straightforward application of the 𝜒² test,
but one where the individual cells of the contingency table and their 𝜒 2 values
are more interesting than the question whether the observed distribution as a
whole differs significantly from the expected one.
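A sketch of how such cell components are computed (Python; the observed and expected frequencies are taken from Table 8.21, and the results match the table up to rounding):

    # chi-square component of a cell: (observed - expected)^2 / expected
    cells = {
        "[to start to V]":   (74, 539.85),
        "[to start V-ing]":  (1108, 642.15),
        "[starting to V]":   (964, 458.55),
        "[starting V-ing]":  (40, 545.45),
    }
    for pattern, (obs, exp) in cells.items():
        print(pattern, round((obs - exp) ** 2 / exp, 2))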
This leaves 69 targets that are preceded by a comparative prime in the preced-
ing span of 20 tokens. Table 8.22 shows the distribution of analytic and synthetic
primes and targets (if there was more than one comparative in the preceding
context, only the one closer to the comparative in question was counted).
Table 8.22: Comparatives preceding comparatives in a context of 20
tokens (BNC)
Target
synthetic analytic Total
Prime synthetic 34 13 47
(27.93) (19.07)
analytic 7 15 22
(13.07) (8.93)
Total 41 28 69
(11) a. But the statistics for the second quarter, announced just before the
October Conference of the Conservative Party, were even more dam-
aging to the Government showing a rise of 17 percent on 1989. Indeed
these figures made even sorrier reading for the Conservatives when
one realised... (BNC G1J)
b. Over the next ten years, China will become economically more liberal,
internationally more friendly... (BNC ABD)
factoid that short-term memory can hold up to seven units). Table 8.23 shows
the distribution of primes and targets in this smaller window.
Table 8.23: Comparatives preceding comparatives in a context of 7 to-
kens (BNC)
Prime
synthetic analytic Total
Target synthetic 21 7 28
(15.40) (12.60)
analytic 1 11 12
(6.60) (5.40)
Total 22 18 40
frequently than men, but also shows qualitative differences in terms of their form
and function.
This kind of analysis requires very careful, largely manual data extraction and
annotation so it is limited to relatively small corpora, but let us see what we
can do in terms of a larger-scale analysis. Let us focus on tag questions with
negative polarity containing the auxiliary be (e.g. isn’t it, wasn’t she, am I not,
was it not). These can be extracted relatively straightforwardly even from an
untagged corpus using the following queries:
The query in (12a) will find all finite forms of the verb be (as non-finite forms
cannot occur in tag questions), followed by the negative clitic n’t, followed by
a pronoun; the query in (12b) will do the same thing for the full form of the
particle not, which then follows rather than precedes the pronoun. Both queries
will only find those cases that occur before a punctuation mark signaling a clause
boundary (what to include here will depend on the transcription conventions of
the corpus, if it is a spoken one).
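The queries themselves are not reproduced in this copy, but their shape can be reconstructed from the description. Here is one possible rendering of (12a) as a regular expression (my reconstruction, Python assumed – not the original query):

    import re

    finite_be = r"(?:am|is|are|was|were)"
    pronoun = r"(?:i|you|he|she|it|we|they)"
    # finite be + negative clitic + pronoun + clause-final punctuation
    tag_12a = re.compile(rf"\b{finite_be}n[’']t {pronoun}\s*[.?!,]",
                         re.IGNORECASE)
    print(bool(tag_12a.search("Nice weather, isn't it?")))  # True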
The queries are meant to work for the spoken portion of the BNC, which uses
the comma for all kinds of things, including hesitation or incomplete phrases, so
we have to make a choice whether to exclude it and increase the precision or to
include it and increase the recall (I will choose the latter option). The queries are
not perfect yet: British English also has the form ain’t it, so we might want to
include the query in (13a). However, ain’t can stand for be or for have, which low-
ers the precision somewhat. Finally, there is also the form innit in (some varieties
of) British English, so we might want to include the query in (13b). However, this
is an invariant form that can occur with any verb or auxiliary in the main clause,
so it will decrease the precision even further. We will ignore ain’t and innit here
(they are not particularly frequent and hardly change the results reported below):
In the part of the spoken BNC annotated for speaker sex, there are 3771 hits for the patterns in (12a, b) for female speakers (only 20 of which are for 12b), and 3092 hits for male speakers (only 42 of which are for 12b). Of course, we cannot
assume that there is an equal amount of male and female speech in the corpus,
so the question is what to compare these frequencies against. Obviously, such
tag questions will normally occur in declarative sentences with positive polarity
containing a finite form of be. Such sentences cannot be retrieved easily, so it
is difficult to determine their precise frequency, but we can estimate it. Let us
search for finite forms of be that are not followed by a negative clitic (isn’t) or the particle not within the next three tokens (the three-token window serves to also exclude cases where the particle is preceded by an adverb, as in is just/obviously/... not). There are 146 493 such
occurrences for female speakers and 215 219 for male speakers. The query will
capture interrogatives, imperatives, subordinate clauses and other contexts that
cannot contain tag questions, so let us draw a sample of 100 hits from both sam-
ples and determine how many of the hits are in fact declarative sentences with
positive polarity that could (or do) contain a tag question. Let us assume that we
find 67 hits in the female-speaker sample and 71 hits in the male-speaker sam-
ple to be such sentences. We can now adjust the total results of our queries by
multiplying them with 0.67 and 0.71 respectively, giving us 98 150 sentences for
female and 152 806 sentences for male speakers. In other words, male speakers
produce 60.89 percent of the contexts in which a negative polarity tag question
with be could occur. We can cross-check this by counting the total number of
words uttered by male and female speakers in the spoken part of the BNC: there
are 5 654 348 words produced by men and 3 825 804 words produced by women,
which means that men produce 59.64 percent of the words, which fits our esti-
mate very well.
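The arithmetic of this estimate is trivial but worth making explicit (a sketch; the precision values 0.67 and 0.71 are the proportions found in our two 100-hit samples):

    female_hits, male_hits = 146_493, 215_219
    female_estimate = female_hits * 0.67  # approx. 98 150
    male_estimate = male_hits * 0.71      # approx. 152 806
    print(round(male_estimate / (female_estimate + male_estimate), 4))  # 0.6089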
These numbers can now be used to compare the number of tag questions
against, as shown in Table 8.24. Since the tag questions that we found using
our queries have negative polarity, they are not included in the sample, but must
occur as tags to a subset of the sentences. This means that by subtracting the
number of tag questions from the total for each group of speakers, we get the
number of sentences without tag questions.
The difference between male and female speakers is highly significant, with
female speakers using substantially more tag questions than expected, and male
speakers using substantially fewer (𝜒² = 743.07, df = 1, 𝑝 < 0.001).
This case study was intended to introduce the study of sex-related differences
in grammar (or grammatical usage); cf. Mondorf (2004a) for additional studies
and an overview of the literature. It was also intended to demonstrate the kinds
of steps necessary to extract the required frequencies for a grammatical research
question from an untagged corpus, and the ways in which they might be esti-
mated if they cannot be determined precisely. Of course, these steps and consider-
ations depend to a large extent on the specific phenomenon under investigation;
Table 8.24: Negative polarity tag questions in male and female speech
in the spoken BNC
Speaker Sex
female male Total
Tag Question with 3771 3092 6863
(2684.15) (4178.85)
without 94 379 149 714 244 093
(95 465.85) (148 627.15)
Total 98 150 152 806 250 956
one reason for choosing tag questions with be is that they, and the sentences
against which to compare their frequency, are much easier to extract from an
untagged corpus than is the case for tag questions with have, or, worst of all, do
(think about all the challenges these would confront us with).
1961 1991
BrE AmE Total BrE AmE Total Total
preposition Obs.: 12 Obs.: 3 15 Obs.: 7 Obs.: 5 12 27
Exp.: 6.63 Exp.: 5.05 Exp.: 8.70 Exp.: 6.63
𝜒 2: 4.36 𝜒 2: 0.83 𝜒 2: 0.33 𝜒 2: 0.40
postposition Obs.: 0 Obs.: 1 1 Obs.: 2 Obs.: 7 9 10
Exp.: 2.45 Exp.: 1.87 Exp.: 3.22 Exp.: 2.45
𝜒 2: 2.45 𝜒 2: 0.40 𝜒 2: 0.46 𝜒 2: 8.42
Total 12 4 16 9 12 21 37
While the prepositional use is more frequent in both corpora from 1961, the
postpositional use is the more frequent one in the American English corpus from
1991. A CFA shows that the intersection 1991 ∩ am.engl ∩ postposition is the
only one whose observed frequency differs significantly from the expected one.
Due to the small number of cases, we would be well advised not to place too
much confidence in our results, but as it stands they fully corroborate Berlage’s
claims that British English prefers the prepositional use and American English
has recently begun to prefer the postpositional use.
This case study is intended to provide a further example of a multivariate de-
sign and to show that even small data sets may provide evidence for or against
a hypothesis. It is also intended to introduce the study of the convergence and/
or divergence of varieties and the basic design required. This field of studies is
of interest especially in the case of pluricentric languages like, for example, En-
glish, Spanish or Arabic (see Rohdenburg & Schlüter (2009), from which Berlage’s
study is taken, for a broad, empirically founded introduction to the contrastive
study of British and American English grammar; cf. also Leech & Kehoe (2006)).
(14) Therefore while thys onhappy sowle by the vyctoryse pompys of her en-
myes was goyng to be broughte into helle for the synne and onleful lustys
of her body.
Mair also notes that the going-to future is mentioned in grammars from 1646
onward; at the very latest, then, it was established at the end of the 17th century. If
a rise in discourse frequency is a precondition for grammaticalization, we should
see such a rise in the period leading up to the end of the 17th century; if not, we
should see such a rise only after this point.
Figure 8.1 shows Mair’s results based on the OED citations, redrawn as closely
as possible from the plot he presents (he does not report the actual frequencies).
It also shows frequency data for the query ⟨ [word="going"%c] [word="to"%c]
[pos="V.*"] ⟩ from the periods covered by the CLMET, LOB and FLOB. Note
that Mair categorizes his data in quarter centuries, so the same has to be done
for the CLMET. Most texts in the CLMET are annotated for a precise year of
publication, but sometimes a time span is given instead. In these cases, let us put
the texts into the quarter century that the larger part of this time span falls into.
LOB and FLOB represent the quarter-centuries in which they were published.
One quarter century is missing: there is no accessible corpus of British English
covering the period 1926–1950. Let us interpolate a value by taking the mean of the periods preceding and following it. To make the corpus data comparable to
Mair’s, there is one additional step that is necessary: Mair plots frequencies per
10 000 citations; citations in the relevant period have a mean length of 12 words
(see Hoffmann 2004: 25) – in other words, Mair’s frequencies are per 120 000
words, so we have to convert our raw frequencies into frequencies per 120 000
words too. Table 8.26 shows the raw frequencies, normalized frequencies and
data sources.
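The conversion itself is a simple normalization (a sketch; the figures in the example call are hypothetical):

    # 10 000 citations x 12 words per citation = 120 000 words
    def per_120k_words(raw_frequency, corpus_size_in_words):
        return raw_frequency / corpus_size_in_words * 120_000

    print(per_120k_words(250, 4_000_000))  # hypothetical: 7.5 per 120 000 words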
Table 8.26: Discourse frequency of going to
The grey line in Figure 8.1 shows Mair’s conservative estimate for the point at
which the construction was firmly established as a way to express future tense.
[Figure: the frequency of going to (per 120 000 words) plotted by quarter-century, 1601–1625 to 1976–2000; the curve stays low until the late 17th century and jumps afterwards, with the first attestations of gonna marked around that jump.]
Figure 8.1: Grammaticalization and discourse frequency of going to
As the results from the OED citations and from the corpora show, there was only
a small rise in frequency during the time that the construction became estab-
lished, but a substantial jump in frequency afterwards. Interestingly, around the
time of that jump, we also find the first documented instances of the contracted
form gonna (from Mair’s data – the contracted form is not frequent enough in the
corpora used here to be shown). These results suggest that semantic reanalysis
is the first step in grammaticalization, followed by a rise in discourse frequency
accompanied by phonological reduction.
This case study demonstrates that very large collections of citations can indeed
be used as a corpus, as long as we are investigating phenomena that are likely
to occur in citations collected to illustrate other phenomena; the results are very
similar to those we get from well-constructed linguistic corpora. The case study
also demonstrates the importance of corpora in diachronic research, a field of
study which, as mentioned in Chapter 1, has always relied on citations drawn
from authentic texts, but which can profit from querying large collections of
such texts and quantifying the results.
This design can be applied deductively, if we have hypotheses about the gen-
der-specific usage of particular (sets of) verbs, or inductively, if we simply cal-
culate the association strength of all verbs to one pronoun as compared to the
other. In either case we have two nominal variables, Subject of Quoted Speech,
with the variables male (he) and female (she), and Speech Activity Verb with
all occurring verbs as its values. Table 8.27 shows the results of an inductive
application of the design to the BNC.
There is a clear difference that corroborates Caldas-Coulthard’s casual observa-
tion: the top ten verbs of communication associated with men contain five verbs
conveying a rough, unpleasant and/or aggressive manner of speaking (growl,
grate, rasp, snarl, roar), while those for women only include one (snap, related
to irritability rather than outright aggression). Interestingly, two very general
communication verbs, say and write, are also typical for men’s reported speech.
Women’s speech is introduced by verbs conveying weakness or communicative
subordination (whisper, cry, manage, protest, wail and deny).
The crucial counterexample here would be one like (15b), with an infinitival
complement that expresses knowledge that is “public” rather than “personal/
experiential”; also of interest would be examples with that-clauses that express
personal/experiential knowledge. The corresponding queries are easy enough to
define:
This query follows the specific example in (15b) very narrowly – we could, of
course, define a broader one that would capture, for example, proper names and
noun phrases in addition to pronouns. But remember that we are looking for
counterexamples, and if we can find these with a query following the structure
of supposedly non-acceptable sentences very closely, they will be all the more
convincing.
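The queries are likewise not reproduced in this copy; a narrow pattern modeled on (15b) might look as follows (my sketch, Python assumed – not the original query):

    import re

    # a form of find + personal pronoun + "to be"
    find_to_be = re.compile(
        r"\b(?:find|finds|found|finding)\s+(?:me|you|him|her|it|us|them)"
        r"\s+to\s+be\b", re.IGNORECASE)
    print(bool(find_to_be.search("we found them to be Scottish")))  # True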
The BNC contains not just one, but many counterexamples. Here are some
examples with that-complements expressing subjective, personal knowledge:
(17) a. Erika was surprised to find that she was beginning to like Bach (BNC
A7A)
b. [A]che of loneliness apart, I found that I was stimulated by the chal-
lenge of finding my way about this great and beautiful city. (BNC
AMC)
And here are some with to-complements expressing objective, impersonal knowl-
edge:
(18) a. Li and her coworkers have been able to locate these sequence varia-
tions ... in the three-dimensional structure of the toxin, and found them
to be concentrated in the 𝛽 sheets of domain II. (BNC ALV)
b. The visiting party, who were the first and last ever to get a good look at
the crater of Perboewetan, found it to be about 1000 metres in diameter
and about fifty metres deep. (BNC ASR)
These counterexamples (and others not cited here) in fact give us a new hy-
pothesis as to what the specific semantic contribution of the to-complement may
be: if used to refer to objective knowledge, it overwhelmingly refers to situations
where this objective knowledge was not previously known to the participants
of the situation described. In fact, if we extend our search for counterexamples
beyond the BNC to the world-wide web, we find examples that are even more
parallel to (15b), such as (19), all produced by native speakers of English:
(19) a. Afterwards found that French had stopped ship and found her to be
German masquerading as Greek. (www.hmsneptune.com)
b. I was able to trace back my Grandfathers name ... to Scotland and then
into Sussex and Surrey, England. Very exciting! Reason being is be-
cause we were told that the ancestors were Irish and we found them
to be Scottish! (www.gettheeblowdryer.com)
c. Eishaus – This place is an Erlangen institution, should you ever get to
meet the owner you’ll find him to be American, he runs this wonderful
ice cream shop as his summer job away from his ‘proper’ job in the
states. (theerlangenexpat.wordpress.com)
9 Morphology
We saw in Chapter 8 that the wordform-centeredness of most corpora and cor-
pus-access tools requires a certain degree of ingenuity when studying structures
larger than the word. It does not pose particular problems for corpus-based mor-
phology, which studies structures smaller than the word. Corpus morphology
is mostly concerned with the distribution of affixes, and retrieving all occur-
rences of an affix plausibly starts with the retrieval of all strings potentially
containing this affix. We could retrieve all occurrences of -ness, for example,
with a query like ⟨[word=".+ness(es)?"%c]⟩. The recall of this query will be
close to 100 percent, as all words containing the suffix -ness end in the string
ness, optionally followed by the string es in the case of plurals. Depending on
the tokenization of the corpus, this query might miss cases where the word
containing the suffix -ness is the first part of a hyphenated compound, such as
usefulness-rating or consciousness-altering; we could alter the query to something
like ⟨[word=".+ness(es)?(-.+)?"%c]⟩ if we believe that including these cases
in our sample is crucial. The precision of such a query will not usually be 100
percent, as it will also retrieve words that accidentally happen to end with the
string specified in our query – in the case of -ness, these would be words like
witness, governess or place names like Inverness. The degree of precision will de-
pend on how unique the string in our query is for the affix in question; for -ness
and -ity it is fairly high, as there are only a few words that share the same string
accidentally (examples like those just mentioned for -ness and words like city
and pity for -ity), for a suffix like -ess (‘female animate entity’) it is quite low, as
a query like ⟨[word=".+ess(es)?"%c]⟩ will also retrieve all words with the suf-
fixes -ness and -less, as well as many words whose stem ends in ess, like process,
success, press, access, address, dress, guess and many more.
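In a scripting language, the same logic – a permissive pattern plus a manual stop list – might be sketched like this (Python assumed; the stop list is hypothetical and would be built up while cleaning the concordance):

    import re

    ness = re.compile(r"\w+ness(?:es)?(?:-\w+)?", re.IGNORECASE)
    false_hits = {"witness", "witnesses", "governess", "inverness"}
    tokens = ["usefulness", "witness", "consciousness-altering", "Inverness"]
    print([t for t in tokens
           if ness.fullmatch(t) and t.lower() not in false_hits])
    # ['usefulness', 'consciousness-altering']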
However, once we have extracted and – if necessary – manually cleaned up
our data set, we are faced with a problem that does not present itself when study-
ing lexis or grammar: the very fact that affixes do not occur independently but
always as parts of words, some of which (like wordform-centeredness in the first
sentence of this chapter) have been created productively on the fly for a specific
purpose, while others (like ingenuity in the same sentence) are conventionalized
lexical items that are listed in dictionaries, even though they are theoretically the
result of attaching an affix to a known stem (like ingen-, also found in ingenious
and, confusingly, its almost-antonym ingenuous). We have to keep the difference
between these two kinds of words in mind when constructing morphological re-
search designs; since the two kinds are not always clearly distinguishable, this is
more difficult than it sounds. Also, the fact that affixes always occur as parts of
words has consequences for the way we can, and should, count them; in quantitative corpus linguistics, this is a crucial point, so I will discuss it in quite some
detail before we turn to our case studies.
9.1 Quantifying morphological phenomena
query will retrieve 1 651 908 hits, so it seems that there are 1 651 908 instances of
the s-possessive in the BNC.
However, there is a crucial difference between the two situations: in the case
of the word the, every instance is identical to all others (if we ignore upper and
lower case). This is not the case for the s-possessive. Of course, here, too, many
instances are identical to other instances: there are exact repetitions of proper
names, like King’s Cross (322 hits) or People’s revolutionary party (47), of (parts
of) idiomatic expressions, like arm’s length (216) or heaven’s sake (187) or non-
idiomatic but nevertheless fixed phrases like its present form (107) or child’s best
interest (26), and also of many free combinations of words that recur because
they are simply communicatively useful in many situations, like her head (5105),
his younger brother (112), people’s lives (224) and body’s immune system (29).
This means that there are two different ways to count occurrences of the s-
possessive. First, we could simply count all instances without paying any atten-
tion to whether they recur in identical form or not. When looking at occurrences
of a linguistic item or structure in this way, they are referred to as tokens, so
1 651 908 is the token frequency of the possessive. Second, we could exclude repe-
titions and count only the number of instances that are different from each other,
for example, we would count King’s Cross only the first time we encounter it,
disregarding the other 321 occurrences. When looking at occurrences of linguis-
tic items in this way, they are referred to as types; the type frequency of the
s-possessive in the BNC is 268 450 (again, ignoring upper and lower case). The
type frequency of the, of course, is 1.
Let us look at one more example of the type/token distinction before we move
on. Consider the following famous line from the theme song of the classic televi-
sion series “Mister Ed”:
(1) A horse is a horse, of course, of course
At the word level, it consists of nine tokens (if we ignore punctuation): a, horse,
is, a, horse, of, course, of, and course, but only of five types: a, horse, is, of, and
course. Four of these types occur twice, one (is) occurs only once. At the level of
phrase structure, it consists of seven tokens: the NPs a horse, a horse, course, and
course, the PPs of course and of course, and the VP is a horse, but only of three
types: VP, NP and PP.
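Counting types and tokens is straightforward once a text is tokenized; a minimal sketch in Python:

    words = "a horse is a horse of course of course".split()
    print(len(words))       # 9 tokens
    print(len(set(words)))  # 5 types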
In other words, we can count instances at the level of types or at the level of
tokens. Which of the two levels is relevant in the context of a particular research
design depends both on the kind of phenomenon we are counting and on our
research question. When studying words, we will normally be interested in how
often they are used under a particular condition, so it is their token frequency
that is relevant to us; but we could imagine designs where we are mainly inter-
ested in whether a word occurs at all, in which case all that is relevant is whether
its type frequency is one or zero. When studying grammatical structures, we will
also mainly be interested in how frequently a particular grammatical structure
is used under a certain condition, regardless of the words that fill this structure.
Again, it is the token frequency that is relevant to us. However, note that we can
(to some extent) ignore the specific words filling our structure only because we
are assuming that the structure and the words are, in some meaningful sense,
independent of each other; i.e., that the same words could have been used in a
different structure (say, an of -possessive instead of an s-possessive) or that the
same structure could have been used with different words (e.g. John’s spouse in-
stead of his wife). Recall that in our case studies in Chapter 6 we excluded all
instances where this assumption does not hold (such as proper names and fixed
expressions); since there is no (or very little) choice with these cases, including
them, let alone counting repeated occurrences of them, would have added noth-
ing (we did, of course, include repetitions of free combinations, of which there
were four in our sample: his staff, his mouth, his work and his head occurred twice
each).
Obviously, instances of morphemes (whether inflectional or derivational) can
be counted in the same two ways. Take the following passage from William
Shakespeare’s play Julius Caesar:
(2) CINNA: ... Am I a married man, or a bachelor? Then, to answer every man
directly and briefly, wisely and truly: wisely I say, I am a bachelor.
Let us count the occurrences of the adverbial suffix -ly. There are five word
tokens that contain this suffix (directly, briefly, wisely, truly, and wisely), so its
token frequency is five; however, there are only four types, since wisely occurs
twice, so its type frequency in this passage is four.
Again, whether type or token frequency is the more relevant or useful measure
depends on the research design, but the issue is more complicated than in the
case of words and grammatical structures. Let us begin to address this problem
by looking at the diminutive affixes -icle (as in cubicle, icicle) and mini- (as in
minivan, mini-cassette).
The proper name Pericles is one of the few false hits that the query ⟨[word=".+icles?"%c]⟩
will retrieve). It is more difficult in the case of mini-, since there are words like
minimal, minister, ministry, miniature and others that start with the string mini
but do not contain the prefix mini-. Once we have cleaned up our concordances
(available in the Supplementary Online Material, file LMY7), we will find that -icle
has a token frequency of 20 772 – more than ten times that of mini-, which occurs
only 1702 times. We might thus be tempted to conclude that -icle is much more
important in the English language than mini-, and that, if we are interested in
English diminutives, we should focus on -icle. However, this conclusion would be
misleading, or at least premature, for reasons related to the problems introduced
above.
Recall that affixes do not occur by themselves, but always as parts of words
(this is what makes them affixes in the first place). This means that their token fre-
quency can reflect situations that are both quantitatively and qualitatively very
different. Specifically, a high token frequency of an affix may be due to the fact
that it is used in a small number of very frequent words, or in a large number of
very infrequent words (or something in between). The first case holds for -icle:
the three most frequent words it occurs in (article, vehicle and particle) account
for 19 195 hits (i.e., 92.41 percent of all occurrences). In contrast, the three most
frequent words with mini- (mini-bus, mini-bar and mini-computer) account for
only 557 hits, i.e. 32.73 percent of all occurrences. To get to 92.4 percent, we would
have to include the 253 most frequent words (roughly two thirds of all types).
In other words, the high token frequency of -icle tells us nothing (or at least
very little) about the importance of the affix; if anything, it tells us something
about the importance of some of the words containing it. This is true regardless of
whether we look at its token frequency in the corpus as a whole or under specific
conditions; if its token frequency turned out to be higher under one condition
than under the other, this would point to the association between that condi-
tion and one or more of the words containing the affix, rather than between the
condition and the affix itself.
For example, the token frequency of the suffix -icle is higher in the BROWN
corpus (269 tokens) than in the LOB corpus (225 tokens). However, as Table 9.1
shows, this is simply due to differences in the frequency of individual words – the
words particle and vehicle are substantially more frequent in the BROWN corpus,
and while, conversely, article is more frequent in the LOB corpus, it cannot make
up for the difference. As the 𝜒² components show, the difference in frequency of
some of the individual words is even statistically significant, but nothing follows
from this with respect to the suffix -icle.
Table 9.1: Words ending in -icle in LOB and BROWN
Corpus
Word lob brown Total
article Obs.: 126 Obs.: 99 225
Exp.: 102.48 Exp.: 122.52
𝜒 2: 5.40 𝜒 2: 4.52
particle Obs.: 38 Obs.: 64 102
Exp.: 46.46 Exp.: 55.54
𝜒 2: 1.54 𝜒 2: 1.29
vehicle Obs.: 39 Obs.: 88 127
Exp.: 57.84 Exp.: 69.16
𝜒 2: 6.14 𝜒 2: 5.13
chronicle Obs.: 7 Obs.: 7 14
Exp.: 6.38 Exp.: 7.62
𝜒 2: 0.06 𝜒 2: 0.05
ventricle Obs.: 8 Obs.: 4 12
Exp.: 5.47 Exp.: 6.53
𝜒 2: 1.18 𝜒 2: 0.98
auricle Obs.: 5 Obs.: 0 5
Exp.: 2.28 Exp.: 2.72
𝜒 2: 3.26 𝜒 2: 2.72
fascicle Obs.: 0 Obs.: 3 3
Exp.: 1.37 Exp.: 1.63
𝜒 2: 1.37 𝜒 2: 1.14
testicle Obs.: 0 Obs.: 2 2
Exp.: 0.91 Exp.: 1.09
𝜒 2: 0.91 𝜒 2: 0.76
conventicle Obs.: 1 Obs.: 0 1
Exp.: 0.46 Exp.: 0.54
𝜒 2: 0.65 𝜒 2: 0.54
cuticle Obs.: 1 Obs.: 0 1
Exp.: 0.46 Exp.: 0.54
𝜒 2: 0.65 𝜒 2: 0.54
canticle Obs.: 0 Obs.: 1 1
Exp.: 0.46 Exp.: 0.54
𝜒 2: 0.46 𝜒 2: 0.38
icicle Obs.: 0 Obs.: 1 1
Exp.: 0.46 Exp.: 0.54
𝜒 2: 0.46 𝜒 2: 0.38
Total 225 269 494
Even if all words containing a particular affix were more frequent under one
condition (e.g. in one variety) than under another, this would tell us nothing
certain about the affix itself: while such a difference in frequency could be due to
the affix itself (as in the case of the adverbial suffix -ly, which is disappearing from
American English, but not from British English), it could also be due exclusively
to the words containing the affix.
This is not to say that the token frequencies of affixes can never play a useful
role; they may be of interest, for example, in cases of morphological alternation
(i.e. two suffixes competing for the same stems, such as -ic and -ical in words like
electric/al); here, we may be interested in the quantitative association between
particular stems and one or the other of the affix variants, essentially giving us
a collocation-like research design based on token frequencies. But for most re-
search questions, the distribution of token frequencies under different conditions
is meaningless.
every 670 words. For mini-, the type-token ratio is much higher: it occurs in 382
different words, so its TTR is 382/1702 = 0.2244. In other words, almost a quarter
of all occurrences of mini- are different from each other. Put differently, if we go
through the occurrences of mini- in the BNC word by word, the probability that
the next instance is a new type would be 22.4 percent, so we will encounter a
new type about every four to five hits. The difference in their TTRs suggests that mini-, in its own right, is much more central in the English lexicon than
-icle, even though the latter has a much higher token frequency. Note that this
is a statement only about the affixes; it does not mean that the words containing
mini- are individually or collectively more important than those containing -icle
(on the contrary: words like vehicle, article and particle are arguably much more
important than words like minibus, minicomputer and minibar).
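As a quick check of these figures (a sketch; the type count of 31 for -icle is implied by the one-new-type-per-670-tokens figure above, that of 382 for mini- is given in the text):

    print(round(31 / 20772, 4))   # -icle: 0.0015
    print(round(382 / 1702, 4))   # mini-: 0.2244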
Likewise, observing the type frequency (i.e. the TTR) of an affix under different
conditions provides information about the relationship between these conditions
and the affix itself, albeit one that is mediated by the lexicon: it tells us how
important the suffix in question is for the subparts of the lexicon that are relevant
under those conditions. For example, there are 7 types and 9 tokens for mini- in
the 1991 British FLOB corpus (two tokens each for mini-bus and mini-series and
one each for mini-charter, mini-disc, mini-maestro, mini-roll and mini-submarine),
so the TTR is 7/9 = 0.7778. In contrast, in the 1991 US-American FROWN corpus,
there are 11 types and 12 tokens (two tokens for mini-jack, and one token each
for mini-cavalry, mini-cooper, mini-major, mini-retrospective, mini-version, mini-
boom, mini-camp, mini-grinder, mini-series, and mini-skirt), so the TTR is 11/12 =
0.9167. This suggests that the prefix mini- was more important to the US-English
lexicon than to the British English lexicon in the 1990s, although, of course, the
samples and the difference between them are both rather small, so we would not
want to draw that conclusion without consulting larger corpora and, possibly,
testing for significance first (a point I will return to in the next subsection).
at some point borrowed a large number of words containing it; this is the case
for a number of Romance affixes in English, occurring in words borrowed from
Norman French but never (or very rarely) used to coin new words. An example
is the suffix -ence/-ance occurring in many Latin and French loanwords (such
as appearance, difference, existence, influence, nuisance, providence, resistance, sig-
nificance, vigilance, etc.), but only in a handful of words formed in English (e.g.
abidance, forbearance, furtherance, hindrance, and riddance).
In order to determine the productivity (and thus the current importance) of
affixes at a particular point in time, Harald Baayen (cf. e.g. Baayen 2009 for an
overview) has suggested that we should focus on types that only occur once in
the corpus, so-called hapax legomena (Greek for ‘said once’). The assumption is
that productive uses of an affix (or other linguistic rule) should result in one-off
coinages (some of which may subsequently spread through the speech commu-
nity while others will not).
Of course, not all hapax legomena are the result of productive rule-application:
the words wordform-centeredness and ingenuity that I used in the first sentence of
this chapter are both hapax legomena in this book (or would be, if I did not keep
mentioning them). However, wordform-centeredness is a word I coined produc-
tively and which is (at the time of writing) not documented anywhere outside of
this book; in fact, the sole reason I coined it was in order to use it as an example
of a hapax legomenon later. In contrast, ingenuity has been part of the English
language for more than four hundred years (the OED first records it in 1598); it
occurs only once in this book for the simple reason that I only needed it once
(or pretended to need it, to have another example of a hapax legomenon). So a
word may be a hapax legomenon because it is a productive coinage, or because it
is infrequently needed (in larger corpora, the category of hapaxes typically also
contains misspelled or incorrectly tokenized words which will have to be cleaned
up manualy – for example, the token manualy is a hapax legomenon in this book
because I just misspelled it intentionally, but the word manually occurs dozens
of times in this book).
Baayen’s idea is, quite straightforwardly, to use the phenomenon of the hapax
legomenon as an operationalization of the construct “productive application of a
rule”, in the hope that the correlation between the two notions (in a large enough
corpus) will be substantial enough for this operationalization to make sense.¹
¹ Note also that the productive application of a suffix does not necessarily result in a hapax
legomenon: two or more speakers may arrive at the same coinage, or a single speaker may like
their own coinage so much that they use it again; some researchers therefore suggest that we
should also pay attention to “dis legomena” (words occurring twice) or even “tris legomena”
(words occurring three times). We will stick with the mainstream here and use only hapax
legomena.
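
Counting hapax legomena is equally simple: they are the types with a token
frequency of exactly one. A minimal sketch, again in Python and with a function
name of my choosing:

    from collections import Counter

    def hapax_token_ratio(tokens):
        """Share of tokens in the sample accounted for by types that
        occur exactly once (hapax legomena)."""
        counts = Counter(tokens)
        hapaxes = [t for t, n in counts.items() if n == 1]
        return len(hapaxes) / len(tokens)
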
                                   Affix
                         -icle          mini-         Total
Type   new                  31            382           413
                       (381.72)        (31.28)
       seen before      20 741          1 320        22 061
                    (20 390.28)     (1 670.72)
Total                   20 772          1 702        22 474
However, while the logic behind this procedure may seem plausible in theory
both for HTRs and for TTRs, in practice, matters are much more complicated. The
reason for this is that, as mentioned above, type-token ratios and hapax-token
ratios are dependent on sample size.
In order to understand why and how this is the case and how to deal with it,
let us leave the domain of morphology for a moment and look at the relationship
between tokens and types or hapax legomena in texts. Consider the opening
sentences of Jane Austen’s novel Pride and Prejudice (the novel is freely available
from Project Gutenberg and in the Supplementary Online Material, file TXQP):
All words without a subscript are new types and hapax legomena at the point
at which they appear in the text; if a word has a subscript, it is a repetition
of a previously mentioned word, and the subscript indicates its token frequency
at this point in the text. The first repetition of a word is additionally marked
by a subscript reading -1, indicating that it ceases to be a hapax legomenon at
this point, decreasing the overall count of hapaxes by one.
As we move through the text word by word, initially all words are new types
and hapaxes, so the type- and hapax-counts rise at the same rate as the token
counts. However, it only takes eight tokens before we reach the first repetition
(the word a), so while the token frequency rises to 8, the type count remains
constant at seven and the hapax count falls to six. Six words later, there is an-
other occurrence of a, so type and hapax counts remain, respectively, at 12 and
11 as the token count rises to 14, and so on. In other words, while the number of
types and the number of hapaxes generally increase as the number of tokens in
a sample increases, they do not increase at a steady rate. The more types have
already occurred, the more types there are to be reused (put simply, speakers will
encounter fewer and fewer communicative situations that require a new type),
which makes it less and less probable that new types (including new hapaxes)
will occur. Figure 9.1 shows how type and hapax counts develop in the first 100
words of Pride and Prejudice (on the left) and in the whole novel (on the right).
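
The growth curves in Figure 9.1 can be computed in a single pass through the
text, updating the type and hapax counts after every token. The following
Python sketch (mine, assuming the text is already tokenized and lower-cased)
returns the data points behind such a plot:

    def growth_curves(tokens):
        """Running (token count, type count, hapax count) after each token."""
        counts, types, hapaxes, curve = {}, 0, 0, []
        for i, tok in enumerate(tokens, start=1):
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:    # a new type, and (for now) a hapax
                types += 1
                hapaxes += 1
            elif counts[tok] == 2:  # second occurrence: no longer a hapax
                hapaxes -= 1
            curve.append((i, types, hapaxes))
        return curve
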
[Figure: two panels plotting type and hapax counts against token counts, for
the first 100 words (left, case-insensitive) and for the entire novel (right)]
Figure 9.1: TTR and HTR in Jane Austen’s Pride and Prejudice
As we can see by looking at the first 100 words, type and hapax counts fall
below the token counts fairly quickly: after 20 tokens, the TTR is 18/20 = 0.9 and
the HTR is 17/20 = 0.85, after 40 tokens the TTR is 31/40 = 0.775 and the HTR is
26/40 = 0.65, after 60 tokens the TTR is 42/60 = 0.7 and the HTR is 33/60 = 0.55,
and so on (note also how the hapax-token ratio sometimes drops before it rises
again, as words that were hapaxes up to a particular point in the text reoccur
and cease to be counted as hapaxes). If we zoom out and look at the entire novel,
we see that the growth in hapaxes slows considerably, to the extent that it has
almost stopped by the time we reach the end of the novel. The growth in types
also slows, although not as much as in the case of the hapaxes. In both cases this
means that the ratios will continue to fall as the number of tokens increases.
Now imagine we wanted to use the TTR and the HTR as measures of Jane
Austen’s overall lexical productivity (referred to as “lexical richness” in
computational linguistics).
                                Text Sample
                     first chapter    ¬ first chapter        Total
Type   new                 321              6 829            7 150
                         (47.29)         (7 102.71)
       ¬ new               528            120 679          121 207
                        (801.71)       (120 405.29)
Total                      849            127 508          128 357
The TTR for the first chapter is an impressive 0.3781, that for the rest of the
novel is a measly 0.0566, and the difference is highly significant (χ² = 1688.7,
df = 1, p < 0.001, φ = 0.1147). But this is not because there is anything special about
the first chapter; the TTR for the second chapter is 0.3910, that for the third is
0.3457, that for chapter 4 is 0.3943, and so on. The reason why the first chapter
(or any chapter) looks as though it has a significantly higher TTR than the novel
as a whole is simply because the TTR will drop as the size of the text increases.
Therefore, comparing TTRs derived from samples of different sizes will always
make the smaller sample look more productive. In other words, we cannot com-
pare such TTRs, let alone evaluate the differences statistically – the result will
simply be meaningless. The same is true for HTRs, with the added problem that,
under certain circumstances, it will decrease at some point as we keep increasing
the sample size: at some point, all possible words will have been used, so unless
new words are added to the language, the number of hapaxes will shrink again
and finally drop to zero when all existing types have been used at least twice.
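
For illustration only (the point of this passage is precisely that the
comparison is meaningless), the test behind the figures just cited can be
reproduced with scipy; the contingency table is the one shown above:

    from scipy.stats import chi2_contingency

    # new vs. previously seen types, first chapter vs. rest of the novel
    table = [[321,   6829],
             [528, 120679]]
    chi2, p, df, expected = chi2_contingency(table, correction=False)
    print(chi2, df, p)  # ca. 1688.7, 1, p < 0.001 (but see the caveat above)
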
We will encounter the same problem when we compare the TTR or HTR of
particular affixes or other linguistic phenomena, rather than that of a text. Con-
sider Figures 9.2a and 9.2b, which show the TTR and the HTR of the verb suffixes
-ise/-ize (occurring in words like realize, maximize or liquidize) and -ify (occur-
ring in words like identify, intensify or liquify).
As we can see, the TTR and HTR of both affixes behave roughly like that of
Jane Austen’s vocabulary as a whole as we increase sample size: both of them
[Figure: two panels plotting (a) types and (b) hapax legomena against tokens
for -ise/-ize and -ify in the LOB corpus]
Figure 9.2: (a) TTRs and (b) HTRs for -ise/-ize and -ify in the LOB corpus
grow fairly quickly at first before their growth slows down; the latter happens
more quickly in the case of the HTR than in the case of the TTR, and, again, we
observe that the HTR sometimes decreases as types that were hapaxes up to a
particular point in the sample reoccur and cease to be hapaxes.
Taking into account the entire sample, the TTR for -ise/-ize is 105/834 = 0.1259
and that for -ify is 49/356 = 0.1376; it seems that -ify is slightly more important
to the lexicon of English than -ise/-ize. A χ² test suggests that the difference is
not significant (cf. Table 9.4; χ² = 0.3053, df = 1, p > 0.05).
Table 9.4: Type/token ratios of -ise/-ize and -ify (LOB)

                               Type
                    new     seen before     Total
Affix  -ise/-ize      105         729          834
                  (107.93)     (726.07)
       -ify            49         307          356
                   (46.07)     (309.93)
Total                 154       1 036        1 190
Likewise, taking into account the entire sample, the HTR for -ise/-ize is 47/834 =
0.0563 and that for -ify is 17/356 = 0.0477; it seems that -ise/-ize is slightly more
productive than -ify. However, again, the difference is not significant (cf. Table 9.5;
χ² = 0.3628, df = 1, p > 0.05).
Table 9.5: Hapax-token ratios of -ise/-ize and -ify (LOB)

                               Type
                    hapax      ¬ hapax      Total
Affix  -ise/-ize       47          787         834
                   (44.85)     (789.15)
       -ify            17          339         356
                   (19.15)     (336.85)
Total                  64        1 126       1 190
However, note that -ify has a token frequency that is less than half of that of
-ise/-ize, so the sample is much smaller: as in the example of lexical richness in
Pride and Prejudice, this means that the TTR and the HTR of this smaller sample
are exaggerated and our comparisons in Tables 9.4 and 9.5 as well as the accom-
panying statistics are, in fact, completely meaningless.
The simplest way of solving the problem of different sample sizes is to create
samples of equal size for the purposes of comparison. We simply take the size of
the smaller of our two samples and draw a random sample of the same size from
the larger of the two samples (if our data sets are large enough, it would be even
better to draw random samples for both affixes). This means that we lose some
data, but there is nothing we can do about this (note that we can still include the
discarded data in a qualitative description of the affix in question).²
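
A minimal sketch of such downsampling in Python, using simple random sampling
rather than the systematic deletion procedure described below; the function
name and the fixed seed (added for replicability) are my choices:

    import random

    def equalize(sample_a, sample_b, seed=42):
        """Downsample both token samples to the size of the smaller one,
        so that their TTRs and HTRs become comparable."""
        random.seed(seed)
        n = min(len(sample_a), len(sample_b))
        return random.sample(sample_a, n), random.sample(sample_b, n)
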
Figures 9.3a and 9.3b show the growth rates of the TTR and the HTR of a sub-
sample of 356 tokens of -ise/-ize in comparison with the total sample of the same
size for -ify (the sample was derived by first deleting every second hit, then every
seventh hit and finally every ninetieth hit, making sure that the remaining hits
are spread throughout the corpus).
The TTR of -ise/-ize based on the random sub-sample is 78/356 = 0.2191, that
of -ify is still 49/356 = 0.1376; the difference between the two suffixes is much
clearer now, and a χ² test shows that it is very significant, although the effect
size is weak (cf. Table 9.6; χ² = 8.06, df = 1, p < 0.01, φ = 0.1064).
² In studies of lexical richness, a measure called Mean Segmental Type-Token Ratio (MSTTR) is
sometimes used (cf. Johnson 1944). This measure is derived by dividing the texts under inves-
tigation into segments of equal size (often segments of 100 words), determining the TTR for
each segment, and then calculating an average TTR. This allows us to compare the TTR of
texts of different sizes without discarding any data. However, this method is not applicable to
the investigation of morphological productivity, as most samples of 100 words (or even 1000
or 10 000 words) will typically not contain enough cases of a given morpheme to determine a
meaningful TTR.
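
The MSTTR calculation described in the footnote can be sketched as follows
(assuming a tokenized text of at least one full segment; the incomplete final
segment is discarded):

    def msttr(tokens, segment_size=100):
        """Mean Segmental Type-Token Ratio (cf. Johnson 1944): the average
        TTR over consecutive segments of equal size."""
        segments = [tokens[i:i + segment_size]
                    for i in range(0, len(tokens) - segment_size + 1,
                                   segment_size)]
        return sum(len(set(s)) / len(s) for s in segments) / len(segments)
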
[Figure: two panels plotting (a) types and (b) hapax legomena against tokens
for -ise/-ize and -ify in the 356-token samples from the LOB corpus]
Figure 9.3: (a) TTRs and (b) HTRs for -ise/-ize and -ify in the LOB corpus
(equal-sized samples)
Table 9.6: Types with -ise/-ize and -ify in equal-sized samples (LOB)

                               Type
                    new     seen before     Total
Affix  -ise/-ize       78         278          356
                   (63.50)     (292.50)
       -ify            49         307          356
                   (63.50)     (292.50)
Total                 127         585          712

Table 9.7: Hapax legomena with -ise/-ize and -ify in equal-sized samples (LOB)

                               Type
                    hapax      ¬ hapax      Total
Affix  -ise/-ize       41          315         356
                   (29.00)     (327.00)
       -ify            17          339         356
                   (29.00)     (327.00)
Total                  58          654         712
Likewise, the HTR of -ise/-ize based on our sub-sample is 41/356 = 0.1152, the
HTR of -ify remains 17/356 = 0.0477. Again, the difference is much clearer, and
it, too, is now very significant, again with a weak effect size (cf. Table 9.7;
χ² = 10.81, df = 1, p < 0.01, φ = 0.1232).
In the case of the HTR, decreasing the sample size is slightly more problematic
than in the case of the TTR. The proportion of hapax legomena actually resulting
from productive rule application becomes smaller as sample size decreases. Take
example (2) from Shakespeare’s Julius Caesar above: the words directly, briefly
and truly are all hapaxes in the passage cited, but they are clearly not the result
of productive rule-application (all of them have their own entries
in the OALD, for example). As we increase the sample, they cease to be hapaxes
(directly occurs 9 times in the entire play, briefly occurs 4 times and truly 8 times).
This means that while we must draw random samples of equal size in order to
compare HTRs, we should make sure that these samples are as large as possible.
which form process verbs from nominal and adjectival bases, or -ic and -ical,
which form adjectives from typically nominal bases). Here, too, corpus linguis-
tics provides useful tools, for example to determine whether the choice between
affixes is influenced by syntactic, semantic or phonological properties of stems.
³ This is a simplification: stress-shift only occurs with unstressed closed syllables or sequences
of two unstressed syllables (SYLlable – sylLAbify). Occasionally, stem-final consonants are
deleted (as in liquid – liquify); cf. Plag (1999) for a more detailed discussion.
dependent variables are Syllabicity with the values monosyllabic and poly-
syllabic, and Stress Shift with the values required vs. not required (both
of which should be self-explanatory).
Our design compares two predefined groups of types with respect to the dis-
tribution that particular properties have in these groups; this means that we do
not need to calculate TTRs or HTRs, but that we need operational definitions
of the values established word and neologism. Following Plag, let us define
neologism as “coined in the 20th century”, but let us use a large historical dictio-
nary (the Oxford English Dictionary, 3rd edition) and a large corpus (the BNC)
in order to identify words matching this definition; this will give us the opportu-
nity to evaluate the idea that hapax legomena are a good way of operationalizing
productivity.
Excluding cases with prefixed stems, the OED contains 456 entries or sub-en-
tries for verbs with -ify, 31 of which are first documented in the 20th century. Of
the latter, 21 do not occur in the BNC at all, and 10 do occur in the BNC, but are
not hapaxes (see Table 9.8 below). The BNC contains 30 hapaxes, of which 13 are
spelling errors and 7 are first documented in the OED before the 20th century
(carbonify, churchify, hornify, preachify, saponify, solemnify, townify). This leaves
10 hapaxes that are plausibly regarded as neologisms, none of which are listed
in the OED (again, see Table 9.8). In addition, there are four types in the BNC
that are not hapax legomena, but that are not listed in the OED; careful cross-
checks show that these are also neologisms. Combining all sources, this gives us
45 neologisms.
Before we turn to the definition and sampling of established types, let us deter-
mine the precision and recall of the operational definition of neologism as “hapax
legomenon in the BNC”, using the formulas introduced in Chapter 4. Precision is
defined as the number of true positives (items that were found and that actually
are what they are supposed to be) divided by the number of all positives (all items
found); 10 of the 30 hapaxes in the BNC are actually neologisms, so the precision
is 10/30 = 0.3333. Recall is defined as the number of true positives divided by the
number of true positives and false negatives (i.e. all items that should have been
found); 10 of the 45 neologisms were actually found by using the hapax defini-
tion, so the recall is 10/45 = 0.2222. In other words, neither precision nor recall of
the method are very good, at least for moderately productive affixes like -ify (the
method will presumably give better results with highly productive affixes). Let
us also determine the recall of neologisms from the OED (using the definition
“first documented in the 20th century according to the OED”): the OED lists 31
of the 45 neologisms, so the recall is 31/45 = 0.6889; this is much better than the
recall of the corpus-based hapax definition, but it also shows that if we combine
corpus data and dictionary data, we can increase coverage substantially even for
moderately productive affixes.
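
In code, the two measures are one-liners; the following sketch (a generic
set-based formulation of my own) recalculates the values just given:

    def precision_recall(found, relevant):
        """Precision and recall of a set of retrieved items against a
        gold-standard set."""
        found, relevant = set(found), set(relevant)
        tp = len(found & relevant)  # true positives
        return tp / len(found), tp / len(relevant)

    # the figures from the text: 30 BNC hapaxes, 10 of them true
    # neologisms, 45 neologisms overall
    precision = 10 / 30  # 0.3333
    recall = 10 / 45     # 0.2222
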
Let us now turn to the definition of established types. Given our definition of
neologisms, established types would first have to be documented before the 20th
century, so we could use the 420 types in the OED that meet this criterion (again,
excluding prefixed forms). However, these 420 types contain many very rare or
even obsolete forms, like duplify ‘to make double’, eaglify ‘to make into an eagle’
or naucify ‘to hold in low esteem’. Clearly, these are not “established” in any
meaningful sense, so let us add the requirement that a type must occur in the BNC
at least twice to count as established. Let us further limit the category to verbs
first documented before the 19th century, in order to leave a clear diachronic gap
between the established types and the productive types. This leaves the words
in Table 9.9.⁴
Let us now evaluate the hypotheses. Table 9.10 shows the type frequencies for
monosyllabic and polysyllabic stems in the two samples. In both cases, there is
⁴ Interestingly, leaving out words coined in the 19th century does not make much of a difference:
although the 19th century saw a large number of coinages (with 138 new types it was the most
productive century in the history of the suffix), few of these are frequent enough today to occur
in the BNC; if anything, we should actually extend our definition of neologisms to include the
19th century.
a preference for monosyllabic stems (as expected), but interestingly, this prefer-
ence is less strong among the neologisms than among the established types and
this difference is very significant (χ² = 7.37, df = 1, p < 0.01, φ = 0.2577).
Table 9.9: Control sample of established types with the suffix -ify. [table
body lost]

Table 9.10: Monosyllabic and polysyllabic stems among established types and
neologisms with -ify

                             Number of Syllables
                      monosyllabic    polysyllabic    Total
Status  established         57                9          66
                        (51.14)          (14.86)
        neologism           29               16          45
                        (34.86)          (10.14)
Total                       86               25         111
Given the fact that there is a significantly higher number of neologisms with
polysyllabic stems than expected on the basis of established types, the second
hypothesis becomes more interesting: does this higher number of polysyllabic
stems correspond with a greater willingness to apply the suffix to stems that then have to
undergo stress shift (which would be contrary to our hypothesis, which assumes
that there will be no difference between established types and neologisms)?
Table 9.11 shows the relevant data: it seems that there might indeed be such
a greater willingness, as the number of neologisms with polysyllabic stems re-
quiring stress shift is higher than expected; however, the difference is not statis-
tically significant (χ² = 1.96, df = 1, p > 0.05, φ = 0.28) (strictly speaking, we
cannot use the χ² test here, since half of the expected frequencies are below 5,
but Fisher’s exact test confirms that the difference is not significant).
Table 9.11: Stress shift with polysyllabic stems with -ify

                              Shift
                   not required    required    Total
Status  established       3             6          9
                       (4.68)        (4.32)
        neologism        10             6         16
                       (8.32)        (7.68)
Total                    13            12         25
This case study demonstrates some of the problems and advantages of using
corpora to identify neologisms in addition to existing dictionaries. It also consti-
tutes an example of a purely type-based research design; note, again, that such a
design is possible here because we are not interested in the type frequency of a
particular affix under different conditions (in which case we would have to cal-
culate a TTR to adjust for different sample sizes), but in the distribution of the
variables Syllabicity and Stress Shift in two qualitatively different cat-
egories of types. Finally, note that the study comes to different conclusions than
the impressionistic analysis in Plag (1999), so it demonstrates the advantages of
strictly quantified designs.
(6) electric
a. connected with electricity; using, produced by or producing electricity
(OALD)
b. of or relating to electricity; operated by electricity (MW)
c. working by electricity; used for carrying electricity; relating to elec-
tricity (MD)
d. of, produced by, or worked by electricity (CALD)
e. needing electricity to work, produced by electricity, or used for carry-
ing electricity (LDCE)
f. work[ing] by means of electricity; produced by electricity; designed to
carry electricity; refer[ring] to the supply of electricity (Cobuild)
(7) electrical
a. connected with electricity; using or producing electricity (OALD)
b. of or relating to electricity; operated by electricity (MW) [mentioned
as a synonym under corresp. sense of electric]
c. working by electricity; relating to electricity (MD)
d. related to electricity (CALD, LDCE)
e. work[ing] by means of electricity; supply[ing] or us[ing] electricity;
energy ... in the form of electricity; involved in the production and
supply of electricity or electrical goods (Cobuild)
Table 9.12: electric and electrical in the LOB corpus

device (Total 30)
    electric (Obs. 17, Exp. 12.08, χ² 2.01): bulb, calculating machine,
        chair, cooker, dog, drill, fence, fire, heating element, light
        switch, motor, mowing, stove, torch, tricycle
    electrical (Obs. 13, Exp. 17.92, χ² 1.35): amplifier, apparatus, fire,
        goods, machine, machinery, power unit, sign, supply system, system,
        transmission
energy (Total 24)
    electric (Obs. 11, Exp. 9.66, χ² 0.19): attraction, bill, blue,
        current, effect, field, force, light, space constant
    electrical (Obs. 13, Exp. 14.34, χ² 0.12): accident, activity,
        condition, load, output, phenomenon, property, resistance
[rows for the remaining noun categories lost]
Total: electric 31, electrical 46, overall 77
respect to their preferences for these categories. Since we are interested in the
nature of this difference, it is much more insightful to look at the 𝜒 2 components
individually. This gives us a better idea where the overall significant difference
comes from. In this case, it comes almost exclusively from the fact that electri-
cal is indeed associated with the research and supply of electricity (industry),
although there is a slight preference for electric with nouns referring to devices.
Generally, the two words seem to be relatively synonymous, at least in 1960s
British English.
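
Inspecting the χ² components means looking at each cell’s contribution
(O − E)²/E separately. A generic Python sketch for a contingency table given
as a list of rows (my own formulation, not a particular package’s API):

    def chi_square_components(observed):
        """Per-cell chi-square contributions (O - E)^2 / E, which show
        where an overall significant difference comes from."""
        row_totals = [sum(row) for row in observed]
        col_totals = [sum(col) for col in zip(*observed)]
        n = sum(row_totals)
        return [[(o - row_totals[i] * col_totals[j] / n) ** 2
                 / (row_totals[i] * col_totals[j] / n)
                 for j, o in enumerate(row)]
                for i, row in enumerate(observed)]
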
Let us repeat the study with the BROWN corpus. Table 9.13 lists the token
frequencies for the individual categories and, again, all types found for each cat-
egory.
Again, the overall difference between the two words is significant and the ef-
fect is slightly stronger than in the LOB corpus (χ² = 22.83, df = 3, p < 0.001,
φ = 0.3413), suggesting a stronger differentiation between them. Again, the most in-
teresting question is where the effect comes from. In this case, devices are much
more frequently referred to as electric and less frequently as electrical than ex-
pected, and, as in the LOB corpus, the nouns in the category industry are more
frequently referred to as electrical and less frequently as electric than expected
(although not significantly so). Again, there is no clear difference with respect to
the remaining two categories.
Broadly speaking, then, one of our expectations is borne out by the British
English data and one by the American English data. We would now have to look
at larger corpora to see whether this is an actual difference between the two
varieties or whether it is an accidental feature of the corpora used here. We might
also want to look at more modern corpora – the importance of electricity in our
daily lives has changed quite drastically even since the 1960s, so the words may
have specialized semantically more clearly in the meantime. Finally, we would
look more closely at the categories we have used, to see whether a different or
a more fine-grained categorization might reveal additional insights (Kaunisto
(1999) goes on to look at his categories in more detail, revealing more fine-grained
differences between the words).
Of course, this kind of investigation can also be designed as an inductive study
of differential collocates (again, like the study of synonyms such as high and
tall). Let us look at the nominal collocates of electric and electrical in the BNC.
Table 9.14 shows the results of a differential-collocate analysis, calculated on the
basis of all occurrences of electric/al in the BNC that are directly followed by a
noun.
Table 9.13: electric and electrical in the BROWN corpus

device (Total 32)
    electric (Obs. 29, Exp. 18.29, χ² 6.28): amplifier, blanket, bug,
        computer, drive, gadget, hand tool, hand-blower, heater, horn,
        icebox, lantern, model, range, razor, refrigerator, signs, spit,
        toothbrush
    electrical (Obs. 3, Exp. 13.71, χ² 8.37): control, display, torquers
energy (Total 32)
    electric (Obs. 15, Exp. 18.29, χ² 0.59): arc, current, discharge,
        power, rate, shock, universe, utility rate
    electrical (Obs. 17, Exp. 13.71, χ² 0.79): body, characteristic,
        charges, distribution, conditional, energy, force, form, power,
        shock, signal, stimulation
circuit (Total 13)
    electric (Obs. 4, Exp. 7.43, χ² 1.58): circuit, light plant, power
        plant
    electrical (Obs. 9, Exp. 5.57, χ² 2.11): contact line, outlet,
        pickoff, wire, wiring
[row for the category industry lost]
Total: electric 56, electrical 42, overall 98
The results largely agree with the preferences also uncovered by the more
careful (and more time-consuming) categorization of a complete data set, with
one crucial difference: there are members of the category device among the sig-
nificant differential collocates of both variants. A closer look reveals a systematic
difference within this category: the device collocates of electric refer to specific
devices (such as light, guitar, kettle, etc.); in contrast, the device collocates
of electrical refer to general classes of devices (equipment, appliance, system). This
difference was not discernible in the LOB and BROWN datasets (presumably be-
cause they were too small), but it is discernible in the data set used by Kaunisto
(1999), who posits corresponding subcategories. Of course, the BNC is a much
more recent corpus than LOB and BROWN, so, again, a diachronic comparison
would be interesting.
There is an additional pattern that would warrant further investigation: there
are collocates for both variants that correspond to what some of the dictionaries
we consulted refer to as ‘produced by electricity’: shock, field and fire for electric and
signal, energy, impulse for electrical. It is possible that electric more specifically
characterizes phenomena that are caused by electricity, while electrical charac-
terizes phenomena that manifest electricity.
The case study demonstrates, then, that a differential-collocate analysis is a
good alternative to the manual categorization and category-wise comparison of
all collocates: it allows us to process very large data sets very quickly and then
focus on the semantic properties of those collocates that are shown by the statis-
tical analysis to differentiate between the variants.
We must keep in mind, however, that this kind of study does not primarily
uncover differences between affixes, but differences between specific word pairs
containing these affixes. They are, as pointed out above, essentially lexical stud-
ies of near-synonymy. Of course, it is possible that by performing such analyses
for a large number of word pairs containing a particular affix pair, general se-
mantic differences may emerge, but since we are frequently dealing with highly
lexicalized forms, there is no guarantee of this. Gries (2001; 2003b) has shown
that -ic/-ical pairs differ substantially in the extent to which they are synony-
mous; for example, he finds substantial differences in meaning for politic/political
or poetic/poetical, but much smaller differences, for example, for bibliographic/
bibliographical, with electric/electrical somewhere in the middle. Obviously, the
two variants have lexicalized independently in many cases, and the specific dif-
ferences in meaning resulting from this lexicalization process are unlikely to fall
into clear general categories.
But let us start more humbly by laying the empirical foundations for such
discussions and test the observation by Lindsay (2011) and Lindsay & Aronoff
(2013). The authors themselves do so by comparing the ratio of the two suffixes
for stems that occur with both suffixes. Let us return to their approach later, and
start by looking at the overall distribution of stems with -ic and -ical.
First, though, let us see what we can find out by looking at the overall dis-
tribution of types, using the four-million-word BNC Baby. Once we remove all
prefixes and standardize the spelling, there are 846 types for the two suffixes.
There is a clear overall preference for -ic (659 types) over -ical (187 types) (inci-
dentally, there are only 54 stems that occur with both suffixes). For stems with
-(o)log-, the picture is drastically different: there is an overwhelming preference
for -ical (55 types) over -ic (3 types). We can evaluate this difference statistically,
as shown in Table 9.17.
Table 9.17: Preference of stems with -olog- for -ic and -ical

                                 Suffix Variant
                            -ic        -ical      Total
Stem Type  with -olog-         3          55          58
                          (45.18)     (12.82)
           without -olog-    656         132         788
                         (613.82)    (174.18)
Total                        659         187         846
Unsurprisingly, the difference between stems with and without the affix -olog-
is highly significant (χ² = 191.27, df = 1, p < 0.001) – stems with -olog- clearly
favor the variant -ical against the general trend.
As mentioned above, this could be due specifically to the affix -olog-, but it
could also be a general preference of derived stems for -ical. In order to determine
this, we have to look at derived stems with other affixes. There are a number of
other affixes that occur frequently enough to make them potentially interesting,
such as -ist-, as in statistic(al) (-ic vs. -ical: 74/2), -graph-, as in geographic(al) (19/8),
or -et-, as in arithmetic(al) (32/9). Note that all of them have more types with -ic,
which suggests that derived stems in general, possibly due to their length, prefer
-ic and that -olog- really is an exception.
But there is a methodological issue that we have to address before we can
really conclude this. Note that we have been talking of a “preference” of particular
stems for one or the other suffix, but this is somewhat imprecise: we looked at the
total number of stem types with -ic and -ical with and without additional suffixes.
While differences in number are plausibly attributed to preferences, they may
also be purely historical leftovers due to the specific history of the two suffixes
(which is rather complex, involving borrowing from Latin, Greek and, in the
case of -ical, French). More convincing evidence for a productive difference in
preferences would come from stems that take both -ic and -ical (such as electric/al,
symmetric/al or numeric/al, to take three examples that display a relatively even
distribution between the two): for these stems, there is obviously a choice, and
we can investigate the influence of additional affixes on that choice.
Lindsay (2011) and Lindsay & Aronoff (2013) focus on precisely these stems,
checking for each one whether it occurs more frequently with -ic or with -ical
and calculating the preference ratio mentioned above. They then compare the
ratio of all stems to that of stems with -olog- (see Table 9.18).
Table 9.18: Stems favoring -ic or -ical in the COCA (Lindsay 2011: 194)
[table body lost; only the column totals, 1465 and 86, survive]
The ratios themselves are difficult to compare statistically, but they are clearly
the right way of measuring the preference of stems for a particular suffix. So let
us take Lindsay and Aronoff’s approach one step further: Instead of calculating
the overall preference of a particular type of stem and comparing it to the overall
preference of all stems, let us calculate the preference for each stem individually.
This will give us a preference measure for each stem that is, at the very least,
ordinal.⁵ We can then rank stems containing a particular affix and stems not
containing that affix (or containing a specific different affix) by their preference
for one or the other of the suffix variants and use the Mann-Whitney U-test to
determine whether the stems with -olog- tend to occur towards the -ical end of
⁵ In fact, measures derived in this way are cardinal data, as the value can range from 0 to 1 with
every possible value in between; it is safer to treat them as ordinal data, however, because we
don’t know whether such preference values are normally distributed. In fact, since they are
based on word frequency data, which we know not to be normally distributed, it is a fair guess
that the preference data are not normally distributed.
the ranking. That way we can treat preference as the matter of degree that it
actually is, rather than as an absolute property of stems.
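
Once each stem has been assigned a preference ratio, the test itself is a
single call in scipy; the ratio values below are invented placeholders standing
in for the values in Table 9.19:

    from scipy.stats import mannwhitneyu

    # hypothetical preference ratios freq(-ic) / (freq(-ic) + freq(-ical))
    ist_stems = [0.97, 0.95, 0.91, 0.88, 0.85, 0.80, 0.74, 0.70]   # -ist-
    olog_stems = [0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.02, 0.00]  # -olog-
    u, p = mannwhitneyu(ist_stems, olog_stems, alternative="two-sided")
    print(u, p)
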
The BNC Baby does not contain enough derived stems that occur with both
suffix variants, so let us focus on two specific suffixes and extract the relevant
data from the full BNC. Since -ist is roughly equal to -olog- in terms of type fre-
quency, let us choose this suffix for comparison. Table 9.19 shows the 34 stems
containing either of these suffixes, their frequency of occurrence with the vari-
ants -ic and -ical, the preference ratio for -ic, and the rank.
The different preferences of stems with -ist and -olog- are very obvious even
from a purely visual inspection of the table: stems with the former occur at the
top of the ranking, stems with the latter occur at the bottom and there is almost
no overlap. This is reflected clearly in the median ranks of the two stem types:
the median for -ist is 4.5 (N = 8, rank sum = 38), the median for -olog- is 21.5
(N = 26, rank sum = 557). A Mann-Whitney U-test shows that this difference is
highly significant (U = 2, N1 = 8, N2 = 26, p < 0.001).
Now that we have established that different suffixes may, indeed, display dif-
ferent preferences for other suffixes (or suffix variants), we could begin to answer
the question why this might be the case. In this instance, the explanation is likely
found in the complicated history of borrowings containing the suffixes in ques-
tion. The point of this case study was not to provide such an explanation but to
show how an empirical basis can be provided using token frequencies derived
from linguistic corpora.
Table 9.19: Preferences of stems containing -ist and -olog- for the suffix
variants -ic and -ical (BNC) [table body lost]
detailed analysis than previous studies in that he looks at the prevalence of dif-
ferent stem types for each affix, so that qualitative as well as quantitative differ-
ences in productivity could, in theory, be studied. In practice, unfortunately, the
study offers preliminary insights at best, as it is based entirely on token frequen-
cies, which, as discussed in Section 9.1 above, do not tell us anything at all about
productivity.
We will therefore look at a question inspired by Guz’s study, and use the TTR
and the HTR to study the relative importance and productivity of the nominal-
izing suffix -ship (as in friendship, lordship, etc.) in newspaper language and in
prose fiction. The suffix -ship is known to have a very limited productivity, and
our hypothesis (for the sake of the argument) will be that it is more productive
in prose fiction, since authors of fiction are under pressure to use language cre-
atively (this is not Guz’s hypothesis; his study is entirely explorative).
The suffix has a relatively high token frequency: there are 2862 tokens in the
fiction section of the BNC, and 7189 tokens in the newspaper section (including
all sub-genres of newspaper language, such as reportage, editorial, etc.) (the data
are provided in the Supplementary Online Material, file LAF3). This difference
is not due to the respective sample sizes: the fiction section in the BNC is much
larger than the newspaper section; thus, the difference in token frequency would
suggest that the suffix is more important in newspaper language than in fiction.
However, as extensively discussed in Section 9.1.1, such statements cannot be
based on token frequency alone. Instead, we need to look at the type-token ratio
and the hapax-token ratio.
To get a first impression, consider Figure 9.4, which shows the growth of the
TTR (left) and HTR (right) in the full fiction and newspaper sections of the
BNC.
Both the TTR and the HTR suggest that the suffix is more productive in fic-
tion: the ratios rise faster in fiction than in newspapers and remain consistently
higher as we go through the two sub-corpora. It is only when the tokens have
been exhausted in the fiction subcorpus but not in the newspaper subcorpus
that the ratios in the latter slowly catch up. This broadly supports our hypothe-
sis, but let us look at the genre differences more closely both qualitatively and
quantitatively.
In order to compare the two genres in terms of the type-token and hapax-
token ratios, the samples need to have the same size. The following discussion is based
on the full data from the fiction subcorpus and a subsample of the newspaper
corpus that was arrived at by deleting every second, then every third and finally
every 192nd example, ensuring that the hits in the sample are spread through the
entire newspaper subcorpus.
[Figure: two panels plotting types (left) and hapax legomena (right) against
tokens for the Genre categories fiction and newspapers]
Figure 9.4: Nouns with the suffix -ship in the Genre categories fiction
and newspapers (BNC)
Let us begin by looking at the types. Overall, there are 96 different types, 48
of which occur in both samples (some examples of types that are frequent in both
samples are relationship (the most frequent word in the fiction sample), cham-
pionship (the most frequent word in the news sample), friendship, partnership,
lordship, ownership and membership). In addition, there are 36 types that occur
only in the prose sample (for example, churchmanship, dreamership, librarianship
and swordsmanship) and 12 that occur only in the newspaper sample (for exam-
ple, associateship, draughtsmanship, trusteeship and sportsmanship). The number
of types exclusive to each genre suggests that the suffix is more important in
fiction than in newspapers.
The TTR of the suffix in newspaper language is 60/2862 = 0.021, and the HTR
is 20/2862 = 0.007. In contrast, the TTR in fiction is 84/2862 = 0.0294, and the HTR
is 29/2862 = 0.0101. Although the suffix, as expected, is generally not very produc-
tive, it is more productive in fiction than in newspapers. As Table 9.20 shows, this
difference is statistically significant in the sample (χ² = 4.1, df = 1, p < 0.05).
This corroborates our hypothesis, but note that it does not tell us whether the
higher productivity of -ship is something unique about this particular morpheme,
or whether fiction generally has more derived words due to a higher overall lex-
ical richness. To determine this, we would have to look at more than one affix.
Let us now turn to the hapax legomena. These are so rare in both genres that
the difference in HTR is not statistically significant, as Table 9.21 shows (χ² =
1.67, df = 1, p = 0.1966). We would need a larger corpus to see whether the
difference would at some point become significant.
Table 9.20: Types with -ship in fiction and newspapers (BNC)

                              Type
                     new        ¬ new       Total
Genre  fiction          84       2 778       2 862
                    (72.00)   (2 790.00)
       newspaper        60       2 802       2 862
                    (72.00)   (2 790.00)
Total                  144       5 580       5 724
Table 9.21: Hapax legomena with -ship in fiction and newspapers (BNC)

                              Type
                     hapax     ¬ hapax      Total
Genre  fiction          29       2 833       2 862
                    (24.50)   (2 837.50)
       newspaper        20       2 842       2 862
                    (24.50)   (2 837.50)
Total                   49       5 675       5 724
To conclude this case study, let us look at a particular problem posed by the
comparison of the same suffix in two genres with respect to the HTR. At first
glance – and this is what is shown in Table 9.21 – there seem to be 29 hapaxes
in fiction and 20 in newspapers. However, there is some overlap: the words generalship,
headship, managership, ministership and professorship occur as hapax legomena
in both samples; other words that are hapaxes in one subsample occur several
times in the other, such as brinkmanship, which is a hapax in fiction but occurs
twice in the newspaper sample, or acquaintanceship, which is a hapax in the
newspaper sample but occurs 15 times in fiction.
It is not straightforwardly clear whether such cases should be treated as ha-
paxes. If we think of the two samples as subsamples of the same corpus, it is very
counterintuitive to do so. It might be more reasonable to count only those words
as hapaxes whose frequency in the combined subsamples is still one. However,
the notion “hapax” is only an operational definition for neologisms, based on the
hope that the number of hapaxes in a corpus (or sub-corpus) is somehow indica-
tive of the number of productive coinages. We saw in Case Study 9.2.1.1 that this
is a somewhat vain hope, as the correlation between neologisms and hapaxes is
not very impressive.
Still, if we want to use this operational definition, we have to stick with it and
define hapaxes strictly relative to whatever (sub-)corpus we are dealing with. If
we extend the criterion for hapax-ship beyond one subsample to the other, why
stop there? We might be even stricter and count only those words as hapaxes
that are still hapaxes when we take the entire BNC into account. And if we take
the entire BNC into account, we might as well count as hapaxes only those words
that occur only once in all accessible archives of the language under investigation.
This would mean that the hapaxes in any sample would overwhelmingly cease to
be hapaxes – the larger our corpus, the fewer hapaxes there will be. To illustrate
this: just two words from the fiction sample retain their status as hapax legomena
if we search the Google Books collection: impress-ship, which does not occur at
all (if we discount linguistic accounts which mention it, such as Trips (2009),
or this book, once it becomes part of the Google Books archive), and cloudship,
which does occur, but only referring to water- or airborne vehicles. At the same
time, the Google Books archive contains hundreds (if not thousands) of hapax
legomena that we never even notice (such as Johnship ‘the state of being the
individual referred to as John’). The idea of using hapax legomena is, essentially,
that a word like mageship, which is a hapax in the fiction sample, but not in the
Google Books archive, somehow stands for a word like Johnship, which is a true
hapax in the English language.
This case study has demonstrated the potential of using the TTR and the HTR
not as a means of assessing morphological richness and productivity as such, but
as a means of assessing genres with respect to their richness and productivity. It
has also demonstrated some of the problems of identifying hapax legomena in
the context of such cross-variety comparisons. As mentioned initially, there are
not too many studies of this kind, but Plag (1999) presents a study of productiv-
ity across written and spoken language that is a good starting point for anyone
wanting to fill this gap.
BNC. She finds no difference in productivity for -ness, but a higher productiv-
ity of -ity in the language produced by men (cf. also Säily & Suomela 2009 for
a diachronic study with very similar results). She uses a sophisticated method
involving the comparison of the suffixes’ type and hapax growth rates, but let
us replicate her study using the simple method used in the preceding case study,
beginning with a comparison of type-token ratios.
The BNC contains substantially more speech and writing by male speakers
than by female speakers, which is reflected in differences in the number of affix
tokens produced by men and women: for -ity, there are 2562 tokens produced
by women and 8916 tokens produced by men; for -ness, there are 616 tokens
produced by women and 1154 tokens produced by men (note that unlike Säily,
I excluded the words business and witness, since they did not seem to me to be
synchronically transparent instances of the affix). To get samples of equal size for
each affix, random subsamples were drawn from the tokens produced by men.
Based on these subsamples, the type-token ratios for -ity are 0.0652 for women
and 0.0777 for men; as Table 9.22 shows, this difference is not statistically
significant (χ² = 3.01, df = 1, p > 0.05, φ = 0.0242).
Table 9.22: Types with -ity in male and female speech (BNC)

                                   Type
                          new    seen before    Total
Speaker Sex  female         167       2 395      2 562
                        (183.00)   (2 379.00)
             male           199       2 363      2 562
                        (183.00)   (2 379.00)
Total                       366       4 758      5 124
The type-token ratios for -ness are much higher, namely 0.1981 for women
and 0.2532 for men. As Table 9.23 shows, the difference is statistically significant,
although the effect size is weak (χ² = 5.37, df = 1, p < 0.05, φ = 0.066).
Table 9.23: Types with -ness in male and female speech (BNC)

                                   Type
                          new    seen before    Total
Speaker Sex  female         122         494        616
                        (139.00)     (477.00)
             male           156         460        616
                        (139.00)     (477.00)
Total                       278         954      1 232

Note that Säily investigates spoken and written language separately and she
also includes social class in her analysis, so her results differ from the ones pre-
sented here; she finds a significantly lower HTR for -ness in lower-class women’s
speech in the spoken subcorpus, but not in the written one, and a significantly
lower HTR for -ity in both subcorpora. This might be due to the different meth-
ods used, or to the fact that I excluded business, which is disproportionally fre-
quent in male speech and writing in the BNC and would thus reduce the diver-
sity in the male sample substantially. However, the type-based differences do
not have a very impressive effect size in our design and they are unstable across
conditions in Säily’s, so perhaps they are simply not very substantial.
Let us turn to the HTR next. As before, we are defining what counts as a hapax
legomenon not with reference to the individual subsamples of male and female
speech, but with respect to the combined sample. Table 9.24 shows the hapaxes
for -ity in the male and female samples. The HTRs are very low, suggesting that
-ity is not a very productive suffix: 0.0099 in female speech and 0.016 in male
speech.
Table 9.24: Hapaxes with -ity in samples of male and female speech
(BNC)
male speech
abnormality, antiquity, applicability, brutality, civility, criminality, deliverability,
divinity, duplicity, eccentricity, eventuality, falsity, femininity, fixity, frivolity,
illegality, impurity, inexorability, infallibility, infirmity, levity, longevity,
mediocrity, obesity, perversity, predictability, rationality, regularity, reliability,
scarcity, seniority, serendipity, solidity, subsidiarity, susceptibility, tangibility,
verity, versatility, virtuality, vitality, voracity
female speech
absurdity, adjustability, admissibility, centrality, complicity, effemininity,
enormity, exclusivity, gratuity, hilarity, humility, impunity, inquisity, morbidity,
municipality, originality, progility, respectability, sanity, scaleability, sincerity,
spontaneity, sterility, totality, virginity
Although the difference in HTR is relatively small, Table 9.25 shows that it is
statistically significant, albeit again with a very weak effect size (χ² = 3.93,
df = 1, p < 0.05, φ = 0.0277).
Table 9.25: Hapax legomena with -ity in male and female speech (BNC)

                                   Type
                         hapax     ¬ hapax      Total
Speaker Sex  female          25       2 537      2 562
                         (33.00)   (2 529.00)
             male            41       2 521      2 562
                         (33.00)   (2 529.00)
Total                        66       5 058      5 124
Table 9.26 shows the hapaxes for -ness in the male and female samples. The
HTRs are low, but much higher than for -ity, 0.0795 for women and 0.1023 for
men.
As Table 9.27 shows, the difference in HTRs is not statistically significant, and
the effect size would be very weak anyway (χ² = 1.93, df = 1, p > 0.05,
φ = 0.0395).
In this case, the results correspond to Säily’s, who also finds a significant dif-
ference in productivity for -ity, but not for -ness.
This case study was meant to demonstrate, once again, the method of compar-
ing TTRs and HTRs based on samples of equal size. It was also meant to draw
attention to the fact that morphological productivity may be an interesting area
of research for variationist sociolinguistics; however, it must be pointed out that
it would be premature to conclude that men and women differ in their productive
use of particular affixes; as Säily herself points out, men and women are not only
represented unevenly in quantitative terms (with a much larger proportion of
male language included in the BNC), but also in qualitative terms (the language
varieties with which they are represented differ quite strikingly). Thus, this may
actually be another case of different degrees of productivity in different language
varieties (which we investigated in the preceding case study).
Table 9.26: Hapaxes with -ness in samples of male and female speech
(BNC)
male speech
abjectness, adroitness, aloneness, anxiousness, awfulness, barrenness, blackness,
blandness, bluntness, carefulness, centredness, cleansiness, clearness, cowardliness,
crispness, delightfulness, differentness, dizziness, drowsiness, dullness,
eyewitnesses, fondness, fulfilness, genuineness, godliness, graciousness, headedness,
heartlessness, heinousness, keenness, lateness, likeliness, limitedness, loudness,
mentalness, messiness, narrowness, nearness, neighbourliness, niceness, numbness,
pettiness, pleasantness, plumpness, positiveness, quickness, reasonableness,
rightness, riseness, rudeness, sameness, sameyness, separateness, shortness,
smugness, softness, soreness, springiness, steadiness, stubbornness, timorousness,
toughness, uxoriousness
female speech
ancientness, appropriateness, badness, bolshiness, chasifness, childishness,
chubbiness, clumsiness, conciseness, eagerness, easiness, faithfulness, falseness,
feverishness, fizziness, freshness, ghostliness, greyness, grossness, grotesqueness,
heaviness, laziness, likeness, mysteriousness, nastiness, outspokenness, pinkness,
plainness, politeness, prettiness, priggishness, primness, randomness,
responsiveness, scratchiness, sloppiness, smoothness, stiffness, stretchiness,
tenderness, tightness, timelessness, timidness, ugliness, uncomfortableness,
unpredictableness, untidiness, wetness, zombieness
Table 9.27: Hapax legomena with -ness in male and female speech (BNC)

                                   Type
                         hapax     ¬ hapax      Total
Speaker Sex  female          49         567        616
                         (56.00)     (560.00)
             male            63         553        616
                         (56.00)     (560.00)
Total                       112       1 120      1 232
10 Text
As mentioned repeatedly, linguistic corpora, by their nature, consist of word
forms, while other levels of linguistic representation are not represented unless
the corresponding annotations are added. In written corpora, there is one level
other than the lexical that is (or can be) directly represented: the text. Well-con-
structed linguistic corpora typically consist of (samples from) individual texts,
whose meta-information (author, title, original place and context of publication,
etc.) are known. There is a substantial body of corpus-linguistic research based
on designs that combine the two inherently represented variables Word (Form)
and Text; such designs may be concerned with the occurrence of words in indi-
vidual texts, or, more typically, with the occurrence of words in clusters of texts
belonging to the same language variety (defined by topic, genre, function, etc.).
Texts are, of course, produced by speakers, and depending on how much and
what kind of information about these speakers is available, we can also cluster
texts according to demographic variables such as dialect, socioeconomic status,
gender, age, political or religious affiliation, etc. (as we have done in many of
the examples in earlier chapters). In these cases, quantitative corpus linguistics
is essentially a variant of sociolinguistics, differing mainly in that the linguistic
phenomena it pays most attention to are not necessarily those most central to
sociolinguistic research in general.
¹ The term keyword is frequently spelled as two words (key word) or with a hyphen (key-word).
I have chosen the spelling as a single word here because it seems simplest (at least to me, as a
native writer of German, where compounds are always spelled as single words).
slightly broader sense of words that are characteristic of a particular text, lan-
guage variety or demographic in the sense that they occur with “unusual fre-
quency in a given text” or set of texts, where “unusual” means high “by compar-
ison with a reference corpus of some kind” (Scott 1997: 236).
In other words, the corpus-linguistic identification of keywords is analogous to
the identification of differential collocates, except that it analyses the association
of a word W to a particular text (or collection of texts) T in comparison to the
language as a whole (as represented by the reference corpus, which is typically
a large, balanced corpus). Table 10.1 shows this schematically.
Table 10.1: A generic 2-by-2 table for keyword analysis

                               Text
                    text/corpus t    reference corpus    Total
Word  word w             O11               O12             R1
      other words        O21               O22             R2
Total                    C1                C2              N
Just like collocation analysis, keyword analysis is most often applied induc-
tively, but there is nothing that precludes a deductive design if we have hypothe-
ses about the over- or underrepresentation of particular lexical items in a par-
ticular text or collection of texts. In either case, we have two nominal variables:
Keyword (with the individual words as values) and Text (with the values text
and reference corpus).
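
Computationally, keyness can be assessed for each word with any of the
association measures discussed in Chapter 7; the sketch below uses a χ² test
on the 2-by-2 table of Table 10.1 (the function name and parameter names are
mine):

    from scipy.stats import chi2_contingency

    def keyness(freq_in_text, text_size, freq_in_ref, ref_size):
        """Chi-square keyness of a word in a text (or collection of
        texts) against a reference corpus."""
        table = [[freq_in_text, freq_in_ref],
                 [text_size - freq_in_text, ref_size - freq_in_ref]]
        chi2, p, df, _ = chi2_contingency(table, correction=False)
        return chi2, p
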
If keyword analysis is applied to a single text, the aim is typically to identify
either the topic area or some stylistic property of that text. When applied to text
categories, the aim is typically to identify general lexical and/or grammatical
properties of the language variety represented by the text categories.
As a first example of the kind of results that keyword analysis yields, con-
sider Table 10.2, which shows the 20 most frequent tokens (including punctua-
tion marks) in the LOB corpus and two individual texts (all words were converted
to lower case).
As we can see, the differences are relatively small, as all lists are dominated
by frequent function words and punctuation marks. Ten of these occur on all
three lists (a, and, in, of, that, the, to, was, the comma and the period), and an-
other six occur on two of them (as, he, it, on, and opening and closing quotation
marks – although the latter are single quotation marks in the case of LOB and
double quotation marks in the case of Text B). Even the types that occur only
once are mostly uninformative with respect to the language variety (or text cat-
egory) we may be dealing with (1959, at, by, for, had, is, with, the hyphen and
opening and closing parentheses). The only exceptions are four content words
in Text A: Neosho, river, species, station – these suggest that the text is about the
Neosho river and perhaps that it deals with biology (as suggested by the word
species).
Applying keyword analysis to each text or collection of texts allows us to iden-
tify the words that differ most significantly in frequency from the reference cor-
pus, telling us how the text in question differs lexically from the (written) lan-
guage of its time as a whole. Table 10.3 lists the keywords for Text A.
The keywords now convey a very specific idea of what the text is about: there
are two proper names of rivers (the Neosho already seen on the frequency list
and the Marais des Cygnes, represented by its constituents Cygnes, Marais and
des), and there are a number of words for specific species of fish as well as the
words river and channel.
The text is clearly about fish in the two rivers. The occurrence of the words
station and abundance suggests a research context, which is supported by the
occurrence of two dates and opening and closing parentheses (which are often
used in scientific texts to introduce references). The text in question is indeed a
scientific report on fish populations: Fish Populations, Following a Drought, In the
Neosho and Marais des Cygnes Rivers of Kansas (available via Project Gutenberg
and in the Supplementary Online Material, file TXQP). Note that the occurrence
of some tokens (such as the dates and the parentheses) may be characteristic of a
language variety rather than an individual text, a point we will return to below.
Next, consider Table 10.4, which lists the keywords for Text B. Three things are
noticeable: the keyness of a number of words that are most likely proper names
(Hume, Vye, Rynch, Wass, Brodie and Jumala), pronouns (he, his) and punctua-
tion marks indicative of direct speech (the quotation marks and the exclamation
mark).
This does not tell us anything about this particular text, but taken together,
these pieces of evidence point to a particular genre: narrative text (novels, short
stories, etc.). The few potential content words suggest a particular sub-genre:
the archaic hunter in combination with the unusual word flitter is suggestive of
fantasy or science fiction. If we were to include the next twenty most strongly
associated nouns, we would find patrol, camp, needler, safari, guild, tube, planet
and out-hunter, which corroborate the impression that we are dealing with a sci-
ence-fiction novel. And indeed, the text in question is the science-fiction novel
Starhunter by Andre Alice Norton (available via Project Gutenberg and in the Supplementary Online Material, file TXQP).
Again, the keywords identified are a mixture of topical markers and markers
for the language variety (in this case, the genre) of the text, so even a study of
the keywords of single texts provides information about more general linguistic
properties of the text in question as well as its specific topic. But keyword anal-
ysis reveals its true potential when we apply it to clusters of texts, as in the case
studies in the next section.
10.2 Case studies
Table 10.5: Key words in the Learned and Scientific Writing section of
LOB
Even more interestingly, keyword analysis can reveal function words that are characteristic of a particular language variety and thus give us potential insights
into grammatical structures that may be typical for it; for example, is, the and of
are among the most significant keywords of Scientific English. The last two are
presumably related to the nominal style that is known to characterize academic
texts, while the higher-than-normal frequency of is may be due to the prevalence
of definitions, statements of equivalence, etc. This (and other observations made
on the basis of keyword analysis) would of course have to be followed up by
more detailed analyses of the function these words serve – but keyword analysis
tells us what words are likely to be interesting to investigate.
Techniques for studying the association of lexical items with other units of linguistic structure can also be applied to specific language varieties.
For example, Marco (2000) investigates collocational frameworks (see Chap-
ter 8, Section 8.2.1) in medical research papers. While this may not sound par-
ticularly interesting at first glance, it turns out that even highly frequent frame-
works like [a __ of ] are filled by completely different items from those found in
the language as a whole, which is important for many applied purposes (such
as language teaching or machine processing of language), but which also shows
just how different language varieties can actually be. Since Marco’s corpus is
not publicly available and the Learned and Scientific Writing section of LOB is
too small for this kind of analysis, let us use the Written Academic subsection of
the BNC Baby. Table 10.6 shows the 15 most strongly associated collocates in the
framework [a __ of ], i.e. the words whose frequency of occurrence inside this
framework differs most significantly from their frequency of occurrence outside
of this framework in the same corpus section.
If we compare the result in Table 10.6 to that in Table 8.3 in Chapter 8, we notice clear differences between the use of this framework in academic texts and the language as a whole; for example, lot, which is most strongly associated with the framework in the general language, occurs in 9th position, while the top
collocates of the framework are more precise quantification terms like number
or series, and general scientific terms like result and function.
However, the two lists – that in Table 8.3 and that presented here – were de-
rived independently from different corpora, making it difficult to determine the
true extent of the differences. In particular, in each of the two corpora the words
in the pattern compete with the words outside of the pattern, which are obvi-
ously from the same discourse domains. To get a clearer idea of the different function(s) that a pattern might serve in two different language varieties, we can
combine collocational framework analysis and keyword analysis: we extract all
words occurring in a collocational framework (or grammar pattern, construction,
etc.) in a particular language variety, and compare them to the words occurring
in the same pattern in a reference corpus (Stefanowitsch 2017 refers to this mix of
keyword and collostructional analysis as “textually-distinctive [i.e., differential]
collexeme analysis”).
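Schematically, such an analysis might be implemented as follows (a hypothetical Python sketch that assumes both corpora are available as lists of lowercased tokens, simplifies the framework to the exact two-token frame a __ of, and reuses the g_score function sketched in Section 10.1 above):

    from collections import Counter

    def frame_fillers(tokens):
        """Count the fillers of the frame [a __ of] (exactly one word)."""
        return Counter(tokens[i + 1] for i in range(len(tokens) - 2)
                       if tokens[i] == "a" and tokens[i + 2] == "of")

    def distinctive_collexemes(variety_tokens, reference_tokens):
        """Rank fillers of the pattern in the variety against fillers
        of the same pattern in the reference corpus, by G."""
        var = frame_fillers(variety_tokens)
        ref = frame_fillers(reference_tokens)
        n_var, n_ref = sum(var.values()), sum(ref.values())
        scores = {w: g_score(var[w], ref[w], n_var - var[w], n_ref - ref[w])
                  for w in set(var) | set(ref)}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)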
Table 10.7 shows the result of such an analysis in the BNC Baby, comparing
words occurring in the framework [a(n) _ of ] in the Written Academic section
to the words occurring in the same pattern in the rest of the corpus (all sections
other than Written Academic).
The scientific vocabulary now dominates the collocates of the framework even
more clearly than in the simple collocational framework analysis above: the in-
formal a lot of and other colloquial words are now completely absent. This case
study shows the variability that even seemingly simple grammatical patterns
may display across language varieties. It is also meant to demonstrate how sim-
ple techniques like collocational-framework analysis can be combined with more
sophisticated techniques to yield more insightful results.
Many of the examples in the early chapters of this book demonstrate how, in
principle, lexical differences between varieties can be investigated – take two
sufficiently large corpora representing two different varieties, and study the dis-
tribution of a particular word across these two corpora. Alternatively, we can
study the distribution of all words across the two corpora in the same way as
we studied their distribution across texts or language varieties in the preceding
section.
This was actually done fairly early, long before the invention of keyword analysis, by Johansson & Hofland (1989). They compare all word forms in the LOB and BROWN corpora using a “coefficient of difference”, essentially the percentage of the word in the two corpora.² In addition, they test each difference for significance using the χ² test. As discussed in Chapter 7, it is more advisable – and, in fact, simpler – to use an association measure like G right away, as percentages will massively overestimate infrequent events (a word that occurs only a single time will be seen as 100 percent typical of whichever corpus it happens to occur in); also, the χ² test cannot be applied to infrequent words. Still, Johansson and Hofland’s basic idea is highly innovative and their work constitutes the first example of a keyword analysis that I am aware of.
Comparing two (large) corpora representing two varieties will not, however,
straightforwardly result in a list of dialect differences. Instead, there are at least
five types of differences that such a comparison will uncover. Not all of them will
be relevant to a particular research design, and some of them are fundamental
problems for any research design and must be dealt with before we can proceed.
Table 10.8 shows the ten most strongly differential keywords for the LOB and
BROWN corpora. The analysis is based on the tagged versions of the two corpora
as originally distributed by ICAME.
For someone hoping to uncover dialectal differences between British and American English, these lists are likely to be confusing, to say the least. The hyphen is the strongest American keyword? Quotation marks are typical for British English? The word The is typically American? Clitics like n’t, ’s and ’m are British, while words containing these clitics, like didn’t, it’s and I’m, are American? Of course not – all of these apparent differences between American and British English are actually differences in the way the two corpora were prepared.
² More precisely, in its generalized form, this coefficient is calculated by the following formula, given two corpora A and B:

\[
\frac{\dfrac{f(\mathrm{word}_A)}{\mathrm{size}_A} - \dfrac{f(\mathrm{word}_B)}{\mathrm{size}_B}}{\dfrac{f(\mathrm{word}_A)}{\mathrm{size}_A} + \dfrac{f(\mathrm{word}_B)}{\mathrm{size}_B}}
\]

This formula will give us the percentage of uses of the word in Corpus A or Corpus B (whichever is larger), with a negative sign if it occurs in Corpus B.
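In code, both the coefficient and its weakness with infrequent words are easy to see (a minimal Python sketch of the formula in the footnote, not Johansson and Hofland’s own implementation):

    def coefficient_of_difference(freq_a, size_a, freq_b, size_b):
        """Coefficient of difference in its generalized form: +1 if the
        word occurs only in Corpus A, -1 if only in Corpus B."""
        rel_a, rel_b = freq_a / size_a, freq_b / size_b
        return (rel_a - rel_b) / (rel_a + rel_b)

    coefficient_of_difference(1, 1_000_000, 0, 1_000_000)    # 1.0
    coefficient_of_difference(500, 1_000_000, 0, 1_000_000)  # also 1.0

A hapax legomenon thus receives the same maximal score as a word occurring 500 times, which is exactly why an association measure like G is preferable.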
Table 10.8: Key words of British and American English based on a com-
parison of LOB and BROWN
The
tagged version of the BROWN corpus does not contain quotation marks because
they have intentionally been stripped from the text. The with an uppercase T
does not occur in the tagged LOB corpus, because case is normalized such that
only proper names are capitalized. And clitics are separate tokens in LOB but not
in BROWN.
In other words, the two corpora have to be made comparable before they can
be compared. Table 10.9 shows the 10 most strongly differential keywords for the
LOB and BROWN corpora respectively, after all words in both corpora have been
put into lowercase, all clitics in BROWN have been separated from their stems,
and all tokens consisting exclusively of punctuation marks have been removed,
as have periods at the end of abbreviations like mr. and st.
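The normalization steps just described might look as follows in code (a hypothetical Python sketch; the clitic inventory and abbreviation list shown here are illustrative and would have to be checked against the actual tagged ICAME versions):

    import re

    CLITIC = re.compile(r"(?<=\w)(n't|'s|'m|'re|'ve|'ll|'d)$")
    PUNCT_ONLY = re.compile(r"^\W+$")
    ABBREVIATIONS = {"mr.", "mrs.", "st.", "dr."}  # illustrative only

    def normalize(tokens):
        """Lowercase, drop punctuation-only tokens, strip the period
        from known abbreviations, and split clitics from their stems."""
        out = []
        for tok in tokens:
            tok = tok.lower()
            if PUNCT_ONLY.match(tok):
                continue
            if tok in ABBREVIATIONS:
                tok = tok.rstrip(".")
            m = CLITIC.search(tok)
            if m:
                out.extend([tok[:m.start()], m.group()])
            else:
                out.append(tok)
        return out

    normalize(["Didn't", "Mr.", "Smith", ",", "it's", "fine", "."])
    # ['did', "n't", 'mr', 'smith', 'it', "'s", 'fine']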
This list is much more insightful. There are still some artifacts of corpus con-
struction: the codes F and J are used in BROWN to indicate that letter combina-
tions and formulae have been removed. But the remainder of the keywords is
now representative of the kinds of differences a dialectal keyword analysis will
typically uncover.
First, there are differences in spelling. For example, labour and behaviour are
spelled with ou in Britain, but with o in the USA, the US-American defense is
spelled defence in Britain, and the British programme is spelled program in the
USA. These differences are dialectal and may be of interest in applied contexts,
but they are not likely to be of primary interest to most linguists. In fact, they
are often irritating, since of course we would like to know whether words like
labo(u)r or behavio(u)r are more typical for British or for American English aside
from the spelling differences. To find out, we have to normalize spellings in the
corpora before comparing them (which is possible, but labo(u)r-intensive).
Second, there are proper nouns that differ in frequency across corpora: for ex-
ample, geographical names like London, Britain, Commonwealth, and (New) York
will differ in frequency because their referents are of different degrees of inter-
est to the speakers of the two varieties. There are also personal names that differ
across corpora; for example, the name Macmillan occurs 63 times in the LOB cor-
pus but only once in BROWN; this is because in 1961, Harold Macmillan was the
British Prime Minister and thus Brits had more reason to mention the name. But
there are also names that differ in frequency because they differ in popularity in
the speech communities: for example, Mike is a keyword for BROWN, Michael
for LOB. Thus, proper names may differ in frequency for purely cultural or for
linguistic reasons; the same is true of common nouns.
Table 10.9: Key words of British and American English based on a com-
parison of LOB and BROWN
Third, nouns may differ in frequency not because they are dialectal, but be-
cause the things they refer to play a different role in the respective cultures. State,
for example, is a word found in both varieties, but it is more frequent in US-Amer-
ican English because the USA is organized into 50 states that play an important
cultural and political role.
Fourth, words may differ in frequency due to dialectal differences (as we saw in many of the examples in previous chapters). Take toward and towards, which
mean the same thing, but for which the first variant is preferred in US-American
and the second in British English. Or take round, which is an adjective meaning
‘shaped like a circle or a ball’ in both varieties, but also an adverb with a range
of related meanings that corresponds to American English around.
This case study was mainly intended to demonstrate the difficulty of compar-
ing corpora that are not really comparable in terms of the way they have been
constructed. It was also meant to demonstrate how large-scale comparisons of
varieties of a language can be done and what kind of results they yield. From
a theoretical perspective, these results may seem to be of secondary interest, at
least in the domain of lexis, since lexical differences between the major varieties
of English are well documented. But from a lexicographical perspective, large-
scale comparisons of varieties are useful, especially because dialectal differences
are constantly evolving.
(1) a. These remarkable ships and weapons, ranging the oceans, will be ca-
pable of accurate fire on targets virtually anywhere on earth. (BROWN
G35)
b. A trap for throwing these miniature clays fastens to the barrel so that
the shooter can throw his own targets. (BROWN E10)
Second, the list is not exhaustive, listing only words which show a significant
difference across the two varieties. For example, the obvious items soldier and
soldiers are missing because they are roughly equally frequent in the two vari-
eties. However, if we want to make strong claims about the role of a particular
domain of life (i.e., semantic field) in a culture, we need to take into consideration
not just the words that show significant differences but also the ones that do not.
If there are many of the latter, this would weaken the results.
Table 10.10: Military keywords in BROWN and LOB (cf. Leech & Fallon 1992: 49–50)

Keywords for American English

Word           fW(BROWN)  fW(LOB)  fOTH(BROWN)  fOTH(LOB)      G
corps                110       10    1 014 202  1 012 975  97.39
missile               48        5    1 014 264  1 012 980  40.30
sherman               29        0    1 014 283  1 012 985  40.16
fallout               31        1    1 014 281  1 012 984  35.42
mobile                44        6    1 014 268  1 012 979  32.57
fort                  55       11    1 014 257  1 012 974  31.96
marine                55       12    1 014 257  1 012 973  29.84
major                247      142    1 014 065  1 012 843  28.56
plane                114       49    1 014 198  1 012 936  26.57
guns                  42        8    1 014 270  1 012 977  25.30
column                71       24    1 014 241  1 012 961  24.25
rifle                 63       20    1 014 249  1 012 965  23.34
losses                46       11    1 014 266  1 012 974  23.05
gun                  118       56    1 014 194  1 012 929  22.51
enemy                 88       36    1 014 224  1 012 949  22.43
cavalry               26        3    1 014 286  1 012 982  20.88
military             212      133    1 014 100  1 012 852  18.15
armed                 60       22    1 014 252  1 012 963  18.25
ballistic             17        1    1 014 295  1 012 984  17.21
mercenaries           12        0    1 014 300  1 012 985  16.62
veterans              16        1    1 014 296  1 012 984  15.94
warfare               43       14    1 014 269  1 012 971  15.43
aircraft              70       31    1 014 242  1 012 954  15.41
headquarters          65       28    1 014 247  1 012 957  15.09
militia               11        0    1 014 301  1 012 985  15.23
squad                 18        2    1 014 294  1 012 983  14.70
patriot               10        0    1 014 302  1 012 985  13.85
strategy              22        4    1 014 290  1 012 981  13.70
shot                 113       65    1 014 199  1 012 920  13.04
bullets               21        4    1 014 291  1 012 981  12.65
patrol                25        6    1 014 287  1 012 979  12.49
fire                 187      125    1 014 125  1 012 860  12.32
code                  39       14    1 014 273  1 012 971  12.24
volunteers            29        8    1 014 283  1 012 977  12.63
submarine             26        7    1 014 286  1 012 978  11.62
division             107       64    1 014 205  1 012 921  10.87
combat                27        8    1 014 285  1 012 977  10.87
missiles              32       11    1 014 280  1 012 974  10.68
rifles                23        6    1 014 289  1 012 979  10.61
troop                 16        3    1 014 296  1 012 982   9.75
missions              16        3    1 014 296  1 012 982   9.75
viet                  16        3    1 014 296  1 012 982   9.75
force                230      168    1 014 082  1 012 817   9.62
victory               61       32    1 014 251  1 012 953   9.16
codes                 17        4    1 014 295  1 012 981   8.64
slug                  10        1    1 014 302  1 012 984   8.54
bombers               22        7    1 014 290  1 012 978   8.13
signals               29       11    1 014 283  1 012 974   8.37
manned                12        2    1 014 300  1 012 983   7.91
battle                87       54    1 014 225  1 012 931   7.75
fighters              16        4    1 014 296  1 012 981   7.69
victor                23        8    1 014 289  1 012 977   7.55
lieutenant            29       12    1 014 283  1 012 973   7.24
bombs                 35       16    1 014 277  1 012 969   7.23
destroy               48       25    1 014 264  1 012 960   7.34
veteran               27       11    1 014 285  1 012 974   6.93
campaigns             17        5    1 014 295  1 012 980   6.90
assault               15        4    1 014 297  1 012 981   6.77
pentagon              13        3    1 014 299  1 012 982   6.73
mission               78       49    1 014 234  1 012 936   6.64
strategic             23        9    1 014 289  1 012 976   6.32
battery               18        6    1 014 294  1 012 979   6.26
arms                 121       85    1 014 191  1 012 900   6.28
march                121       85    1 014 191  1 012 900   6.28
bullet                28       12    1 014 284  1 012 973   6.56
pirates               12        3    1 014 300  1 012 982   5.77
targets               22        9    1 014 290  1 012 976   5.61
tactics               20        8    1 014 292  1 012 977   5.30
war                  464      396    1 013 848  1 012 589   5.30
armies                15        5    1 014 297  1 012 980   5.22
marching              15        5    1 014 297  1 012 980   5.22
commands              15        5    1 014 297  1 012 980   5.22
signal                63       40    1 014 249  1 012 945   5.15
weapon                42       24    1 014 270  1 012 961   4.95
civilian              24       11    1 014 288  1 012 974   4.93
enlisted              11        3    1 014 301  1 012 982   4.85
infantry              16        6    1 014 296  1 012 979   4.70
territorial           14        5    1 014 298  1 012 980   4.43
fought                46       28    1 014 266  1 012 957   4.40
command               72       49    1 014 240  1 012 936   4.37
peace                198      159    1 014 114  1 012 826   4.22
winchester            12        4    1 014 300  1 012 981   4.18

Keywords for British English

Word           fW(BROWN)  fW(LOB)  fOTH(BROWN)  fOTH(LOB)      G
medal                  7       37    1 014 305  1 012 948  22.48
trench                 2       15    1 014 310  1 012 970  11.27
disarmament           11       27    1 014 301  1 012 958   6.97
tanks                 18       35    1 014 294  1 012 950   5.57
rank                  24       43    1 014 288  1 012 942   5.49
conquest               9       20    1 014 303  1 012 965   4.29
Third, the study of cultural importance cannot be separated entirely from the
study of dialectal preferences. For example, the word armistice occurs 15 times
in the LOB corpus but only 4 times in BROWN, making it a significant keyword
for British English (G² = 6.80). However, before we conclude that British culture
is more peaceful than American culture, we should check synonyms. We find
that truce occurs 5 times in BROWN and not at all in LOB, making it an equally
significant keyword for American English (G² = 6.92). Finally, cease-fire occurs
7 times in each corpus. In other words, the two cultures differ not in the impor-
tance of cease-fires, but in the words they use to denote them – similar dialectal
preferences may well underlie other items on Leech and Fallon’s list.
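As a quick check, these values can be reproduced with the g_score sketch from Section 10.1 above, using the corpus sizes recoverable from Table 10.10 (LOB: 1,012,985 tokens, BROWN: 1,014,312 tokens):

    # armistice: 15 hits in LOB, 4 in BROWN
    g_score(15, 4, 1_012_985 - 15, 1_014_312 - 4)   # approx. 6.80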
Overall, however, Leech & Fallon (1992) were careful to include words that
occur in military contexts in a substantial number of instances and to cover the
semantic field broadly (armistice and truce were two of the few words I was able
to think of that turned out to have significant associations). Thus, their conclu-
sion that the concept WAR played a more central role in US culture in 1961 than
it did in British culture seems reliable.
This case study is an example of a very carefully constructed and executed
contrastive cultural analysis based on keywords. Note, especially, that Leech and
Fallon do not just look for semantic fields that are strongly represented among
the statistically significant keywords of one corpus, but that they check the entire
semantic field (or a large portion of it) with respect to its associations in both
corpora. In other words, they do not look only for evidence, but also for counter-
evidence, something that is often lacking in cultural keyword studies.
two corpora. Unlike Leech and Fallon in the study described above, it seems that they did not create a complete list of differential keywords and then categorize them into semantic fields, but instead focused on words for kinship relations, spiritual entities and witchcraft straight away.
This procedure yields seemingly convincing word lists like that in Table 10.11, which the authors claim shows “the salience of authority and respect and the figures that can be associated with them” (Wolf & Polzenhagen 2007: 420).
Table 10.11: Keywords relating to authority and respect in a corpus of
Cameroon English (from Wolf & Polzenhagen 2007: 421).
With respect to the latter, note that it is also questionable whether one can
simply combine a British and an American corpus to represent “Western” cul-
ture. First, it assumes that the two cultures individually belong to such a larger
culture and can jointly represent it. Second, it assumes that these two cultures do
not accord any specific importance to whatever domain we are looking at. How-
ever, especially if we choose our keywords selectively, we could easily show that
Authority has a central place in British or American culture. Table 10.12 lists ten
significantly associated keywords from each corpus, resulting from the direct
comparison discussed in 10.2.2. By presenting just one or the other list, we could
make any argument about Britain, the USA and authority that suits our purposes.
This case study demonstrates some of the potential pitfalls of cultural keyword
analysis. This is not to suggest that Wolf & Polzenhagen (2007) is methodologi-
cally flawed, but that the way they present the results does not allow us to deter-
mine the methodological soundness of their approach. Generally, semantic do-
mains should be extracted from corpora as exhaustively as possible in such anal-
yses, and all results should be reported. Also, instead of focusing on one culture
and using another culture as the reference corpus, it seems more straightforward
to compare specific cultures (for example in the sense of “speech communities
within a nation state”) directly, as Leech & Fallon (1992) do (cf. also Oakes &
Farrow 2007 for a method that can be used to compare more than two varieties
against each other in a single analysis).
Table 10.12: Key words from the domain Authority in British and
American English
Table 10.13: Differential collocates for men and women in the domain
personal reference
We see the same general difference as before, so the result is not due to differ-
ent style preferences of men and women, but to the semantic field under investi-
gation.
Many other fields that Schmid investigates make it much more difficult to
come up with a plausibly representative sample of words. For example, for the
domain health and body, Schmid looks at breast, hair, headache, legs, sore throat,
doctor, sick, ill, leg, eyes, finger, fingers, eye, body, hands, and hand and finds that
with the exception of hand they are all more frequently used by women. The se-
lection seems small and rather eclectic, however, so let us enlarge the set with the
words ache, aching, flu, health, healthy, influenza, medicine, nurse, pain and un-
well from the domain health and arms, ear, ears, feet, foot, kidneys, liver, mouth,
muscles, nose, penis, stomach, teeth, thumb, thumbs, tooth, vagina and vulva from
the domain body. Table 10.15 shows the results.
Table 10.14: More differential collocates for men and women in the do-
main personal reference
Table 10.15: Differential collocates for men and women in the domain
body and health
The larger sample supports Schmid’s initial conclusion, but even the larger sample is far from exhaustive. Still, this case study has demonstrated that if we manage to come up with a justifiable selection of lexical items, a deductive keyword analysis can be used to test a particular hypothesis in an efficient and principled way.
Note that one difficulty with sociolinguistic research focusing on lexical items
is that topical differences in the corpora may distort the picture. For example,
among the female keywords we find words like kitchen, baby, biscuits, husband,
bedroom, and cooking which could be used to construct a stereotype of women’s
language as being home- and family-oriented. In contrast, among the male key-
words we find words like minus, plus, percent, equals, squared, decimal as well
as many number words, which could be used to construct a stereotype of male
language as being concerned with abstract domains like mathematics. However,
these differences very obviously depend on the topics of the conversations in-
cluded in the corpus. It is not inconceivable, for example, that male linguists
constructing a spoken corpus will record their male colleagues in a university
setting and their female spouses in a home setting. Thus, we must take care to
distinguish stable, topic-independent differences from those that are due to the
content of the corpora investigated. This should be no surprise, of course, since
keyword analysis was originally invented to uncover precisely such differences
in content.
10.2.4 Ideology
Just as we can choose texts to stand for demographic variables, we can choose
them to stand for the world views or ideologies of the speakers who produced
them. Note that in this case, the texts serve as an operational definition of the
corresponding ideology, an operationalization that must be plausibly justified.
Table 10.17: Differential collocates for the Labour and Liberal Democrat
manifestos (2001)
mentions hypothetical events more frequently, which Rayson takes to mean that
they did not expect to win the election.
Going beyond Rayson’s discussion of individual words, note that the Labour
manifesto does not have any words relating to specific policies among the ten
strongest keywords, while the Liberal Democrats have green and environmental,
pointing to their strong environmental focus, as well as powers, which, when
we look at the actual manifesto, turns out to be due to the fact that they are
very concerned with the distribution of decision-making powers. Why might
this be the case? We could hypothesize that since the Labour Party was already
in power in 2001, they might have felt less of a need than the Liberal Democrats
to mention specific policies that they were planning to implement. Support for
this hypothesis comes from the fact that the Liberal Democrats not only use the
word would more frequently than Labour, but also the word will.
In order to test this hypothesis, we would have to look at a Labour election
manifesto during an election in which they were not in power: the prediction
would be that in such a situation, we would find words relating to specific policies.
Let us take the 2017 election as a test case. There are two ways in which we could
now proceed: We could compare the Labour 2017 manifesto to the 2001 manifesto,
or we could simply repeat Rayson’s analysis and compare the 2017 manifestos
of Labour and the Liberal Democrats. To be safe, let us do both (again, the 2017
manifestos, converted into comparable form, are found in the Supplementary
Online Material, file TXQP).
Table 10.18 shows the results of a comparison between the 2017 Labour and
Liberal Democrat manifestos and Table 10.19 shows the results of the comparison
between the 2001 and 2017 Labour manifestos. In both cases, only the keywords
for the Labour 2017 manifesto are shown, since these are what our hypothesis
relates to.
The results of both comparisons bear out the prediction: most of the significant
keywords in the 2017 manifesto relate to specific policies. The comparison with
the Liberal Democrat manifesto highlights core Labour policies, with words like
workers, unions, women and workplace. The comparison with 2001 partially high-
lights the same areas, suggesting a return to such core policies between the 2001
“New Labour” era of Tony Blair and the 2017 “radical left” era of Jeremy Corbyn.
The comparison also highlights the topical dominance of the so-called Brexit (a
plan for the UK to leave the European Union): this is reflected in the word Brexit
itself, but likely also in words like ensure, protect and protections, and businesses,
which refer to the economic consequences of the so-called Brexit. Of course, the
fact that our prediction is borne out does not mean that the hypothesis about
Table 10.18: Differential collocates for the 2017 Labour manifesto (com-
parison to Liberal Democrats)
being or not being in power is correct. It could simply be that Labour was not
particularly political in 2001 and has generally regained a focus on issues.
This case study has demonstrated that keyword analysis can be used to in-
vestigate ideological differences through linguistic differences. In such investi-
gations, of course, identifying keywords is only the first step, to be followed by a
closer analysis of how these keywords are used in context (cf. Rayson 2008, who
presents KWIC concordances of some important keywords, and Scott 1997, who
identifies collocates of the keywords in a sophisticated procedure that leads to
highly insightful clusters of keywords).
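A minimal concordancer for this second step fits into a few lines (a hypothetical Python sketch; real concordancing software additionally offers sorting, regular-expression queries and corpus-specific tokenization):

    def kwic(tokens, node, span=6):
        """Print every occurrence of `node` with `span` words of context."""
        node = node.lower()
        for i, tok in enumerate(tokens):
            if tok.lower() == node:
                left = " ".join(tokens[max(0, i - span):i])
                right = " ".join(tokens[i + 1:i + 1 + span])
                print(f"{left:>40} [{tok}] {right}")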
One issue that needs consideration is whether in the context of a specific re-
search design it is more appropriate to compare two texts potentially represent-
ing different ideologies directly to each other, as Rayson does, or whether it is
more appropriate to compare each of the two texts to a large reference corpus,
as the usual procedure in keyword analysis would be. In the first case, the focus
will necessarily be on differences, as similarities are removed from the analysis
by virtue of the fact that they will not be statistically significant – we could call
Table 10.19: Differential collocates for the 2017 Labour manifesto (com-
parison to 2001)
this procedure differential keyword analysis. In the second case, both similarities
and differences could emerge; however, so would any vocabulary that is associ-
ated with the domain of politics in general. Which strategy is more appropriate
depends on the aims of our study.
                     Observed   Expected        χ²
 Pronoun   male        18 116     13 241   1794.85
           female       8 366     13 241   1794.85
 Total                 26 482              3589.70
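The χ² components in this table follow the usual formula (O − E)²/E, with the expected frequencies derived from the assumption that male and female pronouns are equally likely. A quick check of the arithmetic in Python:

    obs = {"male": 18_116, "female": 8_366}
    exp = sum(obs.values()) / len(obs)   # 13 241: an equal share for each
    components = {k: (o - exp) ** 2 / exp for k, o in obs.items()}
    print(components, sum(components.values()))
    # {'male': 1794.85..., 'female': 1794.85...} 3589.70...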
Table 10.21: Male and female pronouns in the different text categories of the LOB corpus

                                            male                       female
Text Category                        Obs.     Exp.      χ²     Obs.     Exp.      χ²    Total
a. press: reportage                  1615  1315.50   68.19      308   607.50  147.65     1923
b. press: editorial                   576   465.18   26.40      104   214.82   57.17      680
c. press: reviews                     686   617.73    7.54      217   285.27   16.34      903
d. religion                           462   337.94   45.54       32   156.06   98.62      494
e. skills and hobbies                 325   288.68    4.57       97   133.32    9.89      422
f. popular lore                      1157  1255.30    7.70      678   579.70   16.67     1835
g. belles lettres, biography,        2876  2505.81   54.69      787  1157.19  118.42     3663
   memoirs
h. miscellaneous                      227   203.17    2.79       70    93.83    6.05      297
j. learned                           1079   807.22   91.50      101   372.78  198.14     1180
k. general fiction                   2058  2399.09   48.50     1449  1107.91  105.01     3507
l. mystery and detective fiction     1900  1870.29    0.47      834   863.71    1.02     2734
m. science fiction                    336   283.90    9.56       79   131.10   20.71      415
n. adventure and western fiction     2373  2489.39    5.44     1266  1149.61   11.78     3639
p. romance and love story            2112  2958.68  242.29     2213  1366.32  524.67     4325
r. humor                              334   318.10    0.79      131   146.90    1.72      465
Total                               18116                      8366                     26482
period and sleep, faint, be and arguably stand in the second period). Still, it might
be seen as corroboration of our expectation that the very non-agentive be is a
significant collexeme for the second period.
Table 10.24 shows the results for a direct comparison of the second against the
third period.
Table 10.24: Textually differential collexemes of the going-to future in
1780–1850 vs. 1850–1920
Here, the results are somewhat clearer: While the second period now has just
one potential non-agentive verb (reside), the third period has five (cry, happen,
live, end and arguably stay).
This case study is meant to demonstrate the use of collostructional analysis
and keyword analysis, specifically, textual differential collexeme analysis, as a
[Figure: frequency of god in the Google Books corpus (two panels; x-axis: Year, 1800–2000; y-axis: Frequency of god)]
Figure 10.2: The name Marc Chagall in the US-English and the German parts of the Google Books corpus [line plot; x-axis: Year; y-axis: Frequency; series: English, German]
to be “degenerate” and confiscated from museums, and it makes sense that his
name would not be mentioned in books written in Nazi Germany. However, the
question is, again, what conclusions to draw from such an analysis. Specifically,
we know how to interpret the drop in frequency of the name Marc Chagall dur-
ing the Nazi era in Germany because we know that Marc Chagall’s works were
banned. But if we did not know this, we would not know how to interpret the
change in frequency, since words, especially names, may rise or fall in frequency
for all kinds of reasons.
Consider the following figure, which shows the development of the frequency
of the name Karl Marx in the German and English Google Books archive (ex-
tracted from the bigram files downloaded from the Google Books site, see Sup-
plementary Online Material, file CUBF). Note the different frequency scales –
the name is generally much more frequent in German than in English, but what
interests us are changes in frequency.
Again, we see a rise in frequency in the 1920s, and then a visible decrease dur-
ing the Nazi era from 1933 to 1945. Again, this can plausibly be seen as evidence
Figure 10.3: The name Karl Marx in the English and the German parts of the Google Books corpus [line plot; x-axis: Year, 1880–2000; left y-axis: Frequency (German); right y-axis: Frequency (English)]
for censorship in Nazi Germany. Plausibly, because we know that the Nazis cen-
sored Karl Marx’s writings – they were among the first books to be burned in
the Nazi book burnings of 1933. But what about other drops in frequency, both
in English and in German? There are some noticeable drops in frequency in En-
glish: after 1920, between 1930 and 1940 (with some ups and downs), and at the
beginning of the 1950s. Only the latter could plausibly be explained as the result
of (implicit) censorship during the McCarthy era. Finally, the frequency drops
massively in both languages after 1980, but there was no censorship in either
speech community. A more plausible explanation is that in the 1980s, neoliberal
capitalism became a very dominant ideology, and Marx’s communist ideas sim-
ply ceased to be of interest to many people (if this explanation is correct, we
might see the frequency of the name rise again, given the current widespread disillusionment with neoliberalism).
Thus, the rise and fall in frequency cannot be attributed to a particular cause
without an investigation of the social, economic and political developments dur-
ing the relevant period. As such, culturomics can at best point us towards po-
tentially interesting cultural changes that then need to be investigated in other
disciplines. At worst, it will simply tell us what we already know from those
other disciplines. In order to unfold their potential, such analyses would have
to be done at a much larger scale – the technology and the resources are there,
and with the rising interest in digital humanities we might see such large-scale
analyses at some point.
11 Metaphor
The ease with which corpora are accessed via word forms is an advantage as
long as it is our aim to investigate words, for example with respect to their rela-
tionship to other words, to their internal structure or to their distribution across
grammatical structures and across texts and language varieties. As we saw in
Chapter 8, the difficulty of accessing corpora at levels of linguistic representa-
tion other than the word form is problematic where our aim is to investigate
grammar in its own right, but since grammatical structures tend to be associated
with particular words and/or morphemes, these difficulties can be overcome to
some extent.
When it comes to investigating phenomena that are not lexical in nature, the
word-based nature of corpora is clearly a disadvantage and it may seem as though
there is no alternative to a careful manual search and/or a sophisticated annota-
tion (manual, semi-manual or based on advanced natural-language technology).
However, corpus linguists have actually uncovered a number of relationships
between words and linguistic phenomena beyond lexicon and grammar without
making use of such annotations. In the final chapter of this book, we will discuss
a number of case studies of one such phenomenon: metaphor.
However, as stressed in various places throughout this book, the manual an-
notation of corpora severely limits the amount of data that can be included in a
research design; this does not invalidate manual annotation, but it makes alter-
natives highly desirable. Two broad alternatives have been proposed in corpus
linguistics. Since these were discussed in some detail in Chapter 4, we will only
repeat them briefly here before illustrating them in more detail in the case stud-
ies.
there is a specific reason in the semantics of the target domain that precludes
this (Lakoff 1993).
1. activity, with the metaphors high activity is heat and low activity is coldness, as in cold/hot war, hot pursuit, hot topic, etc. This sense is not recognized by the dictionaries, except insofar as it is implicit in the definitions of cold war, hot pursuit, cold trail, etc. It is understood here to include a sense of hot described in dictionaries as “currently popular” or “of immediate interest” (e.g. hot topic).
2. affection, with the metaphors affection is heat and indifference is coldness, as in cold stare, warm welcome, etc. This sense is recognized by all dictionaries, but we will interpret it to include senses connected with sexual attraction and (un)responsiveness, e.g. hot date.
Table 11.1 lists the token frequencies of the four adjectives with each of these
broad metaphorical categories as well as all types instantiating the respective cat-
egory. There is one category that does not show any significant deviations from
the expected frequencies, namely the infrequently instantiated category synes-
thesia. For all other categories, there are clear differences that are unexpected
from the perspective of conceptual metaphor theory.
The category activity is instantiated only for the words cold and hot and its ab-
sence for the other two words is significant. We can imagine (and, in a sufficiently
large data set, find) uses for cool and warm that would fall into this category. For
example, Frederick Pohl’s 1981 novel The Cool War describes a geopolitical situa-
tion in which political allies sabotage each other’s economies, and it is occasion-
ally used to refer to real-life situations as well. But this seems to be a deliberate
analogy rather than a systematic use, leaving us with an unexpected gap in the
middle of the linguistic scale between hot and cold.
The category affection is found with three of the four words, but its absence
for the word cool is statistically significant, as is its clear overrepresentation with
warm. This lack of systematicity is even more unexpected than the one observed
with activity: for the latter, we could argue that it reflects a binary distinction
that uses only the extremes of the scale, for example because there is not enough
of a potential conceptual difference between a cold war and a cool war. With
Table 11.1: Temperature metaphors (BNC Baby)

                                   Adjective
 Noun                cold        cool    warm    hot              Total
 activity   Obs.:      13           0       0      9                 22
            Exp.:    8.36        4.62    3.96   5.06
            χ²:      2.58        4.62    3.96   3.07
            Types: storage,        –       –   pursuit, seat,
                   turkey, war                 spot, time, war
(2) a. ...the flames of civil war engulfed the central Yugoslav republic. (BNC
AHX)
b. The game was going OK and then it went up in flames. (BNC CBG)
Deignan studies this potential difference systematically based on a sample of
more than 1500 hits for flame/s in the Bank of English (a proprietary, non-acces-
sible corpus owned by HarperCollins), from which she manually extracts all 153
metaphorical uses. These are then categorized according to their connotation.
Deignan’s design thus has two nominal variables: Word Form of Flame (with the values singular and plural) and Connotation of Metaphor (with the values positive and negative). She does not provide an annotation scheme for categorizing the metaphorical expressions, but she provides a set of examples that are intuitively quite plausible. Table 11.2 shows her results (χ² = 53.98, df = 1, p < 0.001).
Table 11.2: Positive and negative metaphors with singular and plural
forms of flame (Deignan 2006: 117)
1 ford base . The jet crashed in a ball of [flame] , destroying 15 cars and damaging 10 mo
2 face down , applying the cheroot to the [flame] . But his eyes never left the four men
3 ls of a child 's that had passed through [flame] and were partially melted . They would
4 ontainer next to him . An orange ball of [flame] ripped up into the sky , bathing the de
5 went out of one door but then a sheet of [flame] came down and blocked me , so I had to
6 . The fire burns evenly with a thin hot [flame] , as though there are no oils or resins
7 ill-smouldering logs , fanning them into [flame] . He places some more logs from a pile
8 of sherry to the momentary blue veil of [flame] on the pudding , been what she would ha
9 truck one and cupped his hand around the [flame] . ` Cheers , ' said the man and dis
10 ed with the element , burning circles of [flame] round creatures she had demanded Ariel
11 the soft promise of the light burst into [flame] ; the vanguard of the islanders fell ba
12 arched for his lighter , and touched the [flame] to the tip to make contact with him . I
13 again , tighter this time , guiding the [flame] . She sucked , and the cigarette end gl
14 There , ' she shouted , pluming liquid [flame] from one claw , ` you 're not the onl
15 and steady to bring the cigarette to the [flame] and kept it for a few seconds longer th
16 ern horns outside their house , the weak [flame] of the candles fluttering in their prot
17 the upstairs windows , a sudden spurt of [flame] , and a part of the roof begin to sag o
18 however , disappear in a white sheet of [flame] . He just kept right on kicking Pikey ,
19 ue to finish . Do n't cook over a fierce [flame] . The outside of the food will cook bef
20 no cushion . Candle erm Church , steeple [Flame] . Steeple . Got it got it got it got it
21 it was winning its battle to put out the [flames] . He had to do it now , while it was s
22 ner , arm raised . Its back was to him , [flames] still glowing deep in its side . He ra
23 how the roof caved in before a sheet of [flames] spread across the fuselage , cutting h
24 control down a steep hill and burst into [flames] . The fully-laden truck careered throu
25 s spent more than two hours fighting the [flames] , police said . Bowbazaar , in the cen
26 h a shining chair by a fire with fragile [flames] . These images had what Alexander desi
27 lood rejected -- racks of fragile spiked [flames] of votive candles , elaborate china an
28 ce , looking into the authentic fake gas [flames] as he sipped his drink . He touched hi
29 ross to the fireplace , staring into the [flames] . ` There 's no reason why he should
30 the ground and died , no explosions , no [flames] reaching to the sky . It simply flippe
31 ack of her head , protected her from the [flames] and blocked out any further damage to
32 W went first , its roof torn open by the [flames] and blast as if by a giant unseen can
33 stunned and wearied by the water and the [flames] , the howling and frantic clangour of
34 annonballs , and caught the smell of the [flames] , of split flesh , and heard the howls
35 . And the best is yet to come . ' ` The [flames] of hell ? ' ` Exactly . Operatic
36 ds , then night crept back in around the [flames] . Trails of burning liquid spiderwebbe
37 e properly . Susan reeled away from us , [flames] springing up where she had been touche
38 , and his leg broke in two places . The [flames] were dying down . I could see his blac
39 mes and lower temperatures to reduce the [flames] . ` Eventually all the cooking was do
40 s . This is Crystal Palace going up in f [flames] . November the thirtieth nineteen thir
It does seem that negative connotations are found more frequently with literal uses of the plural form flames than with literal uses of the singular form flame. Despite the small size of the sample used here, this difference only just fails to reach statistical significance (χ² = 3.64, df = 1, p = 0.0565). The difference would likely become significant if we used a larger sample. However, it is nowhere near as pronounced as in the metaphorical uses presented by Deignan. A crucial difference between literal and metaphorical uses may be that fire is inherently dangerous, so literal references to fire are more likely to be negative than metaphorical ones, which allow us to focus on other aspects of fire. Interestingly, however, most of the negative uses of singular flame occur in constructions like ball of flame, sheet of flame and spurt of flame, where flame could be argued to be a mass noun rather than a true singular form. If we remove these five uses, then the difference between singular and plural becomes very significant even in the now further reduced sample (χ² = 8.58, df = 1, p < 0.01).
Thus, Deignan’s explanation appears to be generally correct, providing evidence for a substantial degree of isomorphism between literal and figurative uses of (at least some) words. An analysis of more such cases could show whether this isomorphism between literal and metaphorical uses is a general principle, as the conceptual theory of metaphor (Lakoff 1993) suggests it should be.
This case study demonstrates first, how to approach the study of metaphor
starting from source-domain words, and, second, that such an approach may
be applied not just descriptively, but in the context of answering fundamental
questions about the nature of metaphor.
(3) a. [I]t has taken until the dawn of the 21st century to realise that the best
methods of utilising . . . our woodlands are those employed a millen-
nium ago. (BNC AHD)
b. Communal life survived until the beginning of the nineteenth century
and traditions peculiar to that way of life had lingered into the present.
(BNC AEA)
Concepts representing entities that have a simple shape and/or have a clear
boundary are less complex than those representing entities with complex
shapes or fuzzy boundaries (because they are more easily delineable). This
follows from the gestalt principles of closure and simplicity (Stefanowitsch
2005: 170).
For each pair of expressions, the differential collexemes are identified and the
resulting lists are compared against these axiomatic assumptions. Let us illustrate
this using the pattern the dawn/beginning of NP. A case-insensitive query for the string dawn or beginning, followed by of, followed by up to three words that are not a noun, followed by a noun, yields the results shown in Table 11.4 (they are
very similar to those based on a more careful manual extraction in Stefanowitsch
2005).
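For replication, this query can be approximated over a part-of-speech-tagged version of the corpus (a hypothetical Python sketch operating on word/TAG tokens; it assumes, as in the CLAWS tagsets used for the BNC, that noun tags begin with N):

    import re

    PATTERN = re.compile(
        r"\b(dawn|beginning)/\S+\s+of/\S+"  # dawn or beginning, then of
        r"(?:\s+\S+/(?![Nn])\S+){0,3}"      # up to three non-noun tokens
        r"\s+(\S+)/[Nn]\S*",                # the noun filling the slot
        re.IGNORECASE)

    def dawn_beginning_hits(tagged_text):
        """Return (dawn|beginning, noun) pairs matching the query."""
        return [(m.group(1).lower(), m.group(2).lower())
                for m in PATTERN.finditer(tagged_text)]

    dawn_beginning_hits("the/AT0 dawn/NN1 of/PRF a/AT0 new/AJ0 era/NN1")
    # [('dawn', 'era')]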
Table 11.4: Differential collexemes of beginning of NP and dawn of NP
(BNC)
era, culture). The one apparent exception is day, but this occurs exclusively in literal uses of dawn of, such as It was the dawn of the fourth day since the murder (BNC CAM). This result (and similar results for other pairs of expressions) is presented in Stefanowitsch (2005) as evidence for a cognitive function of metaphor.
In a short discussion of this study, Liberman (2005) notes in passing that even
individual decades and centuries may differ in the degree to which they prefer
beginning of or dawn of : using internet search engines, he shows that dawn of
the 1960s is more probable than dawn of the 1980s compared to beginning of the
1960s/1980s, and that dawn of the 21st century is more probable than dawn of the
18th century compared to beginning of the 18th/21st century. He rightly points
out that this seems to call into question the properties of boundedness and well-
defined length that Stefanowitsch (2005) appeals to, since obviously all decades/
centuries are equally bounded.
Since search engine frequency data are notoriously unreliable, let us replicate
this observation in a large corpus, the 400+ million word Corpus of Contempo-
rary American English (COCA). The names of decades (such as 1960s or sixties)
occur too infrequently with dawn of in this corpus to say anything useful about
them, but the names of centuries are frequent enough for a differential collexeme
analysis.
Table 11.5 shows the percentage of dawn of for the past ten centuries (spelling variants of the respective centuries, such as 19th century, nineteenth century, etc., as well as spelling errors were normalized to the spelling shown in this table).
There are clear differences between the centuries associated with dawn and
those associated with beginning: the literal expression is associated with the
past (nineteenth, seventeenth (just below significance)), while the metaphorical
expression, as already observed by Liberman, is associated with the twenty-first
century, i.e., the future (the expressions a new, our new and the incoming also
support this). I would argue that this does point to a difference in boundedness
and duration. While all centuries are, objectively speaking, of the same length and
have the same clear boundaries, it seems reasonable to assume that the past feels
more bounded than the future because it is actually over, and we can imagine it
in its entirety. In contrast, none of the speakers in the COCA will live to see the
end of the 21st century, making it conceptually less bounded to them.
If this is true, then we should be able to observe the same effect in the past:
When the twentieth century was still the future, it, too, should have been associ-
ated with the metaphorical dawn of. Let us test this hypothesis using the Corpus
of Historical American English, which includes language from the early nine-
teenth to the very early twenty-first century – in a large part of the corpus, the
twentieth century was thus entirely or partly in the future. Table 11.6 shows the
differential collexemes of the two expressions in this corpus.
                                        dawn of                 beginning of
 Text Category                   Obs.    Exp.     χ²      Obs.      Exp.    χ²    Total
 prose                             54   50.72   0.21      1324   1327.28  0.01     1378
 miscellaneous published           30   25.14   0.94       653    657.86  0.04      683
 fiction                           28    9.50  36.05       230    248.50  1.38      258
 newspaper                         13    8.58   2.28       220    224.42  0.09      233
 academic                          11   27.31   9.74       731    714.69  0.37      742
 unpublished                        1    6.00   4.17       162    157.00  0.16      163
 spoken (all)                       0    9.75   9.75       265    255.24  0.37      265
1 rences in the way of life and pursuit of [happiness] , differences in our social system and
2 and laughter , he feels , engender more [happiness] than politics or philanthropy . at a me
3 experiences the true meaning of love and [happiness] . ' X-certificate . Phillipe Lemarre ha
4 used of experiences of life and death , [happiness] and sorrow ( cf Job 9.25 ; Ps 16.10 ; I
5 or an ultimate goal to the merriment and [happiness] that life does contain in some of its s
6 says about the relation of goodness and [happiness] . most people know Heine's brilliant je
7 duty is not concerned with consequence : [happiness] is concerned with nothing else . here w
8 about the supreme good - which includes [happiness] . A E Taylor has said that what disting
9 ending improvement need not mean perfect [happiness] there any more than here . but after se
10 ew moral intuition . ` that goodness and [happiness] ought to go together , and the existenc
11 he seems to have overcome the dualism of [happiness] and duty but at a cost . he has been vi
12 dly meets the problem ` does Kant regard [happiness] as a good thing or not ? ' the answer w
13 we prove ourselves worthy or unworthy of [happiness] in the next . but in this life is it no
14 n this life is it not lawful to seek the [happiness] of others ? on stern Kantian grounds ,
15 ng attitudes , to a life of fulfilment , [happiness] and success . as each year passes the s
16 e and the car , will not bring increased [happiness] to our increased leisure . nor will the
17 ge of this fundamental truth - that real [happiness] and satisfaction is found in doing for
18 hich no one will read . '' sign here for [happiness] . Judith Simons meets a woman who share
19 hrough it - that moment when all hope of [happiness] seems lost for ever . they said they 'd
20 nts are able to provide tranquillity and [happiness] within the home itself , and in their d
21 not only in money but in the health and [happiness] of its people and the enhanced prestige
22 hey studied Richard Lucas' enquiry after [happiness] , Norris' sermons , Stephen's letters a
23 ctively in the sacrifice of her sister's [happiness] , or in consolidating her own usurpatio
24 he summertime , sent her into shrieks of [happiness] . she loved bright objects and pleasant
25 us beauty of the Latin liturgy ` a vital [happiness] ' . it was to him a means of mediation
26 ture . to all who have retired , we wish [happiness] and long life . research leaders honour
27 her told stories about the war a curious [happiness] came over him which the stories themsel
28 tomorrow afternoon ? ' he felt a glow of [happiness] steal over him . everything was all rig
29 ee you happy . ' ` there will n't be any [happiness] for me until I can prove him guilty . '
30 ith an undescribable expression of utter [happiness] . seeing Heather he came to her and dan
31 in turn with an expression of ineffable [happiness] on his flat face . quickly taking his c
32 ommand . Sirisec . '' he looked up , all [happiness] gone from his leathery features . ` oh
33 . though he never expected to attain the [happiness] he yearned for in a daughter-in-law and
34 West again . Barry had brought her more [happiness] than she had ever known was possible ,
35 ut there 'll be sons for you - aye , and [happiness] , too - when Helen 's gone from your si
36 y known before that there was no hope of [happiness] in the future for her and Gavin . if he
37 is own love for her , his desire for her [happiness] . far better that she should believe hi
38 e word . Nicholas , Philip ... where was [happiness] , or peace of mind ? Philip put out a h
39 ved Sandra too deeply to ruin her future [happiness] . had ever circumstances conspired so c
40 tood there staring at Julia with all the [happiness] draining out of her pretty little face
41 a burden to be endured and never never a [happiness] to be anticipated . now , her young mou
42 wards he had believed that she had found [happiness] with the bluff sailor and he 'd been ge
43 y nothing of this . it concerns Missie's [happiness] . '' so that was it ! someone was anxio
44 k . ' Mollie followed him , bemused with [happiness] . she moved on a cloud , floating effor
45 they sat for an hour , bemused by their [happiness] , feeling that all things were possible
46 on of Dorcas and Adrian Mallory , of the [happiness] of that girl on the eve of her marriage
47 change , even for a fortnight , the warm [happiness] of being with Neil , of sharing with hi
48 mented minute had been a tiny stretch of [happiness] . he leaned from the carriage window an
49 ery on a fast vanishing hope of ultimate [happiness] . Betty was right . Kay must not be for
50 ' ` yes , and we 'll drink to our future [happiness] , Bill ! ' she answered , raising her f
51 s of golden sunshine and music and utter [happiness] . the knowledge that she might never se
52 em , Tandy felt a private little glow of [happiness] . for so long , now , she 'd felt respo
53 re so selfish . " " The whole concept of [happiness] , mother dearest , is outdated . Your ph
find NP_EMOT] (line 42), with NP_EXP indicating the slot for the noun referring to the experiencer of the emotion. Note that, as is typical in metaphorical pattern analysis, the patterns are generalized with respect to the slot of the emotion noun (we could find the same patterns with other emotions), and they are relatively close in form to the actual citation. We could subsume the passive in line 17 under the same pattern as the active in line 42, of course, but there might be differences across emotion terms, varieties, etc. concerning voice (and other formal aspects of the pattern), and there is little gained by discarding this information.
The transfer metaphor is also instantiated a number of times in the concordance, namely as [NP_STIM bring NP_EMOT] (lines 16 and 34), and [NP_STIM provide NP_EMOT] (line 20).
Additional clear cases of metaphorical patterns are [glow of NP_EMOT] (lines 28 and 52) and [warm NP_EMOT] (line 47), which instantiate the metaphor happiness is warmth, and [NP_EMOT drain out of NP_EXP’s face] (line 40), which instantiates happiness is a liquid filling the experiencer. In other cases, it depends on our judgment (which we have to defend within a given research design) whether a hit constitutes a metaphorical pattern. For example, do we want to analyze [NP_EXP’s NP_EMOT] (lines 23 and 43) and [PRON.POSS.EXP NP_EMOT] (lines 37, 39, 45, 50) as happiness is a possessed object, or do we consider the possessive construction to be too abstract semantically to be analyzed as metaphorical? Similarly, do we analyze [NP_STIM engender NP_EMOT] (line 2) as an instance of happiness is an organism, based on the etymology of engender, which comes from Latin generare ‘beget’ and was still used for organisms in Middle English (cf. Chaucer’s ...swich licour, / Of which vertu engendred is the flour)? You might want to think about these and other cases in the concordance, to get a sense of the kind of annotation scheme you would need to make such decisions on a principled, replicable basis.
For now, let us turn to Indian English. Table 11.3 shows the hits for the query ⟨[word="happiness"%c]⟩ in the Kolhapur corpus. Here, 35 hits for the phrase harmonious happiness have been removed, because they all came from one text extolling the virtues of the principle of harmonious happiness (17 hits), (moral) standard of harmonious happiness (15 hits), (moral) good of harmonious happiness (2 hits) or norm of harmonious happiness (1 hit). This text is obviously very much an outlier, as it contains almost as many hits as the entire rest of the corpus combined, and as the hits are extremely restricted in their linguistic behavior. To discard them might not seem ideal, but to include them would be even less so.
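Such a cleaning step can also be scripted. Here is a minimal sketch in R, assuming the concordance has been saved as plain text with one hit per line (the file name is hypothetical):

# Read the concordance, one hit per line (hypothetical file name)
hits <- readLines("kolhapur_happiness_concordance.txt")
# Drop all hits containing the outlier phrase "harmonious happiness"
hits <- hits[!grepl("harmonious happiness", hits, ignore.case = TRUE)]
length(hits)  # number of hits remaining for the analysis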
Again, the search metaphor is instantiated several times in the concordance: we find [search for NP_EMOT] (lines 2 and 4) and [NP_EXP seek NP_EMOT] (lines 3 and 5). They all seem to be from the same text, so similar considerations apply
1 ove . Dr . Patwardhan expresses both her [happiness] at seeing the growth of Anandgram and he
2 with mature understanding the search for [happiness] of an actress . The Shyam Benegal and Bl
3 nd lives on her earnings makes Usha seek [happiness] elsewhere . The search for happiness of
4 eek happiness elsewhere . The search for [happiness] of this intensely sensitive girl leads h
5 are seeking something , seeking peace , [happiness] , seeking a nobler way of life , seeking
6 e occasion , the Buddha himself bringing [happiness] to a doomed city , and accordingly , the
7 their suffering and obtain security and [happiness] is by seeking to change and transform so
8 e and notoriety , censure and praise and [happiness] and misery . Just as the stalk gives bir
9 tute , consoled the stricken and brought [happiness] to the miserable . He did not run away f
10 th the physical body is another name for [happiness] . Finally , when the mind is stilled ( i
11 d through such purity of mind to achieve [happiness] . It also says that if one acts or speak
12 or control of mind which is conducive to [happiness] because it flits and floats all over and
13 ual cooperation , the key to success and [happiness] , are at a discount . Even people who ar
14 edas there are many prayers for wealth , [happiness] and glory . " We call on Thee for prospe
15 from sin and full of wealth , leading to [happiness] day by day . " ( Rig ) " May I be glorio
16 I am confident that I can sing to bring [happiness] to my listeners and fulfilment to myself
17 se wanderers together again and there is [happiness] . When they all return to Jaipur they di
18 his old position . Thus there is double [happiness] for all . This plot will give an idea of
19 s possible at the cost of the people ' s [happiness] . The freedom fighter for India against
20 is this ennobling vision of the world of [happiness] and contentment which I have always born
21 eace alone there is human fulfilment and [happiness] . But even if the goal appears distant ,
22 with the peace and prosperity , life and [happiness] of the society ? The only answer to this
23 of you . I wish you every prosperity and [happiness] in the coming years . I have served you
24 ull of vegetation , trees , orchards and [happiness] . But she could not do that for she was
25 ould make her if you did . Learn to give [happiness] to people , all you modern children are
26 didn ' t wish to come in the way of our [happiness] , at which she , my wife , pretended to
27 hree children . They did not know of the [happiness] we shared -- the exciting excursions , t
28 for breakfast with us . It gave us great [happiness] , though he was not his cheerful old sel
29 t more could she desire ? The goddess of [happiness] and mirth had visited her . Forty-five m
30 ure of health . Nobody was bursting with [happiness] ; there was no expectation of sharing in
31 pronounced my son as completely cured . [Happiness] flooded my heart . Silently I held my wi
32 l . He was filled with a kind of childsh [happiness] . He wanted to scamper over the rocks ,
33 said Joan . Janaki ' s face beamed with [happiness] at the comparison . From that day , Joan
34 had a warm sniff of its steam . But his [happiness] drove him into one of those sudden snooz
35 on ' s ? Have I always placed Dinesh ' s [happiness] above mine ? Am I not selfish and posses
36 me in ample measure for the pleasure and [happiness] my stories and novels have brought into
37 tting together , bit by bit . Moments of [happiness] are such fleeting things . Maybe they al
38 leeting things . Maybe they always are . [Happiness] - - maybe it ' s just the burden of a bi
39 is door had played a stellar role in his [happiness] . Day after day , he had sat there like
40 rtainly put me on the highway to eternal [happiness] but it could do nothing about my immedia
41 rs will travel through life in peace and [happiness] in spite of delays , discomfort and suff
                          Corpus
                    lob           kolhapur       Total
Type   transfer      3  (5.64)      7  (4.36)      10
      ¬transfer     50 (47.36)     34 (36.64)      84
Total               53             41              94
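If you want to check the statistical evaluation of this table yourself, here is a minimal sketch in R; the counts are copied from the table above, and the Yates continuity correction that chisq.test() applies to 2×2 tables by default is switched off so that the expected frequencies match the values given in parentheses:

# Observed frequencies of the transfer metaphor (from the table above)
transfer <- matrix(c( 3,  7,
                     50, 34),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Type   = c("transfer", "other"),
                                   Corpus = c("LOB", "Kolhapur")))
# correct = FALSE switches off Yates' continuity correction
chisq.test(transfer, correct = FALSE)  # chi-square statistic, df and p-value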
Let us ignore the patterns in (4g–i), which occurred only once each in Ste-
fanowitsch (2006c). The rest then form a set of expressions with relatively simple
structures that we can extract using the following queries (shown here for the
noun happiness):
The paper also lists a number of metaphorical patterns that describe an increasing pressure, e.g. [NP_EMOT build inside NP_EXP], or an overflowing, e.g. [NP_EXP brim over with NP_EMOT], but let us focus on those patterns that describe a sudden failure to contain the substance. These are listed in (6):
We can extract these patterns using the following queries (shown, again, for
happiness):
e. [hw="(outburst|burst|eruption|explosion)"] [word="of"%c]
[word="happiness"%c & pos=".*NN.*"]
Table 11.9: The metaphor emotions are a substance filling the experiencer (BNC)

filling/fullness metaphors

Emotion      yes: Obs. (Exp.)   𝜒²      no: Obs. (Exp.)     𝜒²      Total
anger          29 (21.07)       2.99    3512 (3519.93)      0.02     3541
desire          7 (30.12)      17.74    5055 (5031.88)      0.11     5062
disgust         6  (3.63)       1.55     604  (606.37)      0.01      610
fear           47 (42.62)       0.45    7117 (7121.38)      0.00     7164
happiness      15  (9.61)       3.02    1601 (1606.39)      0.02     1616
pride          11 (16.21)       1.68    2714 (2708.79)      0.01     2725
sadness        13  (4.47)      16.29     738  (746.53)      0.10      751
shame          11 (11.27)       0.01    1883 (1882.73)      0.00     1894
bursting/exploding metaphors

Emotion      yes: Obs. (Exp.)   𝜒²      no: Obs. (Exp.)     𝜒²      Total
anger          42  (9.40)     113.12    3499 (3531.60)      0.30     3541
desire          4 (13.43)       6.62    5058 (5048.57)      0.02     5062
disgust         1  (1.62)       0.24     609  (608.38)      0.00      610
fear            1 (19.01)      17.06    7163 (7144.99)      0.05     7164
happiness       2  (4.29)       1.22    1614 (1611.71)      0.00     1616
pride          12  (7.23)       3.14    2713 (2717.77)      0.01     2725
sadness         0  (1.99)       1.99     751  (749.01)      0.01      751
shame           0  (5.03)       5.03    1894 (1888.97)      0.01     1894
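The expected frequencies and 𝜒²-components in these tables can be recalculated from the marginal totals; here is a minimal sketch in R, using the counts from the bursting/exploding sub-table:

# Observed frequencies from the bursting/exploding table above
obs <- matrix(c(42, 3499,
                 4, 5058,
                 1,  609,
                 1, 7163,
                 2, 1614,
                12, 2713,
                 0,  751,
                 0, 1894),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("anger", "desire", "disgust", "fear",
                                "happiness", "pride", "sadness", "shame"),
                              c("yes", "no")))
# Expected frequency of a cell: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)                       # the Exp. values in the table
round((obs - expected)^2 / expected, 2)  # the individual chi-square components
sum((obs - expected)^2 / expected)       # the overall chi-square value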
1 ts for the year to March . The price has [risen] by 119p since Caradon announced a possi
2 was more than accounted for by a £1.6m [rise] in its Channel 4 subscription and by £
3 aid that the top rate of income tax will [rise] to 50 p.c. on income , after allowances
4 ay . The Pearl Investor Confidence Index [rose] by 1.2 p.c. last month -- its largest
5 December , this figure was said to have [risen] to 17,000 a month at $25,000 ( £14,200
6 by shoppers APPLICATIONS for credit have [risen] sharply in the wake of the Tory 's elec
7 quarter of the year , despite a 15 p.c. [rise] in sales to $373m ( £214m ) . City : Q
8 rman of United Biscuits , saw his salary [rise] from £233,000 to £425,000 last year .
9 nce of the quality food retailers with a [rise] of almost 25% . Liz Dolan 's Surrey bui
10 s . They gave themselves an average 1992 [rise] of 5% , says the IoD , and a lot got ev
11 75p , while Fuller Smith , with profits [rising] to £3.75m ( £3.6m ) at midway , climb
12 LJ from SmithKline Beecham helped shares [rise] 23p to 248p . Merrydown is financing th
13 op up at Wessex WESSEX Water saw profits [rise] 11.3% to £44.3m in the first half help
14 s interims tomorrow , with a 10% profits [rise] to £101m expected . Northumbrian Water
15 es added 2p to 270p after a 10% dividend [rise] to 3.3p . Property divi slide GREAT Por
16 es on the dairy industry . Welcoming the [rise] in retail sales figures for October , C
17 isation of Petroleum Exporting Countries [rose] 100,000 barrels per day to 24.2 million
18 £13 barrier but Glaxo failed to hold a [rise] above £8 . To a degree some of the ris
19 he jobs queue since unemployment started [rising] in April 1990 . Employment Secretary Gi
20 parts of the country , with the biggest [rises] again in London and the SouthEast , fol
First, the concordance corroborates our suspicion that rise is not used literally in the domain of economics. All 20 hits refer not to vertical motion, but to an increase in quantity, i.e., they instantiate the metaphor more is up. This is true both of the verbal uses (in lines 1, 3, 4, 5, 6, 8, 11, 12, 13, 17, and 19) and of the nominal uses in the remaining lines. Second, the nouns in the surrounding context show
It is left as an exercise for the reader to test this distribution for significance
using the 𝜒² test or a similar test (but use a separate sheet of paper, as the margin
of this page will be too small to contain your calculations). If more such differ-
ences can be found, this might suggest that the metaphor “increase in quantity
is upward motion” is associated more strongly with spending money than with
making money.
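For checking your calculation, here is a minimal sketch in R; the counts are placeholders only and must be replaced with the frequencies from the distribution discussed above:

# Placeholder 2-by-2 table: rows are contexts (e.g. spending vs. earning),
# columns are presence vs. absence of the metaphorical expression
counts <- matrix(c(10, 5,
                    5, 10),
                 nrow = 2, byrow = TRUE)
chisq.test(counts, correct = FALSE)  # without Yates' continuity correction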
This case study demonstrates that central metaphors for a given target domain
can be identified by applying a keyword analysis to a specialized corpus of texts
from that domain. The case study does not discuss a particular research question,
but obviously, the method is useful in the context of many different research
designs. Of course, it requires specialized corpora for the target domain under
investigation. Such corpora are not available (and in some cases not imaginable)
for all target domains, so the method works better for some target domains (such
as economics) than for others (like emotions).
there is a wide range of devices that mark non-literal language more or less ex-
plicitly (as in metaphorically/figuratively speaking, picture NP as NP, so to speak/
say):
Wallington et al. (2003) investigate the extent to which these devices, which
they call metaphoricity signals, correlate systematically with the occurrence of
metaphorical expressions in language use. They find no strong correlation, but
as they note, this may well be due to various aspects of their design. First, they
adopt a very broad view of what constitutes a metaphoricity signal, including
expressions like a kind/type/sort of, not so much NP as NP and even prefixes like
super-, mini-, etc. While some or all of these signals may have an affinity to
certain kinds of non-literal language, one would not really consider them to be
metaphoricity signals in the same way as those in (9a–c). Second, they investi-
gate a carefully annotated, but very small corpus. Third, they do not distinguish
between strongly conventionalized metaphors, which are found in almost every
utterance and are thus unlikely to be explicitly signaled, and weakly convention-
alized metaphors, which seem, a priori, more likely to be signaled explicitly.
More restricted case studies are needed to determine whether the idea of met-
aphoricity signals is, in principle, plausible. Let us look at what is intuitively
the clearest case of such a signal on Wallington et al.’s list: the sentence adver-
bials metaphorically speaking and figuratively speaking. As a control, let us use
the roughly equally frequent sentence adverbial technically speaking, which does
not signal metaphoricity but which can, of course, co-occur with (conventional-
ized) metaphors and which can thus serve as a baseline.
There are 22 cases of technically speaking in the BNC:
(10) a. Do you mind if, technically speaking, I resign rather than you sack me?
(BNC A0F)
b. Technically speaking as long as nobody was hurt, no injuries, no dam-
age to the other vehicle, this is not an accident. (BNC A5Y)
c. [T]echnically speaking, [...] if you put her out into the road she would
have no roof over her head and we should have to take her in. (BNC
AC7)
Taking a generous view, four of these are part of a clause that arguably con-
tains a metaphor: (10f) uses hold as part of the phrase hold liable, instantiating
a metaphor like “believing something about someone is holding them” (cf. also
hold s.o. responsible/accountable, hold in high esteem); (10h) uses the verb evolve
metaphorically to refer to a non-evolutionary development and then uses the
spatial expressions towards and opposite direction metaphorically to describe the
quality of the development; (10r) uses provide as part of the phrase provide em-
ployment, which instantiates a metaphor like “causing someone to be in a state is
transferring an object to them” (cf. also provide s.o. with an opportunity/insight/
power...); (10t) contains the spatial preposition off as part of the phrase off duty,
which could be said to instantiate the metaphor “a situation is a location”. Note
that all four expressions involve highly conventionalized metaphors that would
hardly be noticed as such by speakers.
There are 7 hits for the sentence adverbial metaphorically speaking in the BNC:
(11) a. A convicted mass murderer has, for the second time, bloodied the nose,
metaphorically speaking, of Malcolm Rifkind, the Secretary of State for
Scotland, by successfully pursuing a claim for damages. (BNC A3G)
b. Yet, when I was seven years old, I should have thought him a very silly
little boy indeed not to have understood about metaphorically speak-
ing, even if he had never heard of it, and it does seem that what he pos-
sessed in the way of scientific approach he lacked in common sense.
(BNC AC7)
c. Good caddies have good temperaments. Just watch Ian Wright getting
a lambasting from Seve Ballesteros and see if Ian ever answers back,
or, indeed, reacts in any way other than to quietly stand and take it on
the chin, metaphorically speaking of course. (BNC ASA)
d. Family [are] a safe investment, but in love you can make a killing
overnight. Metaphorically speaking, I hasten to add. (BNC BMR)
e. Metaphorically speaking, the research front is a frozen moment in time
[...]. (BNC HPN)
f. Gregory put the boot in... metaphorically speaking! (BNC K25)
g. Mr Allenby are you ready to burst into song? Metaphorically speaking.
(BNC KM7)
In clear contrast to technically speaking, six of these seven hits occur in clauses
that contain a metaphor: bloody the nose of sb in (11a) means ‘be successful in
court against sb’, instantiating the metaphor legal fight is physical fight; take
it on the chin in (11c) means ‘endure being criticized’, instantiating the metaphor
argument is physical fight; make a killing in (11d) means ‘be financially suc-
cessful’, instantiating the metaphor commercial activity is a hunt; a frozen
moment in time in (11e) means ‘documentation of a particular state’, instantiat-
ing the metaphor time is a flowing body of water; put the boot in in (11f)
means ‘treat sb cruelly’, instantiating the metaphor life (or sports) is physical
fight; burst into song in (11g) means “take one’s turn speaking”, instantiating
the metaphor speaking is singing. The only exception is (11b); this is a meta-
linguistic use, indicating that someone did not understand that an utterance was
meant metaphorically, rather than marking an utterance as metaphorical.
There are 13 hits for figuratively speaking in the BNC:
(12) a. The darts, the lumps of poison and the raw materials from which it is
extracted all provide a challenge for others with a taste (figuratively
speaking) for excitement. (BNC AC9)
b. Alternatively, you could select spiky, upright plants like agaves or yuc-
cas to transport you across the world, figuratively speaking, to the great
deserts of North America. (BNC ACX)
c. Palladium, statue of the goddess Pallas (Minerva) at Troy on which the
city’s safety was said to depend, hence, figuratively speaking, the Bar
seen as a bulwark of society. (BNC B0Y)
d. Figuratively speaking, who would not give their right arm to find such
a love? (BNC B21)
e. [I]t is surprising to me that this process was ever permitted on this site
at all (being figuratively speaking within arms length of the dwellings).
(BNC B2D)
f. Figuratively speaking, we also make the law of value serve our aims.
(BNC BMA)
g. This schlocky international movie, photographed in eye-straining col-
our, cashing in (figuratively speaking) on the craze for James Bond pic-
tures [...]. (BNC C9U)
h. He said: ‘I’m not sure if the princess held a gun to Charles’s head, fig-
uratively speaking, but it seems if she wanted something said.’ (BNC
CBF)
i. Let’s pick someone completely at random, now we’ve had Tracey fig-
uratively speaking! (BNC F7U)
the latter are a metaphoricity signal, they should occur significantly more fre-
quently in metaphorical contexts than the former. Table 11.12 shows the tabulated
results from the discussion above, subsuming metaphors and metonymies under
figurative. The expected difference between contexts is clearly there, and sta-
tistically highly significant (𝜒² = 21.66, df = 1, 𝑝 < 0.001, 𝜙 = 0.7182).
Table 11.12: Literal and figurative utterances containing the sentence adverbials metaphorically/figuratively speaking and technically speaking (BNC)

                            Sentence adverbial
                       met./fig.      technically     Total
Utterance  figurative   18 (10.48)      4 (11.52)       22
          ¬figurative    2  (9.52)     18 (10.48)       20
Total                   20             22               42
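For readers who want to reproduce these figures, a minimal sketch in R; the phi coefficient is the square root of 𝜒²/𝑁:

# The 2-by-2 table from Table 11.12
signals <- matrix(c(18,  4,
                     2, 18),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Utterance = c("figurative", "other"),
                                  Adverbial = c("met./fig.", "technically")))
result <- chisq.test(signals, correct = FALSE)
result                                 # X-squared = 21.66, df = 1, p < 0.001
sqrt(result$statistic / sum(signals))  # the phi coefficient, 0.7182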
Of course, the question remains why some metaphors should be explicitly signaled while the majority are not. For example, we might suspect that metaphorical expressions are more likely to be explicitly signaled in contexts in which they might be interpreted literally. This may be the case for put the boot in in (11f), which occurs in a description of a rugby game, where one could potentially misread it as a statement that someone was actually kicked. Alternatively (or additionally), a metaphor may be signaled explicitly if its specific phrasing is more likely to be used in literal contexts. This may be the case with hold a gun to sb’s head in (12h): there are ten hits for this phrase in the BNC, only one of which is metaphorical. Again, which of these hypotheses (if any) is correct would have to be studied more systematically.
This case study found a clear effect where the authors of the study it is based on did not. This demonstrates the need to formulate specific predictions concerning the behavior of specific linguistic items in such a way that they can be tested systematically and the results evaluated statistically. The study also shows that the area of metaphoricity signals is worthy of further investigation.
[Figure: only the panel labels are recoverable – telegraph (𝑛 = 68), guardian (𝑛 = 136)]
should be, so this remains a promising area of research). The case study also demonstrates the need to include a control sample in corpus-linguistic designs (if this still needed to be demonstrated at this point).
                           Newspaper
                     guardian        ¬guardian       Total
Pattern  liquid met.    10 (14.00)      11  (7.00)      21
        ¬liquid met.   126 (122.00)     57 (61.00)     183
Total                  136              68             204
11.2.4 Metonymy
11.2.4.1 Case study: Subjects of the verb bomb
This chapter was concerned with metaphor, but touched upon metonymy in Case
Study 11.2.3.2. While metaphor and metonymy are different phenomena, they are
related by virtue of the fact that both of them are cases of non-literal language,
and they tend to be of interest to the same groups of researchers, so let us finish
the chapter with a short case study of metonymy, if only to see to what extent
the methods introduced above can be transferred to this phenomenon.
Following Lakoff & Johnson (1980: 35), metonymy is defined in a broad sense
here as “using one entity to refer to another that is related to it” (this includes
what is often called synecdoche, see Seto (1999) for critical discussion). Textbook examples are the following from Lakoff & Johnson (1980: 35, 39):

(13) a. The ham sandwich is waiting for his check.
     b. Nixon bombed Hanoi.
In (13a), the metonym ham sandwich stands for the target expression ‘the person who ordered the ham sandwich’; in (13b), the metonym Nixon stands for the target expression ‘the air-force pilots controlled by Nixon’ (at least at first glance).
Thus, metonymy differs from metaphor in that it does not mix vocabulary from
two domains, which has consequences for a transfer of the methods introduced
for the study of metaphor in Section 11.1.
The source-domain oriented approach can be transferred relatively straightfor-
wardly – we can query an item (or set of items) that we suspect may be used as metonyms and then identify the actual metonymic uses. The main difficulty with this
approach is choosing promising items for investigation. For example, the word
sandwich occurs almost 900 times in the BNC, but unless I have overlooked one,
it is not used as a metonym even once.
A straightforward analogue to the target-domain oriented approach (i.e., meta-
phorical pattern analysis) is more difficult to devise, as metonymies do not com-
bine vocabulary from different semantic domains. One possibility would be to
search for verbs that we know or suspect to be used with metonymic subjects
and/or objects. For example, a Google search for ⟨"is waiting for (his|her|
their|the) check"⟩ turns up about 20 unique hits; most of these have people as
subjects and none of them have meals as subjects, but there are three cases that
have table as subject, as in (14):
Let us focus on the target-domain oriented perspective here, and let us use the famous example sentence in (13b) as a starting point, loosely replicating the study in Stefanowitsch (2015). According to Lakoff and Johnson, this sentence instantiates what they call the “controller for controlled” metonymy, i.e. Nixon would be a metonym for the air force pilots controlled by Nixon.¹ Thus, searching a corpus for sequences of a noun followed by the verb bomb should allow us to assess, for example, the importance of this metonymy in relation to other metonymies and literal uses.
Querying the BNC for ⟨[pos=".*NN.*"] [lemma="bomb" & pos=".*VB.*"]⟩
yields 31 hits referring to the dropping of bombs. Of these, only a single one
has the ultimate decision maker as a subject (cf. 15a). Somewhat more frequent
in subject position are countries or inhabitants of countries (5 cases) (cf. 15b, c).
Even more frequently, the organization responsible for carrying out the bombing
– e.g. an air force, or part of an air force – is chosen as the subject (9 cases) (cf.
15d, e). The most frequent case (14 hits) mentions the aircraft carrying the bombs
in subject position, often accompanied by an adjective referring to the country
whose military operates the planes (cf. 15f) or some other responsible group (cf. 15g). Finally, there are two cases where the bombs themselves occupy the subject position (cf. 15h).

¹ Alternatively, as argued by Stallard (1993), it is the predicate rather than the subject that is used metonymically in this sentence, which would make this a metonym-oriented case study.
Cases with pronouns in subject position – resulting from the query ⟨[pos=".*PNP.*"] [lemma="bomb" & pos=".*VB.*"]⟩ – have a similar distribution; again, there is only one hit with a human controller in subject position. All hits (whether with pronouns, common nouns or proper names), interestingly, have metonymic subjects – i.e., not a single example has the bomber pilot in the subject position. This is unexpected, since literal uses should be more frequent than figurative uses (it leads Stefanowitsch (2015) to reject an analysis of such sentences as metonymies altogether). On the other hand, there are cases that are plausibly analyzed as metonymies here, such as examples (15d–e), which seem to instantiate a metonymy like military unit for member of unit (i.e. whole for part), and (15f–h), which instantiate plane for pilot (i.e. instrument for controller).
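Once each hit has been annotated for its subject type, the distribution reported above is easy to tabulate; a minimal sketch in R, with a hypothetical and abbreviated annotation vector:

# One label per hit, assigned manually during annotation
# (the vector shown here is hypothetical and abbreviated)
subjects <- c("aircraft", "air force", "country", "aircraft", "bombs",
              "decision maker", "air force", "aircraft")
sort(table(subjects), decreasing = TRUE)  # frequency of each subject type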
More systematic study of such metonymies by target domain could uncover more such facts as well as contribute to a general picture of how important particular metonymies are in a particular language.
This case study sketches a potential target-oriented approach to the corpus-
based study of metonymy, along with some general questions that we might in-
vestigate using it (most obviously, the question of how central a given metonymy
is in the language under investigation). Again, metonymy is a vastly under-re-
searched area in corpus linguistics, so much work remains to be done.
12 Epilogue
In this book, I have focused on corpus linguistics as a methodology, more pre-
cisely, as an application of a general observational scientific procedure to large
samples of linguistic usage. I have refrained from placing this method in a par-
ticular theoretical framework for two reasons.
The first reason is that I am not convinced that linguistics should be focusing quite as much as it does on theoretical frameworks, rather than on linguistic description based on data. Edward Sapir famously said that “unfortunately, or luckily, no
language is tyrannically consistent. All grammars leak” (Sapir 1921: 39). This is
all the more true of formal models, which have a tendency to attempt to achieve
tyrannical consistency by pretending those leaks do not exist or, if they do exist,
are someone else’s problem. To me, and to many others whose studies I discussed
in this book, the ways grammars leak are simply more interesting than the for-
malisms that help us ignore these leaks.
The second reason is that I believe that corpus linguistics has a place in any
theoretical linguistic framework, as long as that framework has some commit-
ment to modeling linguistic reality. Obviously, the precise place, or rather, the
distance from the data analyzed using this method and the consequences of this
analysis for the model depend on the kind of linguistic reality that is being mod-
eled. If it is language use, as it usually is in historically or sociolinguistically
oriented studies, the distance is relatively short, requiring the researcher to dis-
cover the systematicity behind the usage patterns observed in the data. If it is the
mental representation of language, the length of the distance depends on your
assumptions about those representations.
Traditionally, those representations have been argued to be something fun-
damentally different from linguistic usage. It has been claimed that they are an
ephemeral “competence” based on a “universal” grammar. There is disagreement
as to the nature of this universal grammar – some claim that it is a “mental organ”
(Chomsky 1980), some imagine it as an evolved biological instinct (Pinker 1994).
But all proponents of a universal grammar are certain that mental representa-
tions of language are dependent on and responsible for linguistic usage only in
the most indirect ways imaginable, making corpora largely useless for the study
of language. As I have argued in Chapters 1 and 2, the only methodological alternative to corpus data that proponents of this view offer – i.e. introspective grammaticality judgments – suffers from all the same problems as corpus data, without offering any of the advantages.
However, more recent models do not draw as strict a line between usage and
mental representations. The Usage-Based Model (Langacker 1991) is a model of
linguistic knowledge based on the assumption that speakers initially learn lan-
guage as a set of unanalyzed chunks of various sizes (“established units”), from
which they derive linguistic representations of varying degrees of abstractness
and complexity based on formal and semantic correspondences across these units
(cf. Langacker 1991: 266f). The Emergent Grammar model is based on similar as-
sumptions but eschews abstractness altogether, viewing language as “built up
out of combinations of [...] prefabricated parts”, as “a kind of pastiche, pasted
together in an improvised way out of ready-made elements” (Hopper 1987: 144).
In these models, the corpus becomes more than just a research tool: it becomes
an integral part of a model of linguistic competence (cf. Stefanowitsch 2011). This
view is most radically expressed in the notion of “lexical priming” developed in
Hoey (2005), in which linguistic competence is seen as a mental concordance
over linguistic experience:
The notion of priming as here outlined assumes that the mind has a men-
tal concordance of every word it has encountered, a concordance that has
been richly glossed for social, physical, discoursal, generic and interper-
sonal context. This mental concordance is accessible and can be processed
in much the same way that a computer concordance is, so that all kinds
of patterns, including collocational patterns, are available for use. It simul-
taneously serves as a part, at least, of our knowledge base. (Hoey 2005:
11)
Obviously, this mental concordance would not correspond exactly to any con-
cordance derived from an actual linguistic corpus. First, because – as discussed
in Chapters 1 and 2 – no linguistic corpus captures the linguistic experience of
a given individual speaker or the “average” speaker in a speech community; sec-
ond, because the concordance that Hoey envisions is not a concordance of lin-
guistic forms but of contextualized linguistic signs – it contains all the semantic
and pragmatic information that corpus linguists have to reconstruct laboriously
in their analyses. Still, an appropriately annotated concordance from a balanced
corpus would be a reasonable operationalization of this mental concordance (cf.
also Taylor 2012).
In less radical usage-based models of language, such as Langacker’s, the corpus
is not a model of linguistic competence – the latter is seen as a consequence
of linguistic input perceived and organized by human minds with a particular
structure (such as the capacity for figure-ground categorization). The corpus is,
however, a reasonable model (or at least an operationalization) of this linguistic
input. Many of the properties of language that guide the storage of units and
the abstraction of schemas over these stored units can be derived from corpora
– frequencies, associations between units of linguistic structure, distributions of
these units across grammatical and textual contexts, the internal variability of
these units, etc. (cf. Stefanowitsch & Flach 2016 for discussion).
This view is explicitly taken in language acquisition research conducted within
the Usage-Based Model (e.g. Tomasello 2003, cf. also Dąbrowska 2001, Diessel
2004), where children’s expanding grammatical abilities, as reflected in their lin-
guistic output, are investigated against the input they get from their caretakers as
recorded in large corpora of caretaker-child interactions. The view of the corpus
as a model of linguistic input is less explicit in the work of the major theoreti-
cal proponents of the Usage-Based Model, who connect the notion of usage to
the notion of linguistic corpora only in theory. However, it is a view that offers
a tremendous potential to bring together two broad strands of research – cog-
nitive-functional linguistics (including some versions of construction grammar)
and corpus linguistics (including attempts to build theoretical models on corpus
data, such as pattern grammar (Hunston & Francis 2000) and Lexical Priming
(Hoey 2005)). These strands have developed more or less independently and their
proponents are sometimes mildly hostile toward each other over small but fun-
damental differences in perspective (see McEnery & Hardie 2012, Section 8.3 for
discussion). If they could overcome these differences, they could complement
each other in many ways, cognitive linguistics providing a more explicitly psy-
chological framework than most corpus linguists adopt, and corpus linguistics
providing a methodology that cognitive linguists serious about usage urgently
need.
Finally, in usage-based models as well as in models of language in general,
corpora can be treated as models (or operationalizations) of the typical linguistic
output of the members of a speech community, i.e. the language produced based
on their internalized linguistic knowledge. This is the least controversial view,
and the one that I have essentially adopted throughout this book. Even under
this view, corpus data remain one of the best sources of linguistic data we have –
one that can only keep growing, providing us with ever deeper insights into the
leaky, intricate, ever-changing signature activity of our species.
I hope this book has inspired you and I hope it will help you produce research
that inspires all of us.
13 Study notes
Study notes to Chapter 1
Resources
1. The British National Corpus (BNC) is available for download free of charge
from the Oxford Text Archive at http://ota.ox.ac.uk/desc/2554.
Further reading
Although it may seem somewhat dated, one of the best discussions of what ex-
actly “language” is or can be is Lyons (1981).
4. The Susanne Corpus is available with some restrictions from the Oxford
Text Archive at http://purl.ox.ac.uk/ota/1708.
Further reading
Wynne (2005) is a brief but essential freely available introduction to all aspects
of corpus development, including issues of annotation; Xiao (2008) is a compact
overview of well-known English corpora.
– The Collins Dictionary (formerly Collins COBUILD Advanced Dictio-
nary), https://www.collinsdictionary.com/
– Longman Dictionary of Contemporary English (LDCE), https://www.
ldoceonline.com
– Merriam-Webster (MW), https://www.merriam-webster.com
– Various Oxford dictionaries, including the Oxford Advanced Learn-
ers Dictionary (OALD), https://www.oxfordlearnersdictionaries.com
Further reading
A readable exposition of Popper’s ideas about falsification is his essay “Science as
falsification”, included in the collection Conjectures and Refutations (Popper 1963).
A discussion of the role of operationalization in the context of corpus-based se-
mantics is found in Stefanowitsch (2010); Wulff (2003) is a study of adjective order
in English that operationalizes a variety of linguistic constructs in an exemplary
and very transparent way. Zaenen et al. (2004) is an example of a detailed and
extensive coding scheme for animacy.
2. The IMS Open Corpus Workbench (CWB) is available for download free of charge at http://cwb.sourceforge.net/; it can be installed under all Unix-like operating systems (including Linux and Mac OS X).
Further reading
No matter what corpora and concordancing software you work with, you will
need regular expressions at some point. Information is easy to find online; I recommend the Wikipedia page as a starting point (Wikipedia contributors 2018). An
An excellent introduction to issues involved in annotating corpora is found in Ge-
offrey Leech’s contribution “Adding linguistic annotation” in Wynne (2005). An
insightful case study on working with texts in non-standardized orthographies
is found in Barnbrook (1996) (which is by now seriously dated in many respects,
but still a worthwhile read).
Further reading
Anyone serious about using statistics in their research should start with a basic
introduction to statistics, and then proceed to an introduction of more advanced
methods, preferably one that introduces a statistical software package at the same
time. For the first step, I recommend Butler (1985), a very solid introduction to sta-
tistical concepts and pitfalls specifically aimed at linguists. It is out of print, but
the author made it available for free at https://web.archive.org/web/2006052306
1748/http://uwe.ac.uk/hlss/llas/statistics-in-linguistics/bkindex.shtml. For the sec-
ond step, I recommend Gries (2013) as a package deal geared specifically towards
linguistic research questions, but I also encourage you to explore the wide range
of free or commercially available books introducing statistics with R.
Study notes to Chapter 7
If you want to learn more about association measures, Evert (2005) and the com-
panion website at http://www.collocations.de/AM/ are very comprehensive and
relatively accessible places to start. Stefanowitsch & Flach (2016) discuss corpus-
based association measures in the context of psycholinguistics.
Further reading
Grammar is a complex phenomenon investigated from very different perspec-
tives. This makes general suggestions for further reading difficult. It may be best
to start with collections focusing on the corpus-based analysis of grammar, such
as Rohdenburg & Mondorf (2003), Gries & Stefanowitsch (2006), Rohdenburg &
Schlüter (2009) or Lindquist & Mair (2004).
2. The 𝑛-gram data from the Google Books archive is available for down-
load free of charge at http://storage.googleapis.com/books/ngrams/books/
datasetsv2.html (note that the files are extremely large).
Further reading
This chapter has focused on very simple aspects of variation across text types
and a very simple notion of “text type”. Biber (1988) and Biber (1989) are good
starting points for a more comprehensive corpus-based perspective on text types.
As seen in some of the case studies in this chapter, text is frequently a proxy for
demographic properties of the speakers who have produced it, making corpus
linguistics a variant of sociolinguistics, see further Baker (2010a).
14 Statistical tables
14.1 Critical values for the chi-square test
1. Find the appropriate row for the degrees of freedom of your data.

2. Find the rightmost column listing a value smaller than the chi-square value you have calculated. At the top, it will tell you the corresponding probability of error.
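If you have R at hand, you do not need the printed table: qchisq() returns the critical value for a given probability of error and a given number of degrees of freedom. A minimal sketch:

qchisq(1 - 0.05,  df = 1)   # 3.841  – critical value for p < 0.05 at 1 df
qchisq(1 - 0.01,  df = 1)   # 6.635  – critical value for p < 0.01 at 1 df
qchisq(1 - 0.001, df = 4)   # 18.467 – critical value for p < 0.001 at 4 df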
14.3 Critical values for the Mann-Whitney test
𝑛
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
𝑚 1
2 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 3 3
3 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
4 0 1 2 3 4 4 5 6 7 8 9 10 11 11 12 13 14 15 16 17 17 18
5 2 3 5 6 7 8 9 11 12 13 14 15 17 18 19 20 22 23 24 25 27
6 5 6 8 10 11 13 14 16 17 19 21 22 24 25 27 29 30 32 33 35
7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44
8 13 15 17 19 22 24 26 29 31 34 36 38 41 43 45 48 50 53
9 17 20 23 26 28 31 34 37 39 42 45 48 50 53 56 59 62
10 23 26 29 33 36 39 42 45 48 52 55 58 61 64 67 71
11 30 33 37 40 44 47 51 55 58 62 65 69 73 76 80
12 37 41 45 49 53 57 61 65 69 73 77 81 85 89
13 45 50 54 59 63 67 72 76 80 85 89 94 98
14 55 59 64 69 74 78 83 88 93 98 102 107
15 64 70 75 80 85 90 96 101 106 111 117
16 75 81 86 92 98 103 109 115 120 126
17 87 93 99 105 111 117 123 129 135
18 99 106 112 119 125 132 138 145
19 113 119 126 133 140 147 154
20 127 134 141 149 156 163
21 142 150 157 165 173
22 158 166 174 182
23 175 183 192
24 192 201
25 211
𝑛
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2 0 0 0 0 0 0 0
3 0 0 0 1 1 1 2 2 2 2 3 3 3 4 4 4 5
4 0 0 1 1 2 2 3 3 4 5 5 6 6 7 8 8 9 9 10 10
5 0 1 1 2 3 4 5 6 7 7 8 9 10 11 12 13 14 14 15 16 17
6 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 19 21 22 23 24
7 4 6 7 9 10 12 13 15 16 18 19 21 22 24 25 27 29 30 32
8 7 9 11 13 15 17 18 20 22 24 26 28 30 32 34 35 37 39
9 11 13 16 18 20 22 24 27 29 31 33 36 38 40 43 45 47
10 16 18 21 24 26 29 31 34 37 39 42 44 47 50 52 55
11 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63
12 27 31 34 37 41 44 47 51 54 58 61 64 68 71
13 34 38 42 45 49 53 57 60 64 68 72 75 79
14 42 46 50 54 58 63 67 71 75 79 83 87
15 51 55 60 64 69 73 78 82 87 91 96
16 60 65 70 74 79 84 89 94 99 104
17 70 75 81 86 91 96 102 107 112
18 81 87 92 98 104 109 115 121
19 93 99 105 111 117 123 129
20 105 112 118 125 131 138
21 118 125 132 139 146
22 133 140 147 155
23 148 155 163
24 164 172
25 180
𝑛
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
𝑚 1
2
3 0 0 0 0 0
4 0 0 0 1 1 1 2 2 2 3 3 3 3
5 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8
6 0 1 2 2 3 4 5 5 6 7 8 8 9 10 11 12 12 13
7 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19
8 2 4 5 6 7 9 10 11 13 14 15 17 18 20 21 22 24 25
9 5 7 8 10 11 13 15 16 18 20 21 23 25 26 28 30 32
10 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
11 12 15 17 19 21 24 26 28 31 33 35 38 40 42 45
12 17 20 22 25 27 30 33 35 38 41 44 46 49 52
13 23 25 28 31 34 37 40 43 46 49 52 56 59
14 29 32 35 39 42 45 49 52 55 59 62 66
15 36 39 43 46 50 54 58 61 65 69 73
16 43 47 51 55 59 63 67 72 76 80
17 51 56 60 65 69 74 78 83 87
18 61 65 70 75 80 85 89 94
19 70 76 81 86 91 96 102
20 81 87 92 98 103 109
21 93 98 104 110 116
22 105 111 117 124
23 118 124 131
24 132 139
25 146
14.4 Critical values for Welch’s 𝑡-test

1. Find the appropriate row for the degrees of freedom of your test.

2. Find the rightmost column whose value is smaller than your 𝑡-value. At the top of the column you will find the 𝑝-value you should report.
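Again, the same values can be computed directly in R with qt(); a minimal sketch, assuming a two-tailed test:

qt(1 - 0.05/2, df = 10)   # 2.228 – critical value for p < 0.05 at 10 df
qt(1 - 0.01/2, df = 10)   # 3.169 – critical value for p < 0.01 at 10 df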
References
Altenberg, Bengt. 1980. Binominal NP’s in a thematic perspective: Genitive vs.
of–constructions in 17th century English. In Sven Jacobsen (ed.), Papers from
the Scandinavian Symposium on Syntactic Variation (Stockholm Studies in En-
glish 52), 149–172. Stockholm: Almqvist & Wiksell.
APA, American Psychiatric Association. 2000. Diagnostic and statistical manual
of mental disorders: DSM-IV-TR. 4th ed., text revision. Washington, DC: Amer-
ican Psychiatric Association.
Aston, Guy & Lou Burnard. 1998. The BNC handbook: Exploring the British Na-
tional Corpus with SARA. Edinburgh: Edinburgh University Press.
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction. Cam-
bridge & New York: Cambridge University Press.
Baayen, Harald. 2009. Corpus linguistics in morphology: Morphological produc-
tivity. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An interna-
tional handbook, vol. 2 (Handbooks of Linguistics and Communication Science
29), 899–919. Berlin & New York: De Gruyter Mouton.
Baker, Carolyn D. & Peter Freebody. 1989. Children’s first school books: Introduc-
tions to the culture of literacy. Oxford & Cambridge, MA: Blackwell.
Baker, Paul. 2010a. Sociolinguistics and corpus linguistics (Edinburgh sociolinguis-
tics). Edinburgh: Edinburgh University Press.
Baker, Paul. 2010b. Will Ms ever be as frequent as Mr? A corpus-based compari-
son of gendered terms across four diachronic corpora of British English. Gen-
der and Language 4(1). 125–149. DOI:10.1558/genl.v4i1.125
Barcelona Sánchez, Antonio. 1995. Metaphorical models of romantic love in
Romeo and Juliet. Journal of Pragmatics 24(6). 667–688. DOI:10.1016/0378-
2166(95)00007-F
Barnbrook, Geoff. 1996. Language and computers: A practical introduction to the
computer analysis of language (Edinburgh textbooks in empirical linguistics).
Edinburgh: Edinburgh University Press.
Batygin, Konstantin & Michael E. Brown. 2016. Evidence for a distant giant planet
in the solar system. The Astronomical Journal 151(2). 22. DOI:10.3847/0004-
6256/151/2/22
Caldas-Coulthard, Carmen Rosa & Rosemary Moon. 2010. ‘Curvy, hunky, kinky’:
Using corpora as tools for critical analysis. Discourse & Society 21(2). 99–133.
DOI:10.1177/0957926509353843
Charles, Walter G. & George A. Miller. 1989. Contexts of antonymous adjectives.
Applied Psycholinguistics 10(03). 357. DOI:10.1017/S0142716400008675
Charteris-Black, Jonathan. 2004. Corpus approaches to critical metaphor analysis.
Houndmills & New York: Palgrave Macmillan.
Charteris-Black, Jonathan. 2005. Politicians and rhetoric: The persuasive power of
metaphor. Houndmills & New York: Palgrave Macmillan.
Charteris-Black, Jonathan. 2006. Britain as a container: Immigration
metaphors in the 2005 election campaign. Discourse & Society 17(5). 563–581.
DOI:10.1177/0957926506066345
Chen, Ping. 1986. Discourse and particle movement in English. Studies in Lan-
guage 10(1). 79–95. DOI:10.1075/sl.10.1.05che
Cheng, Winnie. 2012. Exploring corpus linguistics: Language in action. London &
New York: Routledge.
Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.
Chomsky, Noam. 1964. [The development of grammar in child language]: Discus-
sion. Monographs of the Society for Research in Child Development 29(1). 35–42.
DOI:10.2307/1165753
Chomsky, Noam. 1972. Language and mind. Second edition. New York: Harcourt
Brace Jovanovich.
Chomsky, Noam. 1980. Rules and representations. New York: Columbia University
Press.
Christmann, Ursula, Christoph Mischo & Norbert Groeben. 2000. Components of
the evaluation of integrity violations in argumentative discussions: Relevant
factors and their relationships. Journal of Language and Social Psychology 19(3).
315–341. DOI:10.1177/0261927X00019003003
Church, Kenneth Ward, William Gale, Patrick Hanks & Donald Hindle. 1991. Us-
ing statistics in lexical analysis. In Uri Zernik (ed.), Lexical acquisition: Exploit-
ing on-line resources to build a lexicon, 115–164. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Church, Kenneth Ward & Patrick Hanks. 1990. Word association norms, mutual
information, and lexicography. Computational Linguistics 16(1). 22–29.
Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational
and Psychological Measurement 20(1). 37–46.
Colleman, Timothy. 2006. De Nederlandse datiefalternantie: Een constructioneel en
corpusgebaseerd onderzoek. Ghent: Ghent University. (Ph.D. thesis).
Deignan, Alice. 2006. The grammar of linguistic metaphors. In Anatol Stefanow-
itsch & Stefan Th. Gries (eds.), Corpus-based approaches to metaphor and
metonymy (Trends in Linguistics. Studies and Monographs [TiLSM]), 106–122.
Berlin & New York: Mouton de Gruyter. DOI:10.1515/9783110199895.106
Dewey, John. 1910. How we think. Boston, MA: D. C. Heath & Company.
Diessel, Holger. 2004. The acquisition of complex sentences (Cambridge studies in
linguistics 105). Cambridge & New York: Cambridge University Press.
Dı́az-Vera, Javier E. 2015. Love in the time of the corpora: Preferential conceptual-
izations of love in world Englishes. In Vito Pirrelli, Claudia Marzi & Marcello
Ferro (eds.), Word structure and word usage, 161–165. Pisa: CEUR Workshop
Procedings.
Dı́az-Vera, Javier E. & Rosario Caballero. 2013. Exploring the feeling-emotions
continuum across cultures: Jealousy in English and Spanish. Intercultural Prag-
matics 10(2). 265–294. DOI:10.1515/ip-2013-0012
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coinci-
dence. Computational Linguistics 19(1). 61–74.
Emons, Rudolf. 1997. Corpus linguistics: Some basic problems. Studia Anglica Posnaniensia XXXII. 61–68.
Ericsson, Göran & Thomas A. Heberlein. 2002. “Jägare talar naturens språk”
(Hunters speak nature’s language): A comparison of outdoor activities and
attitudes toward wildlife among Swedish hunters and the general public.
Zeitschrift für Jagdwissenschaft 48(S1). 301–308. DOI:10.1007/BF02192422
Evert, Stefan. 2005. The statistics of word cooccurrences: Word pairs and collo-
cations. Stuttgart: Institut für maschinelle Sprachverarbeitung, University of
Stuttgart. (Ph.D. thesis). http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/.
Evert, Stefan & Andrew Hardie. 2011. Twenty-first century Corpus Workbench:
Updating a query architecture for the new millennium. In Proceedings of the
Corpus Linguistics 2011 Conference. Birmingham: University of Birmingham.
Fellbaum, Christiane. 1995. Co-occurrence and antonymy. International Journal
of Lexicography 8(4). 281–303. DOI:10.1093/ijl/8.4.281
Fellbaum, Christiane. 1998. WordNet: an electronic lexical database. Cambridge,
MA: The MIT Press.
Fillmore, Charles. 1992. “Corpus linguistics” or “computer-aided armchair linguis-
tics”. In Jan Svartvik (ed.), Directions in corpus linguistics: Proceedings of Nobel
Symposium 82, Stockholm, 4–8 August 1991 (Trends in Linguistics. Studies and
Monographs 65), 35–60. Berlin & New York: Mouton de Gruyter.
Firth, John Rupert. 1957. Papers in Linguistics 1934–1951. London: Oxford Univer-
sity Press.
Goldberg, Adele E. 1995. Constructions: A construction grammar approach to ar-
gument structure (Cognitive theory of language and culture). Chicago: The
University of Chicago Press.
Goschler, Juliana, Till Woerfel, Anatol Stefanowitsch, Heike Wiese & Christoph
Schroeder. 2013. Beyond conflation patterns: The encoding of motion events
in Kiezdeutsch. Yearbook of the German Cognitive Linguistics Association 1(1).
237–254. DOI:10.1515/gcla-2013-0013
Grafmiller, Jason. 2014. Variation in English genitives across modal-
ity and genres. English Language and Linguistics 18(03). 471–496.
DOI:10.1017/S1360674314000136
Green, Lisa J. 2002. African American English: A linguistic introduction. Cam-
bridge & New York: Cambridge University Press.
Greenberg, Steven, Hannah Carvey, Leah Hitchcock & Shuangyu Chang. 2003.
Temporal properties of spontaneous speech: A syllable-centric perspective.
Journal of Phonetics 31(3-4). 465–485. DOI:10.1016/j.wocn.2003.09.005
Greenberg, Steven, Joy Hollenback & Dan Ellis. 1996. Insights into spoken lan-
guage gleaned from phonetic transcription of the Switchboard corpus. In Pro-
ceedings of the Fourth International Conference on Spoken Language Processing,
vol. 1, 32–35. Philadelphia.
Gries, Stefan Th. 2001. A corpus-linguistic analysis of English -ic vs. -ical adjec-
tives. ICAME Journal 25. 65–108.
Gries, Stefan Th. 2002. Evidence in Linguistics: Three approaches to genitives in
English. In Ruth M. Brend & William J. Sullivan (eds.), LACUS Forum XXVIII:
What constitutes evidence in linguistics?, 17–31. Fullerton, CA: LACUS.
Gries, Stefan Th. 2003a. Multifactorial analysis in corpus linguistics: A study of
particle placement (Open linguistics series). New York: Continuum.
Gries, Stefan Th. 2003b. Testing the sub-test: An analysis of English -ic
and -ical adjectives. International Journal of Corpus Linguistics 8(1). 31–61.
DOI:10.1075/ijcl.8.1.02gri
Gries, Stefan Th. 2004. Some characteristics of English morphological blends. In
Mary A. Andronis, Erin Debenport, Anne Pycha & Keiko Yoshimura (eds.),
Papers from the 38th Regional Meeting of the Chicago Linguistics Society. Vol. II:
The Panels, 201–216. Chicago: Chicago Linguistic Society.
Gries, Stefan Th. 2005. Syntactic priming: A corpus-based approach. Journal of
Psycholinguistic Research 34(4). 365–399. DOI:10.1007/s10936-005-6139-3
Gries, Stefan Th. 2013. Statistics for linguistics with R: A practical introduction. 2nd
revised edition. Berlin: De Gruyter Mouton.
Gries, Stefan Th. & Martin Hilpert. 2010. Modeling diachronic change in
the third person singular: A multifactorial, verb- and author-specific ex-
ploratory approach. English Language and Linguistics 14(03). 293–320.
DOI:10.1017/S1360674310000092
Gries, Stefan Th. & Naoki Otani. 2010. Behavioral profiles: A corpus-based per-
spective on synonymy and antonymy. ICAME Journal 34. 121–150.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analy-
sis: A corpus-based perspective on “alternations”. International Journal of Cor-
pus Linguistics 9(1). 97–129. DOI:10.1075/ijcl.9.1.06gri
Gries, Stefan Th. & Anatol Stefanowitsch (eds.). 2006. Corpora in cognitive linguis-
tics. Corpus-based approaches to syntax and lexis (Trends in Linguistics. Studies
and Monographs [TiLSM]). Berlin & New York: Mouton de Gruyter.
Güldenring, Barbara Ann. 2017. Emotion metaphors in new Englishes: A
corpus-based study of ANGER. Cognitive Linguistic Studies 4(1). 82–109.
DOI:10.1075/cogls.4.1.05gul
Guz, Wojciech. 2009. English affixal nominalizations across language
registers. Poznań Studies in Contemporary Linguistics 45(4). 461–485.
DOI:10.2478/v10010-009-0030-6
Halliday, Michael A. K. 1961. Categories of the theory of grammar. Word 17. 241–
92.
Herrmann, Konrad. 2011. Hardness testing principles and applications. Materials
Park, Ohio: ASM International.
Heyd, Theresa. 2016. Narratives of belonging in the digital diaspora: Corpus ap-
proaches to a cultural concept. Open Linguistics 2(1). DOI:10.1515/opli-2016-
0013
Hilpert, Martin. 2008. Germanic future constructions: A usage-based approach to
language change (Constructional approaches to language 7). Amsterdam &
Philadelphia: John Benjamins.
Hilpert, Martin. 2015. Constructional change in English: Developments in allomor-
phy, word formation, and syntax. 1st paperback edition (Studies in English lan-
guage). Cambridge: Cambridge University Press.
Hoey, Michael. 2005. Lexical priming: A new theory of words and language. Lon-
don & New York: Routledge.
Hoffmann, Sebastian. 2004. Using the OED quotations database as a corpus: A
linguistic appraisal. ICAME Journal 28. 17–30.
Hoffmann, Thomas. 2014. The cognitive evolution of Englishes: The role of
constructions in the dynamic model. In Sarah Buschfeld, Thomas Hoffmann,
Magnus Huber & Alexander Kautzsch (eds.), Varieties of English around
the world, vol. G49, 160–180. Amsterdam & Philadelphia: John Benjamins.
DOI:10.1075/veaw.g49.10hof
Hopper, Paul. 1987. Emergent grammar. In Proceedings of the Thirteenth Annual
Meeting of the Berkeley Linguistics Society, 139–157. Berkeley: Berkeley Linguis-
tics Society. DOI:10.3765/bls.v13i0.1834
Hsu, Hui-Chin, Alan Fogel & Rebecca B. Cooper. 2000. Infant vocal development
during the first 6 months: Speech quality and melodic complexity. Infant and
Child Development 9(1). 1–16. DOI:10.1002/(SICI)1522-7219(200003)9:1<1::AID-
ICD210>3.0.CO;2-V
Hundt, Marianne. 1997. Has British English been catching up with American
English over the past thirty years? In Magnus Ljung (ed.), Corpus-Based
Studies in English: Papers from the Seventeenth International Conference on
English-Language Research Based on Computerized Corpora, 135–151. Amster-
dam: Rodopi.
Hundt, Marianne. 2009. Colonial lag, colonial innovation or simply language
change? In Günter Rohdenburg & Julia Schlüter (eds.), One language, two gram-
mars?, 13–37. Cambridge: Cambridge University Press.
Hunston, Susan. 2007. Semantic prosody revisited. International Journal of Corpus
Linguistics 12(2). 249–268. DOI:10.1075/ijcl.12.2.09hun
Hunston, Susan & Gill Francis. 2000. Pattern grammar: A corpus-driven approach
to the lexical grammar of English (Studies in corpus linguistics 4). Amsterdam
& Philadelphia: John Benjamins.
Jackendoff, Ray. 1994. Patterns in the mind: Language and human nature. New
York: BasicBooks.
Jäkel, Olaf. 1997. Metaphern in abstrakten Diskurs-Domänen: Eine kognitiv-
linguistische Untersuchung anhand der Bereiche Geistestätigkeit, Wirtschaft und
Wissenschaft (Duisburger Arbeiten zur Sprach- und Kulturwissenschaft, Duis-
burg papers on research in language and culture Bd. 30). Frankfurt am Main
& New York: Peter Lang.
Jankowski, Bridget L. & Sali A. Tagliamonte. 2014. On the genitive’s trail: Data
and method from a sociolinguistic perspective. English Language and Linguis-
tics 18(2). 305–329. DOI:10.1017/S1360674314000045
Jespersen, Otto. 1909. A modern English grammar on historical principles (volumes
1–7). Heidelberg: Winter.
Johansson, Stig & Knut Hofland. 1989. Frequency analysis of English vocabulary
and grammar: Tag frequencies and word frequencies. Vol. 1. Oxford: Clarendon
Press.
Johnson, Wendell. 1944. I. A program of research. Psychological Monographs 56(2).
1–15. DOI:10.1037/h0093508
Jucker, Andreas H. 1993. The genitive versus the of-construction in British news-
papers. In Andreas H. Jucker (ed.), The noun phrase in English: Its structure and
variability (Anglistik und Englischunterricht 49), 121–136. Heidelberg: Winter.
Justeson, John S. & Slava M. Katz. 1991. Co-occurrences of antonymous adjectives
and their contexts. Computational Linguistics 17(1). 1–19.
Justeson, John S. & Slava M. Katz. 1992. Redefining antonymy: The textual struc-
ture of a semantic relation. Literary and Linguistic Computing 7(3). 176–184.
DOI:10.1093/llc/7.3.176
Kaunisto, Mark. 1999. Electric/electrical and classic/classical: Variation
between the suffixes -ic and -ical. English Studies 80(4). 343–370.
DOI:10.1080/00138389908599189
Kennedy, Graeme D. 1998. An introduction to corpus linguistics. London & New
York: Longman.
Kennedy, Graeme D. 2003. Amplifier collocations in the British National Cor-
pus: Implications for English language teaching. TESOL Quarterly 37(3). 467–487.
DOI:10.2307/3588400
Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan
Michelfeit, Pavel Rychlý & Vít Suchomel. 2014. The Sketch Engine: Ten years
on. Lexicography 1(1). 7–36. DOI:10.1007/s40607-014-0009-9
Kjellmer, Göran. 1986. “The lesser man”: Observations on the role of women in
modern English writing. In Jan Aarts & Willem Meijs (eds.), Corpus linguistics
II: New studies in the analysis and exploitation of computer corpora, 163–176.
Amsterdam: Rodopi.
Kjellmer, Göran. 2003. Hesitation. In defence of ER and ERM. English Studies
84(2). 170–198. DOI:10.1076/enst.84.2.170.14903
Koller, Veronika. 2004. Metaphor and gender in business media discourse: A critical
cognitive study. New York: Palgrave Macmillan.
Kuhn, Thomas S. 1962. The structure of scientific revolutions. Chicago: The Uni-
versity of Chicago Press.
Labov, William. 1996. When intuitions fail. In Lisa McNair, Kora Singer, Lise
Dobrin & Michelle AuCoin (eds.), Papers from the parasession on theory and data
in linguistics, vol. 32, 77–106. Chicago, IL: Chicago Linguistic Society.
Labov, William, Sharon Ash & Charles Boberg. 2006. The atlas of North American
English: Phonetics, phonology, and sound change. A multimedia reference tool.
Berlin & New York: Mouton de Gruyter.
Lakoff, George. 1993. The contemporary theory of metaphor. In Andrew Ortony
(ed.), Metaphor and thought, 2nd edn., 202–252. Cambridge: Cambridge Uni-
versity Press.
Lakoff, George. 2004. Re: Empirical methods in Cognitive Linguistics. http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0407&L=COGLING&F=&S=&P=2918.
Lakoff, George & Mark Johnson. 1980. Metaphors we live by. Chicago: The Uni-
versity of Chicago Press.
Lakoff, George & Zoltán Kövecses. 1987. The cognitive model of anger inherent in
American English. In Dorothy Holland & Naomi Quinn (eds.), Cultural models
in language and thought, 195–221. Cambridge: Cambridge University Press.
Lakoff, Robin. 1973. Language and woman’s place. Language in Society 2(1). 45–
80.
Langacker, Ronald W. 1973. Language and its structure: Some fundamental linguis-
tic concepts. 2nd edn. New York: Harcourt Brace Jovanovich.
Langacker, Ronald W. 1991. Concept, image, and symbol: The cognitive basis of
grammar (Cognitive linguistics research 1). Berlin: Mouton de Gruyter.
Lautsch, Erwin, Gustav A. Lienert & Alexander von Eye. 1988. Strategische Über-
legungen zur Anwendung der Konfigurationsfrequenzanalyse. EDV in Medizin
und Biologie 19(1). 26–30.
Lee, David Y. W. 2001. Genres, registers, text types, domains, and styles: Clari-
fying the concepts and navigating a path through the BNC jungle. Language
Learning & Technology 5(3). 37–72.
Leech, Geoffrey N. & Roger Fallon. 1992. Computer corpora: What do they tell
us about culture? ICAME Journal 16. 29–50.
Leech, Geoffrey N., Roger Garside & Michael Bryant. 1994. CLAWS4: The tag-
ging of the British National Corpus. In Proceedings of the 15th International
Conference on Computational Linguistics, 622–628. Kyoto.
Leech, Geoffrey N. & Nicholas Smith. 2006. Recent grammatical change in writ-
ten English 1961–1992: Some preliminary findings of a comparison of Amer-
ican with British English. In Antoinette Renouf & Andrew Kehoe (eds.), The
changing face of corpus linguistics (Language and Computers 55), 185–204. Am-
sterdam: Rodopi.
Levin, Magnus. 2014. The Bathroom Formula: A corpus-based study of a
speech act in American and British English. Journal of Pragmatics 64. 1–16.
DOI:10.1016/j.pragma.2014.01.001
Liberman, Mark. 2005. What happened to the 1940s? Blog. http://itre.cis.upenn.
edu/~myl/languagelog/archives/002397.html.
Liberman, Mark. 2012. Historical culturomics of pronoun frequencies. Blog. http://languagelog.ldc.upenn.edu/nll/?p=4126.
Lindquist, Hans & Christian Mair (eds.). 2004. Corpus approaches to grammatical-
ization in English (Studies in corpus linguistics 13). Amsterdam & Philadelphia:
John Benjamins.
Lindsay, Mark. 2011. Rival suffixes: Synonymy, competition, and the emergence
of productivity. In Angela Ralli, Geert E. Booij, Sergio Scalise & Athanasios
Karasimos (eds.), Morphology and the architecture of grammar: On-line proceed-
ings of the Eighth Mediterranean Morphology Meeting, 192–203. Patras: Univer-
sity of Patras.
Lindsay, Mark & Mark Aronoff. 2013. Natural selection in self-organizing mor-
phological systems. In Nabil Hathout, Fabio Montermini & Jesse Tseng
(eds.), Morphology in Toulouse. Selected proceedings of Décembrettes 7, 133–153.
München: Lincom Europa.
Liu, Dilin. 2010. Is it a chief, main, major, primary, or principal concern? A corpus-
based behavioral profile study of the near-synonyms. International Journal of
Corpus Linguistics 15(1). 56–87. DOI:10.1075/ijcl.15.1.03liu
Lohmann, Arne. 2013. Constituent order in coordinate constructions: A processing
perspective. Hamburg: Universität Hamburg. (Dissertation).
Louw, Bill & Carmela Chateau. 2010. Semantic prosody for the 21st Century: Are
prosodies smoothed in academic contexts? A contextual prosodic theoretical
perspective. In Sergio Bolasco, Isabella Chiari & Luca Giuliano (eds.), Statistical
analysis of textual data, 755–764. Rome: Sapienza Università di Roma.
Louw, William E. 1993. Irony in the text or insincerity in the writer? The diag-
nostic potential of semantic prosodies. In Mona Baker, Gill Francis & Elena
Tognini-Bonelli (eds.), Text and technology: In honour of John Sinclair, 157–176.
Amsterdam & Philadelphia: John Benjamins.
Lyons, John. 1981. Language and linguistics: An introduction. Cambridge & New
York: Cambridge University Press.
Macmillan, Harold. 1961. House of Commons debate on foreign affairs, 18 Oc-
tober 1961. In Commons and Lords Hansard, the Official Report of debates in
Parliament, vol. 646, cc177–319. London: UK Parliament.
Mair, Christian. 2004. Corpus linguistics and grammaticalisation theory: Statis-
tics, frequencies, and beyond. In Hans Lindquist & Christian Mair (eds.), Cor-
pus approaches to grammaticalization in English (Studies in Corpus Linguistics
13), 121–150. Amsterdam & Philadelphia: John Benjamins.
Manning, Christopher D. & Hinrich Schütze. 1999. Foundations of statistical nat-
ural language processing. Cambridge, MA: MIT Press.
Manning, Susan Karp & Maria Parra Melchiori. 1974. Words that upset urban
college students: Measured with GSRs and rating scales. The Journal of Social
Psychology 94(2). 305–306. DOI:10.1080/00224545.1974.9923225
Marco, Maria José Luzon. 2000. Collocational frameworks in medical re-
search papers: A genre-based study. English for Specific Purposes 19(1). 63–86.
DOI:10.1016/S0889-4906(98)00013-1
Martin, James H. 2006. A corpus-based analysis of context effects on metaphor
comprehension. In Anatol Stefanowitsch & Stefan Th. Gries (eds.), Corpus-
based approaches to metaphor, 214–236. Berlin & New York: Mouton de
Gruyter.
Mason, Oliver & Susan Hunston. 2004. The automatic recognition of verb pat-
terns: A feasibility study. International Journal of Corpus Linguistics 9(2). 253–
270. DOI:10.1075/ijcl.9.2.05mas
Matthews, Peter H. 2014. The concise Oxford dictionary of linguistics. 3rd edn.
(Oxford paperback reference). Oxford: Oxford University Press.
McEnery, Tony & Andrew Hardie. 2012. Corpus linguistics: Method, theory and
practice (Cambridge textbooks in linguistics). Cambridge & New York: Cam-
bridge University Press.
McEnery, Tony & Andrew Wilson. 2001. Corpus linguistics: An introduction. Edin-
burgh: Edinburgh University Press.
McHugh, Mary L. 2012. Interrater reliability: The kappa statistic. Biochemia Med-
ica 22(3). 276–282.
Merriam-Webster. 2014. How does a word get into a Merriam-Webster dictionary?
FAQ. http://www.merriam-webster.com/help/faq/words_in.htm.
Meurers, W. Detmar. 2005. On the use of electronic corpora for theoretical lin-
guistics. Lingua 115(11). 1619–1639. DOI:10.1016/j.lingua.2004.07.007
Meurers, W. Detmar & Stefan Müller. 2009. Corpora and syntax. In Anke Lüdel-
ing & Merja Kytö (eds.), Corpus linguistics, vol. 2 (Handbooks of Linguistics
and Communication Science 29), 920–933. Berlin & New York: De Gruyter
Mouton.
Meyer, Charles F. 2002. English corpus linguistics: An introduction (Studies in En-
glish language). Cambridge & New York: Cambridge University Press.
Meyer, David E. & Roger W. Schvaneveldt. 1971. Facilitation in recognizing pairs
of words: Evidence of a dependence between retrieval operations. Journal of
Experimental Psychology 90(2). 227–234. DOI:10.1037/h0031564
Michaelis, Laura A. & Knud Lambrecht. 1996. Toward a construction-based the-
ory of language function: The case of nominal extraposition. Language 72(2).
215–247.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K.
Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy,
Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak & Erez Lieberman
Aiden. 2011. Quantitative analysis of culture using millions of digitized books.
Science 331(6014). 176–182. DOI:10.1126/science.1199644
Pedersen, Ted. 1996. Fishing for exactness. In Proceedings of the South-Central SAS
Users Group Conference, 188–200. Austin, TX: South-Central SAS Users Group.
Pedersen, Ted. 1998. Dependent bigram identification. In Proceedings of the Fif-
teenth National Conference on Artificial Intelligence and Tenth Innovative Appli-
cations of Artificial Intelligence Conference, 1197. Madison, WI: AAAI Press/The
MIT Press.
Pinker, Steven. 1994. The language instinct. 1st edn. New York: W. Morrow & Co.
Plag, Ingo. 1999. Morphological productivity. Structural constraints in English
derivation. Berlin & New York: Mouton de Gruyter.
Popper, Karl R. 1959. The logic of scientific discovery. London: Hutchinson.
Popper, Karl R. 1963. Conjectures and refutations: The growth of scientific knowledge.
London & New York: Routledge & Kegan Paul.
Popper, Karl R. 1970. A realist view of logic, physics, and history. In Wolfgang
Yourgrau & Allen D. Breck (eds.), Physics, logic, and history, 1–37. Boston, MA:
Springer. DOI:10.1007/978-1-4684-1749-4_1
Quirk, Randolph, Sidney Greenbaum, Geoffrey N. Leech & Jan Svartvik. 1972. A
grammar of contemporary English. London: Longman.
Quirk, Randolph, Sidney Greenbaum, Geoffrey N. Leech & Jan Svartvik. 1985. A
comprehensive grammar of the English language. London & New York: Longman.
Rayson, Paul. 2008. From key words to key semantic domains. International Jour-
nal of Corpus Linguistics 13(4). 519–549. DOI:10.1075/ijcl.13.4.06ray
Rayson, Paul, Geoffrey N. Leech & Mary Hodges. 1997. Social differentiation in
the use of English vocabulary: Some analyses of the conversational component
of the British National Corpus. International Journal of Corpus Linguistics 2(1).
133–152. DOI:10.1075/ijcl.2.1.07ray
Read, Timothy R. C. & Noel A. C. Cressie. 1988. Goodness-of-fit statistics for discrete
multivariate data. New York, NY: Springer.
Renouf, Antoinette. 1987. Lexical resolution. In Willem Meijs (ed.), Proceedings of
the Seventh International Conference on English Language Research on Comput-
erised Corpora, 121–131. Amsterdam: Rodopi.
Renouf, Antoinette & John Sinclair. 1991. Collocational frameworks in English.
In Karin Aijmer & Bengt Altenberg (eds.), English corpus linguistics: Studies in
honour of Jan Svartvik, 128–143. London: Longman.
Robinson, Andrew. 2002. Lost languages: The enigma of the world’s undeciphered
scripts. New York: McGraw-Hill.
Rohdenburg, Günter. 1995. On the replacement of finite complement
clauses by infinitives in English. English Studies 76(4). 367–388.
DOI:10.1080/00138389508598980
Sacks, Harvey, Emanuel A. Schegloff & Gail Jefferson. 1974. A simplest systemat-
ics for the organization of turn-taking for conversation. Language 50(4). 696–735.
DOI:10.2307/412243
Säily, Tanja. 2011. Variation in morphological productivity in the BNC: Sociolin-
guistic and methodological considerations. Corpus Linguistics and Linguistic
Theory 7(1). 119–141. DOI:10.1515/cllt.2011.006
Säily, Tanja & Jukka Suomela. 2009. Comparing type counts: The case of women,
men and -ity in early English letters. Language and Computers – Studies in
Practical Linguistics 69. 87–109.
Sampson, Geoffrey. 1987. Evidence against the “Grammatical”/“Ungrammatical”
distinction. In Willem Meijs (ed.), Corpus linguistics and beyond, 219–226. Am-
sterdam: Rodopi.
Sampson, Geoffrey. 1995. English for the computer: The SUSANNE corpus and ana-
lytic scheme. Oxford & New York: Clarendon Press & Oxford University Press.
Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank
project. Technical Report MS-CIS-90-47. Philadelphia: University of Pennsyl-
vania, Department of Computer & Information Science. 32.
Sapir, Edward. 1921. Language: An introduction to the study of speech. New York:
Harcourt, Brace & Co.
Schlüter, Julia. 2003. Phonological determinants of grammatical variation in En-
glish: Chomsky’s worst possible case. In Günter Rohdenburg & Britta Mondorf
(eds.), Determinants of Grammatical Variation in English, 69–118. Berlin & New
York: Mouton de Gruyter.
Schmid, Hans-Jörg. 1996. Introspection and computer corpora: The meaning and
complementation of start and begin. In Arne Zettersten & Viggo Hjørnager Pe-
dersen (eds.), Symposium on Lexicography VII. Proceedings of the Seventh Sympo-
sium on Lexicography, May 5–6, 1994 at the University of Copenhagen (Lexicographica,
Series Major 76), 223–239. Tübingen: Niemeyer.
Schmid, Hans-Jörg. 2003. Do women and men really live in different cultures?
Evidence from the BNC. In Andrew Wilson, Paul Rayson & Tony McEnery
(eds.), Corpus linguistics by the Lune: A Festschrift for Geoffrey Leech, 185–221.
Frankfurt: Peter Lang.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees.
In Proceedings of International Conference on New Methods in Language Process-
ing, 44–49. Manchester: University of Manchester.
Schmitz, Ulrich. 1983. Zählen und Erzählen: Zur Anwendung statistischer Ver-
fahren in der Textlinguistik. Zeitschrift für Sprachwissenschaft 2(1). 132–143.
DOI:10.1515/ZFSW.1983.2.1.132
Stallard, David. 1993. Two kinds of metonymy. In Proceedings of the 31st annual
meeting on Association for Computational Linguistics, 87–94. Stroudsburg: As-
sociation for Computational Linguistics. DOI:10.3115/981574.981586
Standwell, G.J.B. 1982. Genitive constructions and functional sentence perspec-
tive. IRAL: International Review of Applied Linguistics in Language Teaching
20(1-4). 257–262. DOI:10.1515/iral.1982.20.1-4.257
Steen, Gerard J., Ewa Biernacka, Aletta G. Dorst, Anna Kaal, Clara I. López Ro-
drı́guez & Trijntje Pasma. 2010. Pragglejaz in practice: Finding metaphorically
used words in natural discourse. In Graham Low, Zazie Todd, Alice Deignan
& Lynne Cameron (eds.), Researching and applying metaphor in the real world,
vol. 26 (Human Cognitive Processing), 165–184. Amsterdam & Philadelphia:
John Benjamins.
Stefanowitsch, Anatol. 2003. Constructional semantics as a limit to grammatical
alternation: The two genitives of English. In Günter Rohdenburg & Britta Mon-
dorf (eds.), Determinants of grammatical variation in English (Topics in English
Linguistics 43), 413–444. Berlin & New York: Mouton de Gruyter.
Stefanowitsch, Anatol. 2004. HAPPINESS in English and German: A
metaphorical-pattern analysis. In Michel Achard & Suzanne Kemmer
(eds.), Language, culture, and mind, 137–149. Stanford, CA: CSLI.
Stefanowitsch, Anatol. 2005. The function of metaphor: Developing a corpus-
based perspective. International Journal of Corpus Linguistics 10(2). 161–198.
DOI:10.1075/ijcl.10.2.03ste
Stefanowitsch, Anatol. 2006a. Distinctive collexeme analysis and diachrony:
A comment. Corpus Linguistics and Linguistic Theory 2(2). 257–262.
DOI:10.1515/CLLT.2006.013
Stefanowitsch, Anatol. 2006b. Negative evidence and the raw fre-
quency fallacy. Corpus Linguistics and Linguistic Theory 2(1). 61–77.
DOI:10.1515/CLLT.2006.003
Stefanowitsch, Anatol. 2006c. Words and their metaphors: A corpus-based ap-
proach. In Anatol Stefanowitsch & Stefan Th. Gries (eds.), Corpus-based ap-
proaches to metaphor (Trends in Linguistics), 61–105. Berlin & New York: Mou-
ton de Gruyter.
Stefanowitsch, Anatol. 2007a. Linguistics beyond grammaticality. Corpus Linguis-
tics and Linguistic Theory 3(1). 57–71. DOI:10.1515/CLLT.2007.004
Stefanowitsch, Anatol. 2007b. Wortwiederholungen im Englischen und
Deutschen: Eine korpuslinguistische Annäherung. In Andreas Ammann
& Aina Urdze (eds.), Wiederholung, Parallelismus, Reduplikation: Strategien
der multiplen Strukturanwendung (Diversitas Linguarum), 29–45. Bochum:
Brockmeyer.
Stefanowitsch, Anatol & Stefan Th. Gries. 2009. Corpora and grammar. In Anke
Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook,
vol. 2 (Handbooks of Linguistics and Communication Science 29), 933–952.
Berlin & New York: De Gruyter Mouton.
Strömqvist, Sven & Ludo Th. Verhoeven (eds.). 2003. Relating events in narrative:
Typological and contextual perspectives. Hillsdale, NJ: Lawrence Erlbaum Asso-
ciates.
Stubbs, Michael. 1995a. Collocations and cultural connotations of common words.
Linguistics and Education 7(4). 379–390. DOI:10.1016/0898-5898(95)90011-X
Stubbs, Michael. 1995b. Collocations and semantic profiles: On the cause of
the trouble with quantitative studies. Functions of Language 2(1). 23–55.
DOI:10.1075/fol.2.1.03stu
Subtirelu, Nicholas. 2014. Do we talk and write about men more than women? Blog.
http://linguisticpulse.com/2014/04/09/do-we-talk-and-write-about-men-more-than-women/.
Swaine, Jon. 2009. Apostrophes abolished by council. Newspaper. http://www.telegraph.co.uk/news/newstopics/howaboutthat/4388343/Apostrophes-abolished-by-council.html.
Szmrecsanyi, Benedikt. 2005. Language users as creatures of habit: A corpus-
based analysis of persistence in spoken English. Corpus Linguistics and Lin-
guistic Theory 1(1). 113–150. DOI:10.1515/cllt.2005.1.1.113
Szmrecsanyi, Benedikt. 2006. Morphosyntactic persistence in spoken English: A
corpus study at the intersection of variationist sociolinguistics, psycholinguistics,
and discourse analysis (Trends in linguistics 177). Berlin & New York: Mouton
de Gruyter.
Tagliamonte, Sali A. 2006. Analysing sociolinguistic variation (Key topics in soci-
olinguistics). Cambridge & New York: Cambridge University Press.
Taylor, John R. 2003. Near synonyms as co-extensive categories: ‘high’ and
‘tall’ revisited. Language Sciences 25(3). 263–284. DOI:10.1016/S0388-
0001(02)00018-9
Taylor, John R. 2012. The mental corpus: How language is represented in the mind.
Oxford & New York: Oxford University Press.
The Stationery Office. 1999. Deceased estates notice, Wilfrid Thomas Froggatt Cas-
tle. https://www.thegazette.co.uk/notice/L-55697-999.
Thompson, Sandra A. & Yuka Koide. 1987. Iconicity and “indirect objects” in En-
glish. Journal of Pragmatics 11(3). 399–406. DOI:10.1016/0378-2166(87)90139-1
Tissari, Heli. 2003. Lovescapes: Changes in prototypical senses and cognitive
metaphors since 1500 (Mémoires de la Société Néophilologique de Helsinki
LXII). Helsinki: Société Néophilologique.
Tissari, Heli. 2010. English words for emotions and their metaphors. In Margaret
E. Winters, Heli Tissari & Kathryn Allan (eds.), Historical cognitive linguistics
(Cognitive linguistics research 47), 298–330. Berlin & New York: De Gruyter
Mouton.
Tomasello, Michael. 2003. Constructing a language: A usage-based theory of lan-
guage acquisition. Cambridge, MA: Harvard University Press.
Trips, Carola. 2009. Lexical semantics and diachronic morphology: The development
of -hood, -dom and -ship in the history of English. Tübingen: Niemeyer.
Tummers, Jose, Kris Heylen & Dirk Geeraerts. 2005. Usage-based approaches
in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and
Linguistic Theory 1(2). 225–261. DOI:10.1515/cllt.2005.1.2.225
Turkkila, Kaisa. 2014. Do near-synonyms occur with the same metaphors: A com-
parison of anger terms in American English. metaphorik.de 25. 129–154.
Twenge, Jean M., W. Keith Campbell & Brittany Gentile. 2012. Male and female
pronoun use in U.S. books reflects women’s status, 1900–2008. Sex Roles 67(9-
10). 488–493. DOI:10.1007/s11199-012-0194-7
Vosberg, Uwe. 2003. The role of extractions and horror aequi in the evolution of
-ing complements in Modern English. In Günter Rohdenburg & Britta Mondorf
(eds.), Determinants of Grammatical Variation in English, 305–328. Berlin &
New York: Mouton de Gruyter.
Wallington, Alan, John A. Barnden, Marina A. Barnden, Fiona J. Ferguson &
Sheila R. Glasbey. 2003. Metaphoricity signals: A corpus-based investigation.
Technical Report CSRP-03-05. Birmingham: The University of Birmingham,
School of Computer Science. 1–12.
Wasow, Tom & Jennifer Arnold. 2003. Post-verbal constituent ordering in En-
glish. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of gram-
matical variation in English (Topics in English Linguistics 43), 119–154. Berlin
& New York: Mouton de Gruyter.
Wasserstein, Ronald L. & Nicole A. Lazar. 2016. The ASA’s statement on p-
values: Context, process, and purpose. The American Statistician 70(2). 129–
133. DOI:10.1080/00031305.2016.1154108
Widdowson, Henry G. 2000. On the limitations of linguistics applied. Applied
Linguistics 21(1). 3–25. DOI:10.1093/applin/21.1.3
Wiechmann, Daniel. 2008. On the computation of collostruction strength: Test-
ing measures of association as expressions of lexical bias. Corpus Linguistics
and Linguistic Theory 4(2). 253–290. DOI:10.1515/CLLT.2008.011
Wiederhorn, Sheldon M., Richard J. Fields, Samuel Low, Gun-Woong Bahng,
Alois Wehrstedt, Junhee Hahn, Yo Tomota, Takashi Miyata, Haiqing Lin,
Benny D. Freeman, Shuji Aihara, Yukito Hagihara & Tetsuya Tagawa. 2011.
Mechanical properties. In Horst Czichos, Tetsuya Saito & Leslie Smith (eds.),
Springer Handbook of Metrology and Testing, 339–452. Berlin & Heidelberg:
Springer.
Wierzbicka, Anna. 1988. The semantics of grammar (Studies in language compan-
ion series 18). Amsterdam & Philadelphia: John Benjamins.
Wierzbicka, Anna. 2003. Cross-cultural pragmatics: The semantics of human inter-
action. Berlin & New York: Mouton de Gruyter.
Wikipedia contributors. 2018. Regular expression. https://en.wikipedia.org/wiki/
Regular_expression.
Williams, Raymond. 1976. Keywords: A vocabulary of culture and society. New
York: Oxford University Press.
Winchester, Simon. 2003. The meaning of everything: The story of the Oxford En-
glish dictionary. Oxford & New York: Oxford University Press.
Wolf, Hans-Georg & Frank Polzenhagen. 2007. Fixed expressions as manifesta-
tions of cultural conceptualizations: Examples from African varieties of En-
glish. In Paul Skandera (ed.), Phraseology and culture in English (Topics in En-
glish Linguistics 54), 399–435. Berlin & New York: Mouton de Gruyter.
Wulff, Stefanie. 2003. A multifactorial corpus analysis of adjective or-
der in English. International Journal of Corpus Linguistics 8(2). 245–282.
DOI:10.1075/ijcl.8.2.04wul
Wynne, Martin (ed.). 2005. Developing linguistic corpora: A guide to good practice
(AHDS guides to good practice). Oxford & Oakville, CT: Oxbow Books.
Xiao, Richard. 2008. Well-known and influential corpora. In Anke Lüdeling &
Merja Kytö (eds.), Corpus Linguistics, vol. 1 (Handbooks of Linguistics and Com-
munication Science 29), 383–483. Berlin & New York: Walter de Gruyter.
Yarowsky, David. 1993. One sense per collocation. In Human language technology:
proceedings of a workshop held at Plainsboro, New Jersey, March 21-24, 1993, 266–
271. Association for Computational Linguistics. DOI:10.3115/1075671.1075731
Zaenen, Annie, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-
Garboden, Tatiana Nikitina, M. Catherine O’Connor & Tom Wasow. 2004. Ani-
macy encoding in English: Why and how. In Proceedings of the 2004 ACL Work-
shop on Discourse Annotation (DiscAnnotation ’04), 118–125. Stroudsburg, PA:
Association for Computational Linguistics.
Subject index
academic language, 34, 202, 203, 358–360, 422
acceptability, 75, 122, 130, 131, 274, 275, 305, 306
adjective, 48, 51, 84, 86, 108, 110, 206, 216, 218, 220, 221, 223–225, 229, 230, 232, 234, 235, 237, 241, 242, 244, 246, 249, 250, 255, 258, 259, 263, 264, 268, 269, 283, 285, 292, 293, 310, 325, 326, 367, 399, 400, 435
adposition, 51, 75, 85, 89, 108, 110, 129, 149, 206, 249, 254, 298, 299, 428
adverb, 75, 94, 110, 215, 234, 235, 249, 255, 269, 281, 282, 297, 312, 315, 367, 378, 426, 428
affix, 75, 235, 309, 310, 312, 313, 315–317, 317¹, 318, 321, 323, 325–328, 328⁴, 330, 331, 336, 338, 340–343, 345–351, 426
affix combinations, 340–343
African English, 370, 371
age, 29, 39, 45, 57, 120, 147, 186, 187, 205, 206, 212, 353, 372, 378
agent, 99, 277, 278, 387, 391
aktionsart, 276–280, 282, 283, 326, 387, 388
alternation, 90, 98, 148, 272, 276, 282, 315
ambiguity, 89, 106, 114, 128, 262, 310, 368
American English, 12, 13⁴, 30, 32, 33, 37¹, 39, 56, 58, 59, 65–73, 75–77, 107, 119, 141, 148, 224, 298, 299, 315, 316, 334, 363, 365, 367, 368, 370, 372, 378, 412
animacy, 96, 98–100, 120, 122–124, 126, 127, 129, 137, 138, 142, 144, 146, 149, 157, 158, 161–163, 170, 171, 173, 187, 188, 191, 199, 205, 237, 255, 277, 282, 436
annotation, 23, 38, 39, 41–43, 45, 46, 51, 58, 72, 73, 83, 86–89, 102, 103, 105, 106, 108–111, 113, 115, 116, 119–122, 122³, 123–129, 129⁴, 130, 131, 133–139, 141, 149, 150, 157, 158, 169, 181, 184, 188, 199, 212, 215, 240, 246, 249, 261, 277, 278, 282, 293, 295, 296, 301, 303, 332, 353, 374, 397–399, 403, 414, 417, 419, 426, 438
antitype (CFA), 211, 212, 278, 282
antonymy, 77, 235, 240–242, 244, 310, 330, 399, 425
association, 51, 122³, 182, 182⁴, 216, 220, 224–230, 232–235, 235⁶, 240–242, 249, 250, 255, 258, 263, 265–267, 269, 270, 272, 274, 276–278, 305, 313, 315, 334, 354, 357, 360, 363, 368, 370–372, 384, 387, 392, 397, 402, 407, 408, 410, 412, 419, 422, 425, 432, 439
association measure, 224–228, 230, 232, 233, 235, 235⁶, 265–267, 270, 363
authenticity, 1, 14, 18, 18⁶, 22–27, 38, 39, 43, 45, 46, 59, 117, 300, 302
bigram, 217¹, 394
binomial (coordinated nouns), 286, 288, 289
binomial test, 187
bivariate analysis, 184, 204, 205, 208, 211, 213, 214, 279, 280
BNC, 6, 7, 12, 13⁴, 34, 84, 85, 89, 119, 122, 135, 184–186, 206, 207, 218–220, 223, 224, 229, 232, 234, 237, 242, 245, 250, 255, 258, 264–266, 281, 283, 286, 290, 293, 296, 297, 305–307, 310–312, 315, 316, 318, 326–328, 328⁴, 334, 336, 343, 345, 348–351, 372, 374, 378, 403, 410, 419, 425, 426, 428, 429, 431, 432, 435
BNC Baby, 34, 51, 200, 341, 343, 360, 399, 422, 424
Bonferroni correction, 203, 203⁶, 222, 241
British English, 5, 6, 12, 33, 56, 58, 59, 65–73, 75–77, 90, 91², 107, 120, 121, 141, 148, 224, 296, 298, 299, 301, 315, 316, 334, 363, 365, 367, 368, 370, 372, 378, 412
BROWN, 30, 32–34, 37¹, 43, 56, 66, 70, 73, 76, 84, 85, 108, 110, 120, 131, 135, 136, 138, 144, 149, 241, 299, 313, 334, 336, 363, 365, 368, 370, 384, 385, 412
cardinal data, 141, 145–148, 158, 162, 163, 166, 176, 191, 197–199, 208, 342⁵
categorization, 7, 32, 34, 58, 59, 73, 77, 81, 85, 86, 97–100, 105, 120–126, 129, 136, 139, 143–147, 157, 162, 182, 184, 187, 197, 206, 237, 240, 242, 249, 253, 261, 276–278, 281, 282, 330, 332, 334, 336, 371, 378, 388, 399, 403, 405, 416, 417, 439
chance, 32–34, 88, 129, 131, 144¹, 154, 155, 161, 164, 166, 168–170, 172–174, 176, 178–181, 182³, 184, 188, 192, 202, 203, 215–218, 220, 222, 274, 275, 293, 402
chi-square test, 176, 177, 179–181, 183, 184, 186, 187, 191, 198, 200–203, 207–209, 211, 212, 218, 221, 225–228, 232, 233, 237, 241, 247, 255, 274, 278, 279, 282, 291–295, 313, 318, 321, 322, 325, 330, 334, 346, 349, 363, 378, 385, 403, 425, 431
clitic, 85, 108, 129, 149, 200, 206, 296, 297, 310, 363, 365
CLMET, 300, 301, 388
coding, 41, 42, 84, 105, 121, 123, 139, 158, 198, 271, 288, 365
cognitive linguistics, 7, 282, 397, 398, 400, 405, 406, 408, 410, 419, 431, 439
COHA, 392, 393
demography, 23, 25, 29, 39, 45, 57, 70, 119, 121, 145, 147, 207, 261, 295, 353, 354, 361, 362, 372, 378, 380, 387
description, 12, 17, 18, 21, 37¹, 46, 62, 123, 125, 148, 158, 166, 223, 234, 246, 264, 269, 303, 323, 325, 406, 437
determiner, 56, 86–88, 108, 113–115, 130, 131, 149, 215, 310
diachrony, 298, 302, 325, 328, 336, 349, 388, 392
dictionary, 18, 46–48, 48², 63, 70, 72, 75, 77–79, 96, 98, 216, 223, 234, 264, 286, 309, 325, 327, 328, 330–332, 336, 340, 399, 400, 402
distribution
    conditional, 51, 54–56, 58, 59, 61, 65–67, 69, 73, 74, 88, 119, 120, 125, 141, 144, 146, 148, 154–156, 162, 166, 168–170, 172, 176, 178–180, 183–187, 191, 194, 201, 205, 207, 214–216, 218, 220, 228, 232, 237, 254, 258, 261, 286, 292–295, 299, 309, 315, 318, 327, 330, 338, 341, 342, 363, 367, 374, 385, 387, 397, 402, 410, 416, 425, 436, 439
    normal, 196–198, 338, 342⁵
ditransitivity, 14, 49⁴, 113–116, 118, 119, 254, 270–272, 274, 303
effect size, 174, 182, 184, 195, 201, 205–207, 213, 214, 224–226, 295, 318, 323, 325, 349–351
elicitation, 10, 15, 22, 27, 28, 119, 122, 275
emotions, 116–118, 245–249, 277, 319, 400, 410–412, 414, 416–419, 425
epistemology, 15, 16, 221, 222
experimental method, 1, 4¹, 9, 10, 15, 17, 27, 28, 64¹, 101–103, 122, 218, 282
explanation, 14, 18, 56, 57, 64, 100, 101, 182³, 340, 395
exploration, 62, 222, 224, 233–235, 240, 244
falsification, 63, 67–69, 73, 74, 76, 100–103, 167–169
figurative language, 405, 430, 436
    metaphor, 30, 116–118, 120, 332, 397–400, 402, 403, 405–408, 410–412, 414, 416–419, 422, 424–426, 428–432, 434, 435
    metonymy, 99, 431, 434, 435, 435¹, 436
Fisher’s exact test, 179, 228, 229, 232, 233, 274, 330
FLOB, 33, 56, 70, 84, 85, 135, 299–301, 316, 318, 370, 371, 385
frequency, 51, 53–56, 59, 69, 76, 136, 143–146, 148, 150–155, 158, 162, 163, 169, 172, 173, 177, 184, 201, 218, 219, 223, 229, 230, 233, 235, 237, 252, 255, 264–266, 268, 269, 274, 278, 281, 283, 285, 286, 288, 290, 297–302, 310, 315, 342⁵, 355, 359, 360, 365, 367, 368, 384, 387, 392–395, 419, 432, 439
    expected, 154–156, 162, 166, 169, 170, 172, 175, 178–180, 184–187, 194, 198, 200–203, 209, 211, 211⁷, 212, 217, 218, 220,
mutual information, 226, 227, 232–235
𝑛-gram, 37, 392
negation, 43, 246, 296, 297
negative evidence, 3, 272
neologism, 326–328, 328⁴, 329, 330, 347
newspaper language, 24, 29, 32, 34, 46, 70, 200, 203, 204, 258, 298, 331, 345–347, 358, 385, 422, 432
nominal data, 141, 143–148, 150, 156–158, 162, 163, 166, 176, 177, 187, 188, 190, 191, 194, 198, 199, 208, 209, 234, 237, 240, 255, 270, 277, 289, 293, 305, 318, 331, 354, 362, 367, 372, 385, 403, 406
noun, 5², 43, 48, 50, 51, 53, 82, 85–89, 93, 97, 98, 108–111, 114, 115, 122, 124, 126, 127, 131, 137, 138, 149–151, 155, 197, 200–203, 206, 215, 220, 225, 229, 230, 232, 237, 245–247, 249, 250, 252, 254, 255, 258, 262–264, 266, 268, 283, 286, 288, 289, 306, 310, 325, 326, 331, 332, 334, 345, 357, 359, 365, 367, 374, 388, 399, 405–407, 414, 417–419, 424, 435, 436
null hypothesis, 168, 168¹, 169, 170, 172–175, 178, 187, 188, 192, 221, 222, 228
number, 43, 74, 105, 109, 246, 309, 374, 402, 403, 405
observational method, 1, 2, 4, 7, 16, 17, 22, 25, 59, 61, 62, 67, 137, 143, 145, 154, 183, 437
OED, 18, 46, 47, 63, 300–302, 317, 327, 328
operationalization, 58, 61, 77–81, 83–85, 87–90, 92, 92³, 93, 95–98, 100, 102, 103, 119–123, 131, 149, 150, 157, 163, 169, 174, 181, 183, 198, 206, 224, 254, 261, 277, 289, 317, 326, 327, 338, 347, 348, 367, 374, 380, 384, 385, 406, 438, 439
order, 93, 115, 118, 298
ordinal data, 86, 141, 145–148, 157, 158, 161, 162, 166, 176, 177, 187, 188, 198, 199, 208, 289, 342, 342⁵, 372
𝑝-value, 173–175, 213
paradigmatic relation, 240
paralinguistic, 23, 24, 39, 42, 43
passive voice, 58⁵, 115, 250, 278, 292, 414
pattern grammar, 270, 439
performance, 3, 4, 11, 75
phoneme, 90–92
POS tagging, 81, 83–89, 108–110, 113–116, 123, 127, 149, 150, 200, 206, 241, 250, 262, 264, 296–298, 303, 310
possessive, 81–83, 86, 90, 129–131, 137, 138, 142–144, 144¹, 145–158, 161–164, 166, 177–179, 181, 182, 182³, 183, 187, 190–192, 194–196, 198–202, 205, 206, 211–213, 245, 247, 261, 262, 282, 310–312, 414
possessive compound, 82
pragmatics, 3, 7, 8, 11, 18⁶, 244, 438
precision, 112, 112¹, 113–116, 127, 149, 262, 296, 309, 327, 419
Corpora are used widely in linguistics, but not always wisely. This book attempts to
frame corpus linguistics systematically as a variant of the observational method. The
first part introduces the reader to the general methodological discussions surrounding
corpus data as well as the practice of doing corpus linguistics, including issues such as
the scientific research cycle, research design, extraction of corpus data and statistical
evaluation. The second part consists of a number of case studies from the main areas of
corpus linguistics (lexical associations, morphology, grammar, text and metaphor), sur-
veying the range of issues studied in corpus linguistics while at the same time showing
how they fit into the methodology outlined in the first part.