Types
spoken vs. written
monolingual vs. bi/multilingual
parallel vs. comparable corpora (translation corpora)
general language purpose vs. specialised language purpose
diachronic vs. synchronic
plain text vs. annotated (tagged) text
Dr AFIDA MOHAMAD ALI
FBMK
Spoken Corpora
aim at representing spoken language
London-Lund Corpus (LLC)
Lancaster/IBM Spoken English Corpus (SEC)
Cambridge and Nottingham Corpus of Discourse in English (CANCODE)
Santa Barbara Corpus of Spoken American English (SBCSAE)
Wellington Corpus of Spoken New Zealand English (WSC)
Written Corpora
aim at representing written language
Brown Corpus (written texts, AE, 1961)
LOB Corpus (comparable to the Brown Corpus, BE, early 1960s)
Frown Corpus (AE, early 1990s)
FLOB Corpus (BE, early 1990s)
Multilingual Corpora
aim at representing two or more languages, often with the same text types (for contrastive analyses)
Parallel corpora (source texts plus translations):
Canadian Hansard
http://martinweisser.org/corpora_site/CBLLinks.html
Comparable vs. parallel corpora
The sampling frame is essential for comparable corpora but not for parallel corpora, because the texts are exact translations of each other.
General Corpora
The broadest type of corpus: very large (more than 10 million words), containing a variety of language so that findings from it may be somewhat generalized.
Spoken vs. written
Monolingual vs. bi-/multilingual
Types of corpora
Monolingual
Reference corpora
Medical corpora
Economic corpora
Legal corpora
Types of corpora
Bi-/multilingual
Comparable: L1, L2, L3, ... L-N
Parallel: translations
L1 to L2
Bidirectional (L1 to L2, L2 to L1)
Free translation
Types of corpora
Written Corpora
Synchronic (e.g. varieties of English: BrEn, USEn, Euro-English, etc.)
Diachronic (e.g. Modern English, Medieval English, etc.)
English Corpora
The Brown Corpus (1964)
1 million words (500 samples of 2,000 words each), written American English, texts published in the US in 1961
The Lancaster-Oslo/Bergen (LOB) Corpus (1978)
similar to the Brown Corpus, British English, texts from 1961 (compiled 1970-1978)
English Corpora
The London-Lund Corpus (LLC)
200 samples, ~5000 words each, 1953-1987, spoken
British English, transcribed.
The Frown Corpus
Freiburg-Brown Corpus of American English (1992)
1990s analogue to the Brown Corpus (1 million words, written American English).
The FLOB Corpus
Freiburg-LOB Corpus of British English, 1990s
analogue to the LOB corpus (1 million words,
written British English).
English Corpora
The British National Corpus (BNC)
100 million words: samples of written texts (90m words) and spoken language (10m words).
The International Corpus of English (ICE)
500 samples (300 spoken, 200 written), ~2,000 words each, 1990 onwards, 20 national varieties of English (e.g. UK, India, Singapore, Australia, Jamaica)
The BoE Corpus (The Bank of English Corpus)
450M words, full texts, open, written and spoken,
mainly US and UK
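The size figures quoted above follow directly from each corpus's sampling design. A minimal sketch of that arithmetic, using only the numbers given on the slides; the names `SAMPLE_DESIGNS` and `corpus_size` are hypothetical, introduced here for illustration, and the ICE figure is read as applying to each national variety:

```python
# Sample designs as quoted on the slides: (number of samples, approx. words per sample).
# The dictionary name and keys are illustrative, not part of any corpus toolkit.
SAMPLE_DESIGNS = {
    "Brown": (500, 2_000),
    "ICE (per national variety)": (500, 2_000),
}


def corpus_size(samples: int, words_per_sample: int) -> int:
    """Approximate corpus size in words for a fixed-size sampling design."""
    return samples * words_per_sample


sizes = {name: corpus_size(*design) for name, design in SAMPLE_DESIGNS.items()}

# BNC written/spoken split quoted on the slides, in millions of words.
bnc_written_m, bnc_spoken_m = 90, 10
bnc_spoken_share = bnc_spoken_m / (bnc_written_m + bnc_spoken_m)

for name, size in sizes.items():
    print(f"{name}: about {size:,} words")
print(f"BNC spoken share: {bnc_spoken_share:.0%}")
```

The same calculation makes it easy to see why Brown, LOB, Frown and FLOB are directly comparable: they share one sampling design and differ only in variety and period.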
Web Corpora
Adam Kilgarriff - http://www.kilgarriff.co.uk/
Web Corpora:
Google: www.google.com
www.webcorp.org.uk
BootCaT
http://corpora.fi.muni.cz/bootcat/