Corpus Linguistics
1. Introduction
Corpus Linguistics is a multidimensional field with a spectrum wide enough to encompass the diversities of language use across all domains of linguistic interaction, communication, and comprehension. The introduction of corpora into language study and application has added a new dimension to linguistics. In principle, Corpus Linguistics is an approach that investigates language and all its properties by analysing large collections of text samples. The approach has long been used in a number of research areas: the descriptive study of a language, language education, lexicography, and so on. It broadly refers to the exhaustive analysis of any substantial amount of authentic spoken and/or written text. In general, it covers large amounts of machine-readable data of actual language use, including collections of literary and non-literary text samples, to reflect both the synchronic and diachronic aspects of a language. The uniqueness of corpus linguistics lies in its use of modern computer technology for collecting language data, in its methods for processing language databases, in its techniques for language data and information retrieval, and in its strategies for applying these in all kinds of language-related research and development activities. The electronic (digital) language corpus is a comparatively new phenomenon, with a history of barely half a century. Therefore, we are yet to reach a common consensus on what counts as a corpus, and how it should be designed, developed, classified, processed, and utilised. The basic philosophy behind corpus linguistics has two wings: (a) we have a cognitive drive to know how people use language in their daily communication activities, and (b) we want to know whether it is possible to build intelligent systems that can efficiently interact with human beings.
With this motivation, computer scientists and linguists have come together to develop language corpora that can be used for designing intelligent systems (e.g., machine translation systems, language processing systems, speech understanding systems, text analysis and understanding systems, computer-aided instruction systems, etc.) for the benefit of the language community at large. All branches of linguistics and language technology can benefit from the insights obtained from the analysis of corpora. Thus, the description and analysis of linguistic properties collected from a corpus becomes of paramount importance in many areas of human knowledge and application.
2. What is a Corpus?
The term corpus is derived from the Latin corpus "body". At present it means a representative collection of texts of a given language, dialect, or other subset of a language to be used for linguistic analysis. In a finer definition, it refers to (a) (loosely) any body of text; (b) (most commonly) a body of machine-readable text; and (c) (more strictly) a finite collection of machine-readable texts sampled to be representative of a language or variety (McEnery and Wilson 1996: 218). A corpus contains a large collection of representative samples of texts covering different varieties of language used in various domains of linguistic interaction. Theoretically, a corpus is (C)apable (O)f (R)epresenting (P)otentially (U)nlimited (S)elections of texts. It is compatible with the computer, operational in research and application, representative of the source language, processable by man and machine, unlimited in data, and systematic in formation and representation (Dash 2005: 35).
N. S. Dash: Corpus Linguistics: A General Introduction. CIIL, Mysore, 25th August 2010
The Indian Institute of Technology, Delhi was given the task of developing corpora of Indian English, Hindi, and Punjabi; the Central Institute of Indian Languages, Mysore was assigned the corpora of Tamil, Telugu, Kannada, and Malayalam; Deccan College, Pune developed corpora of Marathi and Gujarati; the Indian Institute of Applied Language Sciences, Bhubaneswar developed corpora of Oriya, Bengali, and Assamese; Sampurnananda Sanskrit University, Varanasi was entrusted with the development of the Sanskrit corpus; and Aligarh Muslim University, Aligarh was assigned the task of developing corpora of Urdu, Sindhi, and Kashmiri. The Indian Institute of Technology, Kanpur took responsibility for designing systems and software for language processing and machine translation, while CIIL, Mysore took responsibility for archiving the entire corpus database of all the Indian languages for future utilisation. After the completion of the TDIL project in 1995, work on corpus development and processing stopped for reasons beyond the knowledge of the present author. However, the MICT, Govt. of India has revived the whole enterprise (http://www.mit.gov.in) with new enthusiasm and vision. The realisation of this enterprise is clearly manifested in the formation of the LDC-IL, although my personal view is that it should be the NAIL (National Archive for the Indian Languages) rather than the LDC-IL (Dash 2003).
Special corpus: A special corpus (e.g., the CHILDES database) is designed from texts sampled in a general corpus for a specific variety of language, dialect, or subject, with emphasis on certain properties of the topic under investigation. It varies in size and composition according to purpose. It does not contribute to the description of a language, because it contains a high proportion of unusual features. Its origin is not reliable, as it records data from people not behaving normally. A special corpus is not balanced (except within the scope of its given purpose) and, if used for other purposes, gives a distorted and skewed view of language segments. It is different in principle, since it features one or another variety of normal, authentic language. Corpora of the language of children, non-native speakers, users of dialects, and special areas of communication (e.g., auctions, medical talk, gambling, court proceedings, etc.) are designated as special corpora because of the non-representative nature of the language involved. The main advantage of a special corpus is that its texts are selected in such a way that the phenomena one is looking for occur more frequently in it than in a balanced corpus; a corpus enriched in this way can be smaller than a balanced corpus providing the same type of data (Sinclair 1996b).

Sublanguage corpus: It consists of only one text variety of a particular language. It sits at the other end of the linguistic spectrum from a reference corpus. The homogeneity of its structure and its specialised lexicon allow the quantity of data to be small while still demonstrating typically good closure properties.

Sample corpus: A sample corpus (e.g., the Zurich Corpus of English Newspapers) is a category of special corpus made with samples containing a finite collection of texts chosen with great care and studied in detail.
Once a sample corpus is developed, it is not added to or changed in any way (Sinclair 1991: 24), because any change would unbalance its constitution and distort the research requirement. The samples are small in relation to full texts, and of constant size; therefore, they do not qualify as texts.

Literary corpus: A special category of sample corpus is the literary corpus, of which there are many kinds. Classification criteria considered in generating such a corpus include author, genre (e.g., odes, short stories, fiction, etc.), period (e.g., 15th century, 18th century, etc.), group (e.g., Romantic poets, Augustan prose writers, Victorian novelists, etc.), theme (e.g., revolutionary writings, family narration, industrialisation, etc.), and other issues as valued parameters.

Monitor corpus: A monitor corpus (e.g., the Bank of English) is a growing, non-finite collection of texts with scope for constant augmentation of data reflecting changes in the language. The constant growth of the corpus reflects change in the language while leaving untouched the relative weight of its components as defined by parameters; the same composition schema is followed year by year. The basis of a monitor corpus is its reference to texts spoken or written in one single year (Sinclair 1991: 21). From a monitor corpus we can find new words, track variation in usage, observe change in meaning, establish long-term norms of frequency distribution, and derive a wide range of lexical information. Over time, the balance of components of a monitor corpus changes, because new sources of data become available and new procedures enable scarce material to become plentiful. The rate of flow is adjusted from time to time.
following some predefined parameters. Size, content, and field may vary from corpus to corpus, which is not permitted in the case of a parallel corpus.

Multilingual corpus: A multilingual corpus (e.g., the Crater Corpus) contains representative collections from more than two languages. Generally, here as well as in a bilingual corpus, similar text categories and identical sampling procedures are followed, although the texts belong to different languages.
composition pattern, but there is no agreement on the nature of the similarity, as there are few examples of comparable corpora. It is indispensable for comparison across languages and for the generation of bilingual and multilingual lexicons and dictionaries.

Opportunistic corpus: An opportunistic corpus is an inexpensive collection of electronic texts that can be obtained, converted, and used free or at a very modest price, but it is often unfinished and incomplete, so users are left to fill in the blank spots for themselves. Its place is in situations where size and corpus access do not pose a problem. The opportunistic corpus is a virtual corpus in the sense that the selection of an actual corpus (from the opportunistic corpus) depends on the needs of a particular project. A monitor corpus is generally considered an opportunistic corpus.
There can be other types of specification as well, such as the closed corpus, synchronic corpus, historical corpus, dialect corpus, idiolect corpus, and sociolect corpus. Therefore, the scheme of classification presented here is not absolute and final; it is open to re-categorisation as well as sub-classification according to different parameters.
the American National Corpus, etc. are indeed very large in size, each one containing more than a hundred million words.
distance (e.g., British English and American English). In this case, we need to tread a different track. It is better to recognise and build a separate corpus for the lexical items and syntactic constructions that are common in, or typical of, one group of native speakers, especially those that differ from the other (e.g., words and sentences typical of British English vs. words and sentences typical of American English). We also need to capture constructions that are correct by the rules of grammar and usage of American English, and perfectly understandable, but simply not right by the rules of grammar and usage of British English. This usually denies even the most proficient native speakers of American English the opportunity of having their language enlisted in a corpus of British English. In this context, let us consider Indian English. Although Indians are often exposed to a great deal of linguistic material that bears the marks of non-Indian English (texts of British English and American English), people who want to describe, recognise, understand, and generate Indian English will definitely go for texts produced by native users of Indian English, which will highlight the linguistic traits typical of Indian English, and thus defy the all-pervading influence of British English or American English over Indian English.
Table 1: Type of corpus users and their needs with regard to the type of corpus
The process of data input is indirectly based on the method of text sampling. We can, for instance, take two pages after every ten pages from a book. This makes a corpus well representative of the data stored in a physical text: if a book has several chapters, each containing different subject matter written by different writers, then the text samples collected in this process from all the chapters are properly represented. Each text file should have a header which contains metadata, i.e., the physical information about the text, such as the genre of the text (e.g., literature, science, commerce, technology, engineering, etc.), the type of text (e.g., literature, story, travelogue, humour, etc.), the sub-type of text (e.g., fiction, historical, social, biographical, science fiction, etc.), the name of the book, the name of the author(s), the name of the editor(s), the year of publication, the edition number, the name of the publisher, the place of publication, the number of pages taken for input, etc. This information is required for maintaining records and resolving copyright problems. It is also advantageous to keep detailed records of the materials so that the texts can be identified on grounds other than those selected as formatives of the corpus. Information on whether the text is a piece of fiction or non-fiction, a book, journal, or newspaper, formal or informal, etc. is useful for both linguistic and non-linguistic studies. At the time of input, the original text of the physical source must be kept unchanged. After a paragraph is entered, one blank line should be given before a new paragraph starts. When texts are collected by random sampling, a unique mark or flag needs to be posted at the beginning of each new sample of text.
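One minimal way to record such a file header is sketched below. The field names and the tag-style rendering are illustrative assumptions for this sketch, not a fixed standard used by any particular corpus project; the sample values are invented.

```python
# A minimal, illustrative header for one corpus text file.
# Field names and the tag format are assumptions, not a standard.
header = {
    "genre": "literature",
    "text_type": "story",
    "sub_type": "fiction",
    "book_title": "An Example Title",
    "author": "Author One",
    "year": "1998",
    "edition": "2",
    "publisher": "Example Press",
    "place": "Kolkata",
    "pages_sampled": "11-12, 21-22",
}

def format_header(h):
    """Render the metadata block placed at the top of a text file."""
    return "\n".join(f"<{key}>{value}</{key}>" for key, value in h.items())

print(format_header(header))
```

Keeping the header machine-readable in this way makes it easy to later select sub-corpora by genre, period, or author without re-reading the texts themselves.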
one needs schemes for its regular maintenance and augmentation. There are always some errors to be corrected, some modifications and adjustments to be made, and some improvements to be implemented. Adaptation to new hardware and software technology, and changes in the requirements of the users, also have to be taken care of. In addition, there has to be constant vigilance over the retrieval tasks as well as over the processing and analytic tools applied to the corpus. At present, computer technology has developed to such an extent that executing these tasks with full satisfaction is no longer a dream. But this calls for greater care in the handling of digital databases, since the more powerful the tools in our hands, the more care is needed in their use and application to avoid unwanted damage to the resources.
To remove spelling errors, we need to check the corpus thoroughly, compare it with the actual physical data source, and make manual corrections. Care has to be taken to ensure that the spelling of words used in the corpus matches the spelling of words used in the source texts. It also has to be checked whether words have been changed, repeated, or omitted, whether punctuation marks are properly used, whether lines are properly maintained, and whether separate paragraphs are made for each text. Besides error correction, we have to verify the omission of foreign words, quotations, dialectal forms, etc. after the generation of a corpus. Naturalised foreign words are, however, allowed to enter the corpus; others should be omitted. Dialectal variations are allowed, and punctuation marks and transliterated words are faithfully reproduced. Usually, books on the natural and social sciences contain more foreign words, phrases, and sentences than books of stories or fiction. Similarly, quotations from other languages, poems, songs, mathematical expressions, chemical formulae, geometric diagrams, images, tables, pictures, figures, flow charts, and similar symbolic representations in the source texts are not entered into the corpus. All kinds of processing and reference work become easier and more authentic if the corpus is properly edited and the errors are removed.
7. Corpus Processing
The need for corpus processing arises after the generation of a corpus. We need to devise systems, tools, techniques, and software for accessing the language data and for extracting relevant information from the corpus. Corpus processing is indispensable not only for mainstream linguistic research and development activities but also for language technology work. There are various corpus processing techniques, such as statistical analysis, concordance, lexical collocation, key-word search, local word grouping, lemmatisation, morphological processing and generation, chunking, word processing, part-of-speech tagging, annotation, parsing, etc. It has been observed that the results obtained from corpus processing often contradict intuitions about a language and its properties. Many corpus processing software tools are available for English, French, German, and similar languages; for the Indian languages, however, there are only a few. We need to design corpus processing tools for our own languages, keeping the nature of the Indian languages in mind. Here I discuss in brief some well-known corpus processing techniques and tools actively used for English and other European languages. I describe these with close reference to English; reference to Bengali and other Indian languages is made as and when necessary.
often each different item occurs in it. A frequency list of words is a set of clues to the texts: by examining the list, we get an idea of the structure of a text and can plan an investigation accordingly. The alphabetically sorted word list, on the other hand, is used for simple general reference. A frequency list in alphabetical order plays a secondary role, because it is used only when there is a need to check the frequency of a particular item. However, it is useful as an object of study, as it often helps in formulating hypotheses to be tested and in checking assumptions made beforehand (Kjellmer 1984). Before we initiate frequency counting on the Indian language corpora, we need to take decisions about how to deal with the characters, words, idioms, phrases, clauses, and sentences used in the corpus. These will restrain us from false observations and wrong deductions about the various linguistic properties of the languages.
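The two kinds of word list described above (frequency-ordered and alphabetically ordered) can be sketched in a few lines of Python. The tokenisation here is a deliberate simplification (lower-casing and whitespace splitting); a real corpus tool would have to settle the decisions about characters, words, and idioms mentioned above before counting.

```python
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency."""
    words = text.lower().split()  # naive tokenisation for illustration
    return Counter(words).most_common()

def alphabetical_list(text):
    """The same counts, sorted alphabetically for general reference."""
    return sorted(Counter(text.lower().split()).items())

sample = "the cat sat on the mat and the dog sat too"
print(frequency_list(sample)[:3])  # most frequent items first
print(alphabetical_list(sample)[:3])
```

The frequency-ordered list surfaces the structure of the text at a glance, while the alphabetical list serves as a lookup table when a particular item needs to be checked.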
7.3 Concordance
The process of concordance refers to making an index of the words used in a corpus. It is a collection of occurrences of words, each in its own textual environment, where each word is indexed with reference to the place of its occurrence in the texts. It is an indispensable strategy, because it gives access to many important language patterns in texts and provides information not accessible through intuition. Some concordance software is available for analysing corpora, e.g., MonoConc for sorting and frequency, ParaConc for parallel text processing, Conc for sorting and frequency counting, and Free Text for processing, sorting, etc. Concordance is most frequently used for lexicographic work and for language teaching. One can use it to search out single-word as well as multiword strings, phrases, idioms, proverbs, etc. It is also used to study lexical, semantic, and syntactic patterns, text patterns, genres, and the style patterns of texts (Barlow 1996). It is an excellent tool for investigating words and morphemes that are polysemous and have multiple functions in a language.
of two, three, or four words on either side of the word at the centre. This pattern may vary according to one's need. At the time of analysing words, phrases, and clauses, it is agreed that additional context is needed for better understanding. It is better to think of KWIC output as a text in itself, and to examine the frequency of words in the environment of the central word. Not all the information is needed every time, but we utilise it when we require it. After analysing a corpus with KWIC, we can formulate various objectives in linguistic description and devise procedures for pursuing them. KWIC helps us understand the importance of context, the role of associative words, the actual behaviour of words in context, the actual environment of occurrence, and whether any contextual restriction is present in the use of a word. For instance, KWIC on the Bank of English shows that the most frequently used verb in reflexive form is find, followed by see, show, present, manifest, and consider, all of which involve the 'viewing' of a representation or proposition.
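A minimal keyword-in-context routine of the kind described above can be sketched as follows. The window size and the alignment format are arbitrary choices for this sketch, and the sample sentence is invented.

```python
def kwic(tokens, keyword, window=4):
    """Return keyword-in-context lines: `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

text = ("the committee will find itself unable to find a way "
        "to find common ground").split()
for line in kwic(text, "find", window=3):
    print(line)
```

Aligning every hit on the central word, as the format string does here, is what lets the analyst read the concordance vertically and spot recurring left- and right-hand collocates.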
Surface form: baleichilaam
Word-class: Finite Verb
Root part: bal-
Suffix part: -eichilaam
Aspect marker: -e-
Particle marker: -i- (emphatic)
Auxiliary marker: -ch-
Tense marker: -il(a) (past)
Person marker: -aam (1st)
Honorific marker: Null
Number marker: Null (Sng./Pl.)
Meaning: "I/we had said"
People working on their native language can obtain better results, since intuitive knowledge helps in finding the right root or suffix part of inflected words, which may be beyond the grasp of non-native users.
In this table the text is read downwards, with the grammatical tags on the left and the word-sense tags on the right. Semantic tags are composed of an upper-case letter indicating the general discourse field, a
digit indicating a first subdivision of the field, a decimal point followed by a further digit to indicate a finer subdivision, and one or more 'pluses' or 'minuses' to indicate a positive or negative position on a semantic scale. For example, A4.2+ indicates a word in the category 'general and abstract words' (A), subcategory 'classification' (A4), sub-subcategory 'particular and general' (A4.2), and 'particular' as opposed to 'general' (A4.2+). Likewise, E2+ belongs to the category 'emotional states, actions, events and processes' (E), subcategory 'liking and disliking' (E2), and refers to 'liking' rather than 'disliking' (E2+).
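The structure of these semantic tags is regular enough to decode mechanically. The sketch below splits a tag into the components just described; the two field glosses are taken from the examples in the text, and the function name and output dictionary are my own illustrative choices.

```python
import re

# Glosses for the two discourse fields used as examples in the text.
FIELDS = {"A": "general and abstract words",
          "E": "emotional states, actions, events and processes"}

def decode_tag(tag):
    """Split a tag such as 'A4.2+' into field, subdivisions, polarity."""
    m = re.fullmatch(r"([A-Z])(\d*)(?:\.(\d+))?([+-]*)", tag)
    if not m:
        raise ValueError(f"unrecognised tag: {tag}")
    field, sub, subsub, signs = m.groups()
    return {"field": FIELDS.get(field, field),
            "subdivision": sub or None,
            "finer_subdivision": subsub or None,
            "polarity": signs or None}

print(decode_tag("A4.2+"))
print(decode_tag("E2+"))
```

Decoding tags into their parts this way makes it possible to group or search annotated words at any level of granularity, from the whole discourse field down to one end of a semantic scale.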
7.11 Lemmatisation
The process of lemmatisation involves identifying the inflected words used in a piece of text and reducing them to their respective lexemes or lemma forms. It allows researchers to extract and examine all the variants of a particular lemma without having to input all the possible variants, and to produce frequency and distribution information for the lemma. It is useful in language teaching, where learners are trained to identify the total number of surface forms of a lemma: which forms are inflected, how many times they are inflected, in which ways they are inflected, and so on. A part of the Brown Corpus contains the lemmatised forms of words along with all lexical and grammatical information. No Indian language corpus has been lemmatised as yet.
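The effect of lemmatisation on frequency counting can be sketched with a toy lookup table. Real lemmatisers combine morphological rules with a lexicon; the hand-made table below is purely illustrative and is not drawn from any corpus resource.

```python
from collections import Counter

# Toy lemma table: surface variants mapped to their lemma (illustrative).
LEMMAS = {"goes": "go", "went": "go", "gone": "go", "going": "go",
          "children": "child", "better": "good", "best": "good"}

def lemmatise(word):
    """Reduce a surface form to its lemma; unknown words pass through."""
    return LEMMAS.get(word.lower(), word.lower())

def lemma_frequencies(tokens):
    """Frequency information per lemma rather than per surface form."""
    return Counter(lemmatise(t) for t in tokens)

tokens = "She goes where the children went and he is going too".split()
print(lemma_frequencies(tokens))
```

Counting over lemmas rather than surface forms is what lets a researcher see, for instance, that *go* is frequent in a text even when no single inflected form of it is.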
7.12 Annotation
Apart from the pure text, a corpus may be provided with additional linguistic information, known as annotation. This information is of different kinds, such as part-of-speech, prosodic, semantic, anaphoric, and discoursal annotation. An annotated corpus is a very useful tool for research. The grammatically tagged corpus is the most common form of annotated corpus, in which each word is assigned a word class. The Brown Corpus, the LOB Corpus, and the BNC are grammatically annotated, the LLC is prosodically annotated, and the Susanne Corpus is syntactically annotated. We are yet to start the work of annotation on the Indian language corpora.

(a) Part-of-speech annotation: In part-of-speech annotation, the aim is to assign to each lexical unit in the text a code indicating its part of speech. It increases the specificity of data retrieval from a corpus, and helps in syntactic parsing and semantic field annotation. It also allows us to distinguish between homographs.

(b) Anaphoric annotation: In an anaphoric annotation scheme, all pronouns and noun phrases are co-indexed within a broad framework of cohesion, and the different types of anaphora are listed and sorted. Such an annotation scheme is used for studying and testing mechanisms such as pronoun resolution, and it is important for text understanding and machine translation.

(c) Prosodic annotation: In prosodic annotation, the patterns of intonation, stress, and pauses in speech are indicated. It is a more difficult type of annotation, because prosody is considerably more impressionistic in nature than other linguistic levels and requires careful listening by a trained ear. The Lancaster/IBM Spoken English Corpus is a prosodically annotated corpus in which stressed syllables with and without independent pitch movement are marked with different symbols, while all unstressed syllables, whose pitch is predictable from the tone marks of surrounding accented syllables, are left unmarked.
(d) Semantic annotation: In semantic annotation, either the semantic features of the words in a text (essentially, annotation of word senses) or the semantic relationships between the words in a text (e.g., the agents or patients of particular actions) are marked. There is no unanimously agreed norm about which semantic features ought to be annotated. Some propose (Garside, Leech, and McEnery 1997) to use Roget's Thesaurus, where words are organised into general semantic categories. Such an annotation scheme is designed to apply both to open-class (content) words and to closed-class words, as well as to proper nouns, which are marked by a tag and set aside from statistical analysis.

(e) Discoursal annotation: In discoursal annotation, a corpus is annotated at the levels of text and discourse, and is used in linguistic analysis. Despite its potential role in the analysis of discourse, this kind of annotation has never been widely used, possibly because discoursal categories are context-dependent, and their identification in texts is a greater source of dispute than other forms of
linguistic phenomena. Some researchers (Stenström and Andersen 1996) have annotated the London-Lund Spoken Corpus with 16 discourse tags to observe trends in teenage talk.
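The word_TAG style of grammatical tagging mentioned above can be illustrated with a toy annotator. The lexicon, the `UNK` fallback tag, and the function name are illustrative assumptions; real taggers supplement a lexicon with contextual disambiguation rules rather than looking words up one at a time.

```python
# Toy lexicon for illustration (tags in the style of CLAWS-like tag sets).
LEXICON = {"pearl": "NP1", "sat": "VVD", "on": "II",
           "a": "AT1", "chair": "NN1"}

def annotate(sentence):
    """Attach a POS tag to each word with an underscore separator."""
    return " ".join(f"{word}_{LEXICON.get(word.lower(), 'UNK')}"
                    for word in sentence.split())

print(annotate("Pearl sat on a chair"))
# → Pearl_NP1 sat_VVD on_II a_AT1 chair_NN1
```

Even this crude scheme shows why tagging increases the specificity of retrieval: a search for `sat_VVD` finds only the past-tense verb, never a homographic noun.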
7.13 Parsing
Parsing is related to the automatic analysis of texts according to a grammar (Barnbrook 1998: 170). Technically, it refers to the practice of assigning syntactic structure to a text (McEnery and Wilson 1996: 178). It is usually performed after the basic morphosyntactic categories have been identified within a text. Based on different grammars (e.g., dependency grammar, context-free phrase structure grammar, systemic functional grammar, extended affix grammar, etc.), parsing brings these morphosyntactic categories into higher-level syntactic relationships with one another. Sentence-level parsing involves automatic context-based as well as context-free syntactic analysis, using information acquired from word-level processing. A parsed corpus is known as a treebank, because the term alludes to the tree diagrams used in parsing. The visual diagram of the tree structure is rarely found in corpus annotation; generally, the same information is represented using sets of labelled brackets. Thus, Pearl sat on a chair will appear in a treebank in the following way: [S[NP Pearl_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 chair_NN1 NP] PP] VP] S] where morphosyntactic information is attached to words by underscore characters, while constituents are indicated by opening and closing square brackets annotated at the beginning and end with the phrase type, e.g., [S ... S]. Not all parsing systems are similar. The main differences lie in (i) the number of constituent types a system employs, and (ii) the way in which constituent types are allowed to combine with each other. Despite these differences, however, the majority of parsing schemes are based on a form of context-free phrase structure grammar. Within this system, a full parsing scheme aims at providing a detailed analysis of sentence structure, while a skeleton parsing scheme uses a less finely distinguished set of syntactic constituent types and ignores the internal structure of certain constituent types.
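The labelled-bracket notation shown above is easy to read back into a nested tree structure. The sketch below is a minimal reader for strings in exactly that format; the representation it builds (nested `(label, children)` pairs, with `(word, tag)` leaves) is my own illustrative choice, not a standard treebank interchange format.

```python
import re

def parse_treebank(s):
    """Parse labelled-bracket notation like [S[NP Pearl_NP1 NP] ... S]
    into nested (label, children) pairs; leaves are (word, tag) pairs."""
    # Tokens are opening labels "[NP", closing labels "NP]", or words.
    tokens = re.findall(r"\[[A-Z]+|[A-Z]+\]|\S+", s)
    stack = [("TOP", [])]
    for tok in tokens:
        if tok.startswith("["):
            stack.append((tok[1:], []))          # open a constituent
        elif tok.endswith("]"):
            label, children = stack.pop()        # close it
            assert label == tok[:-1], "mismatched phrase labels"
            stack[-1][1].append((label, children))
        else:
            word, tag = tok.rsplit("_", 1)       # word_TAG leaf
            stack[-1][1].append((word, tag))
    return stack[0][1][0]

tree = parse_treebank(
    "[S[NP Pearl_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 chair_NN1 NP] PP] VP] S]")
print(tree[0])  # top label of the parse
```

Because the notation repeats the phrase label at both brackets, the reader can verify well-formedness as it goes, which is exactly what makes the format convenient for post-editing by human analysts.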
Parsing is most often post-edited by human analysts, because automatic parsing has a lower success rate than part-of-speech annotation. The disadvantage of fully manual parsing is inconsistency on the part of the analyst(s) engaged in parsing or editing the corpus. To overcome this, more detailed guidelines are provided, but even then ambiguities may occur when multiple interpretations are possible. Treebanks are language resources that provide annotations of natural languages at various levels of structure: at the word level, the phrase level, the sentence level, and sometimes at the level of function-argument structure. Treebanks have become crucially important for developing data-driven approaches to natural language processing, human language technologies, grammar extraction, and linguistic research in general. There are a few ongoing projects on the compilation of representative treebanks for many European and American languages. The implementation of such systems on Indian corpora requires more time and research. The processing of corpus texts is of high importance to any language processing system that attempts to use natural language in some way or other. The advanced requirements of users raise the need for efficient and widely applicable systems, and the need for comprehensive processing capabilities forms a strong interface among theoretical, applied, and computational linguistics. Given the complexity of natural languages, it is always difficult for a machine to make accurate decisions about any property of a language. Therefore, occasional errors in processing should not be taken as a major roadblock to research in language processing. An interactive computer program designed specifically for checking errors can make this process much faster and more reliable.
8. Utility of Corpus
Unless defined otherwise, let us assume that a corpus possesses all the properties mentioned in Section 3. In essence, a corpus is an empirical standard which acts as a benchmark for validating the usage of linguistic properties found in a language. By analysing a corpus database, one can retrieve the following kinds of information about a language or variety:

(a) information about all the properties and components used in a language, e.g., sounds, phonemes, intonation, letters, punctuation marks, morphemes, words, stems, bases, lemmas, compounds, phrases, idioms, set phrases, reduplications, proverbs, clauses, sentences, etc.;

(b) grammatical and functional information about letters, graphemes, allographs, morphemes, words, phrases, sentences, idiomatic expressions, proverbs, etc., relating to their structure, composition, patterns of using affixes and inflections, patterns of constituent structure, contexts of use, usage patterns, variations of contexts, etc.;

(c) usage-based information about letters, characters, phonemes, morphemes, words, compounds, phrases, sentences, etc., relating to their descriptive, stylistic, metaphorical, allegorical, idiomatic, and figurative usages; and

(d) extralinguistic information relating to the time, place, situation, and agent of language events, the socio-cultural backgrounds of linguistic acts, the life and living of the target speech community, discourse and pragmatics, as well as the world knowledge of the language users at large.
It is understandable that developing a corpus in accordance with the preconditions mentioned above is a really tough task. However, we can simplify the task to some extent if we redefine the entire concept of corpus generation on the basis of object-oriented and work-specific needs. Since all types of corpora need not follow the same set of design and composition principles, we have the liberty to design a corpus keeping in mind the work we are planning to do with it (Dash 2008: 47). The underlying proposition is that the general principles and conditions of corpus generation may vary depending on the purpose of the corpus developer or user. Corpus linguistics, however, is not the same thing as obtaining language databases through the use of the computer; it is the processing and analysis of the data stored within a corpus. The main task of a corpus linguist is not to gather databases but to analyse them. The computer is a useful, and sometimes indispensable, tool for carrying out these activities.
9. Use of Corpus
There are a number of areas where language corpus is directly used: language description, the study of syntax, phonetics and phonology, prosody, intonation, morphology, lexicology, semantics, lexicography, discourse, pragmatics, language teaching, language planning, sociolinguistics, psycholinguistics, semiotics, cognitive linguistics, and computational linguistics, to mention a few. In fact, there is hardly any area of linguistics where corpus has not found its utility. This has been possible due to the great possibilities offered by the computer in collecting, storing, and processing natural language databases. The availability of computers and machine-readable corpora has made it possible to obtain data quickly and easily, and also to have the data presented in a format suitable for analysis.

Corpus as knowledge resource: corpus is used for developing multilingual libraries, designing course books for language teaching, compiling monolingual, bilingual, and multilingual dictionaries (printed and electronic), monolingual thesauri (printed and electronic), various reference materials (printed and electronic), machine-readable dictionaries (MRDs), multilingual lexical resources, and electronic dictionaries (easily portable, can be duplicated in as many copies as needed, can be modified easily for newer versions, can be customised according to the needs of users, can be readily accessed, more durable than a printed dictionary, etc.).

N. S. Dash: Corpus Linguistics: A General Introduction. CIIL, Mysore, 25th August 2010

Corpus in language technology: corpus is used for designing tools and systems for word processing, spelling checking, text editing, morphological processing, sentence parsing, frequency counting, item-search, text summarisation, text annotation, information retrieval, concordance, word sense disambiguation, WordNet (synsets), the Semantic Web, Semantic Net, parts-of-speech tagging, local word grouping, etc.

Corpus for translation support systems: corpus is used for language resource access systems, machine translation systems, multilingual information access systems, cross-language information retrieval systems, etc.

Corpus for human-machine interface systems: corpus is used for OCR, voice recognition, text-to-speech, e-learning, on-line teaching, e-text preparation, question-answering, computer-assisted language education, computer-aided instruction, e-governance, etc.

Corpus in speech technology: speech corpus is used to develop a general framework for speech technology; to study phonetic, lexical, and pronunciation variability across dialectal versions; and for automatic speech recognition, automatic speech synthesis, automatic speech processing, speaker identification, repairing speech disorders, forensic linguistics, etc.

Corpus in mainstream linguistics: corpus is used for language description, lexicography, lexicology, paribhasa formation, grammar writing, semantic study, language learning, dialect study, sociolinguistics, psycholinguistics, stylistics, bilingual dictionary compilation, extraction of translation equivalents, generation of terminology databanks, lexical selection restriction, dissolving lexical ambiguity, grammatical mapping, semiotics, pragmatic and discourse study, etc.
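One of the corpus-processing tasks named above, concordancing, can be sketched in a few lines of standard Python. The keyword-in-context (KWIC) routine below is a simplified illustration under assumed names (`kwic`, the sample sentence), not the implementation of any particular concordance tool:

```python
def kwic(tokens, keyword, window=3):
    """Return keyword-in-context lines: every occurrence of the
    keyword with `window` tokens of left and right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

tokens = "the corpus is a body of machine readable text and the corpus grows".split()
for line in kwic(tokens, "corpus", window=2):
    print(line)
```

Lining up every occurrence of a word with its immediate context in this way is what makes corpus evidence so directly usable in lexicography, sense disambiguation, and grammar writing.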
use corpus in similar fashion to characterise different groups belonging to different classes, races, creeds, ethnicities, etc. (c) Among the media specialists, information retrievers use corpus to devise mechanisms for extracting appropriate information from bodies of text to build up a linguistic knowledgebase, find information on items for indexing, and summarise the important content of texts. Computational linguists use corpus to integrate their work with the statistical regularities found in corpus, which work as an important key to analysing and processing language. Also, corpus, as a source of data and knowledge, is used for testing the presence or absence of regularities in language, since statistical techniques become more effective when they work on the outputs of grammatically analysed corpora. Machine translators may access corpus to extract necessary and relevant linguistic information and to verify the efficiency of their systems, since corpus makes a significant contribution to enhancing the actual capability of such systems. Moreover, a domain-specific corpus enables systems to adopt a self-organising approach to supplement traditional knowledge-based approaches. Language processing specialists benefit more and more from the development of corpora of various types, since both raw and annotated corpora are used on a large scale for developing language processors. Suffice it to say that corpus is a beneficial resource for all, including researchers, technologists, writers, lexicographers, academicians, teachers, students, language learners, scholars, publishers, and others.
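The statistical regularities that computational linguists extract from corpora can be as simple as counts of adjacent word pairs. The sketch below (with illustrative data and an assumed helper name `bigrams`, not a production system) gathers bigram frequencies, the raw material for collocation study and statistical language modelling:

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs in a token sequence."""
    return list(zip(tokens, tokens[1:]))

tokens = "the corpus feeds the model and the corpus feeds the parser".split()
pair_freq = Counter(bigrams(tokens))

# Recurrent pairs hint at collocational patterns in the data.
print(pair_freq.most_common(2))
```

Scaled up to a large corpus, the same counts feed the probability estimates on which taggers, parsers, and translation systems rely.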
(d) Lack of information from visual elements: Corpus does not contain graphs, tables, pictures, diagrams, figures, images, formulae, and similar other visual elements, which are often used in a piece of text for proper cognition and understanding. A corpus devoid of such visual elements is bound to lose much of this information.

(e) Other limitations: Corpus creation and research work are unreasonably tilted towards written texts, which reduces the importance of speech. In reality, however, speech represents our language in a more reliable fashion than writing. The complexities of speech corpus generation make it a rare commodity. Thus, the easy availability of text corpus and the lack of speech corpus inspire people to turn towards the text corpus. However, this does not imply that speech corpus has lost its prime position in corpus linguistics research. Moreover, the language stored in a corpus fails to highlight the social, evocative, and historical aspects of language. Corpus cannot explain why a particular dialect is used as the standard one, how dialectal differences play decisive roles in establishing and maintaining group identity, how idiolect determines one's power, position, and status in society, or how language differs depending on domains, registers, etc. Corpus also fails to reveal how certain emotions are evoked by certain poetic texts, songs, and literature; how world knowledge and context play important roles in determining the intended meaning of an utterance; or how languages evolve, divide, and merge with changes in time and society.
Dash, N.S. 2009. Corpus-based Analysis of the Bengali Language. Saarbrücken, Germany: Verlag Dr. Müller Publications.
Edwards, J.A. and Lampert, M.D. (Eds.) 1993. Talking Data: Transcription and Coding in Discourse Research. Hillsdale, NJ: Lawrence Erlbaum Associates.
Francis, W.N. and Kucera, H. 1964. Manual of Information to Accompany A Standard Corpus of Present-Day Edited American English. Providence, RI: Department of Linguistics, Brown University.
Francis, W.N. and Kucera, H. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.
Fries, U., Müller, V. and Schneider, P. (Eds.) 1997. From Aelfric to the New York Times. Amsterdam: Rodopi.
Garside, R., Leech, G. and McEnery, A. (Eds.) 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
Garside, R., Leech, G. and Sampson, G. (Eds.) 1987. The Computational Analysis of English: A Corpus-Based Approach. London: Longman.
Gerbig, A. 1997. Lexical and Grammatical Variation in a Corpus: A Computer-Assisted Study of Discourse on the Environment. London: Peter Lang Publishing.
Ghadessy, M., Henry, A. and Roseberry, R.L. (Eds.) 2001. Small Corpus Studies and ELT: Theory and Practice. Amsterdam/Philadelphia: John Benjamins.
Granger, S. and Tyson, S.P. (Eds.) 2003. Extending the Scope of Corpus-Based Research: New Applications, New Challenges. Amsterdam: Rodopi.
Granger, S., Hung, J. and Tyson, S.P. (Eds.) 2002. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins.
Greenbaum, S. (Ed.) 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Greene, B. and Rubin, G. 1971. Automatic Grammatical Tagging of English. Technical Report. Providence, RI: Department of Linguistics, Brown University.
Halliday, M.A.K. 1987. Spoken and Written Modes of Meaning. In Comprehending Oral and Written Language. San Diego, CA: Academic Press.
Halliday, M.A.K. 1989. Spoken and Written Language. Oxford: Oxford University Press.
Halliday, M.A.K. and Hasan, R. 1976. Cohesion in English. London: Longman.
Halteren, H.V. (Ed.) 1999. Syntactic Word Class Tagging. Dordrecht: Kluwer Academic Press.
Hickey, R. and Puppel, S. (Eds.) 1996. Language History and Linguistic Modelling: A Festschrift for Jacek Fisiak. Vol. 2. Berlin: Mouton de Gruyter.
Hofland, K. and Johansson, S. 1982. Word Frequencies in British and American English. Bergen: Norwegian Computing Centre for the Humanities.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Hutchins, W.J. 1986. Machine Translation: Past, Present, and Future. Chichester: Ellis Horwood.
Ilson, R.F. (Ed.) 1986. Lexicography: An Emerging International Profession. Manchester: Manchester University Press.
Jensen, J.T. 1990. Morphology: Word Structure in Generative Grammar. Amsterdam: John Benjamins.
Jespersen, O. 1909-1949. A Modern English Grammar on Historical Principles. 7 Vols. London: Allen and Unwin.
Johansson, S. and Hofland, K. (Eds.) 1982. Computer Corpora in English Language Research. Bergen: Norwegian Computing Centre for the Humanities.
Johansson, S. and Stenström, A-B. (Eds.) 1991. English Computer Corpora: Selected Papers and Research Guide. Berlin: Mouton de Gruyter.
Katamba, F. 1993. Morphology. London: Macmillan Press.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. New York: Addison Wesley Longman.
Kenny, A.J.P. 1982. The Computation of Style. Oxford: Pergamon Press.
Kettemann, B. and Marko, G. (Eds.) 2002. Teaching and Learning by Doing Corpus Analysis. Language and Computers: Studies in Practical Linguistics 42. Amsterdam-Atlanta, GA: Rodopi.
Kilgarriff, A. and Palmer, J. (Eds.) 2000. Computers and the Humanities: Special Issue on Word Sense Disambiguation. Vol. 34, No. 1.
Kirk, J.M. (Ed.) 2000. Corpora Galore: Analyses and Techniques in Describing English. Amsterdam-Atlanta, GA: Rodopi.
Kucera, H. and Francis, W.N. 1967. Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.
Kytö, M., Ihalainen, O. and Rissanen, M. (Eds.) 1988. Corpus Linguistics, Hard and Soft: Proceedings of the 8th International Conference on English Language Research on Computerised Corpora. Amsterdam: Rodopi.
Lancashire, I., Carol, E. and Meyer, C.F. (Eds.) 1997. Synchronic Corpus Linguistics. Bergen, Norway: ICAME.
Landau, S.I. 2001. Dictionaries: The Art and Craft of Lexicography. Cambridge: Cambridge University Press.
Leech, G., Myers, G. and Thomas, J. (Eds.) 1995. Spoken English on Computer: Transcription, Markup and Applications. Harlow: Longman.
Levy, M. 1997. Computer Assisted Language Learning. Oxford: Oxford University Press.
Ljung, M. (Ed.) 1997. Corpus-Based Studies in English: Papers from the 17th International Conference on English Language Research Based on Computerized Corpora. Amsterdam: Rodopi.
MacWhinney, B. 1991. The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum.
Mair, C. and Hundt, M. (Eds.) 2000. Corpus Linguistics and Linguistic Theory. Amsterdam-Atlanta, GA: Rodopi.
McArthur, T. 1981. Longman Lexicon of Contemporary English. London: Longman.
McCarthy, J. 1982. Formal Problems in Semitic Phonology and Morphology. New York: Garland.
McCarthy, M. 1998. Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McEnery, T. and Wilson, A. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.
McEnery, T., Rayson, P. and Wilson, A. (Eds.) 2002. A Rainbow of Corpora: Corpus Linguistics and the Languages of the World. München: Lincom Europa.
Meyer, C.F. 2002. English Corpus Linguistics. Cambridge: Cambridge University Press.
Miller, G.A. 1951. Language and Communication. New York: McGraw-Hill.
Nelson, G., Wallis, S. and Aarts, B. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam/Philadelphia: John Benjamins.
Oakes, M.P. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Ooi, V.B.Y. 1997. Computer Corpus Lexicography. Edinburgh: Edinburgh University Press.
Oostdijk, N. and de Haan, P. (Eds.) 1994. Corpus-Based Research into Language. Amsterdam-Atlanta, GA: Rodopi.
Partington, A. 1998. Patterns and Meanings: Using Corpora for English Language Research and Teaching. Amsterdam/Philadelphia: John Benjamins.
Percy, C., Meyer, C.F. and Lancashire, I. (Eds.) 1996. Synchronic Corpus Linguistics. Amsterdam-Atlanta, GA: Rodopi.
Peters, B.P., Collins, P. and Smith, A. (Eds.) 2002. New Frontiers of Corpus Research. Language and Computers. Amsterdam-Atlanta, GA: Rodopi.
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Ravin, Y. and Leacock, C. (Eds.) 2000. Polysemy: Theoretical and Computational Approaches. New York: Oxford University Press.
Schütze, H. 1997. Ambiguity Resolution in Language Learning: Computational and Cognitive Models. Cambridge: Cambridge University Press.
Selting, M. and Couper-Kuhlen, E. (Eds.) 2001. Studies in Interactional Linguistics. Amsterdam/Philadelphia: John Benjamins.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Souter, C. and Atwell, E. (Eds.) 1993. Corpus-Based Computational Linguistics. Amsterdam: Rodopi.
Sperberg-McQueen, C.M. and Burnard, L. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: ACH-ACL-ALLC Text Encoding Initiative.
Stenström, A-B., Andersen, G. and Hasund, I.K. 2002. Trends in Teenage Talk: Corpus Compilation, Analysis and Findings. Amsterdam: John Benjamins.
Stubbs, M. 1996. Text and Corpus Analysis: Computer-Assisted Studies of Language and Culture. Oxford: Blackwell Publishers.
Summers, D. 1991. Longman/Lancaster English Language Corpus: Criteria and Design. Harlow: Longman.
Svartvik, J. (Ed.) 1990. The London-Lund Corpus of Spoken English: Description and Research. Lund Studies in English 82. Lund: Lund University Press.
Svartvik, J. (Ed.) 1992. Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin/New York: Mouton de Gruyter.
Tannen, D. (Ed.) 1982. Spoken and Written Language: Exploring Orality and Literacy. Norwood, NJ: Ablex Publishing Corporation.
Thomas, J. and Short, M. (Eds.) 1996. Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. London and New York: Addison Wesley Longman.
Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins.
Vera, D.E.J. (Ed.) 2002. A Changing World of Words: Studies in English Historical Lexicography, Lexicology and Semantics. Amsterdam: Rodopi.
Véronis, J. (Ed.) 2000. Parallel Text Processing: Alignment and Use of Translation Corpora. Dordrecht: Kluwer Academic Publishers.
Wichmann, A., Fligelstone, S., McEnery, T. and Knowles, G. (Eds.) 1997. Teaching and Language Corpora. London: Longman.
Young, S. and Bloothooft, G. (Eds.) 1997. Corpus-Based Methods in Language and Speech Processing. Vol. II. Dordrecht: Kluwer Academic Publishers.