Challenges to Issues of Balance and Representativeness in African Lexicography *

Thapelo Joseph Otlogetswe, Information Technology Research Institute, University of Brighton, Brighton, United Kingdom and Department of English, University of Botswana, Gaborone, Botswana

Abstract: Modern dictionaries depend on corpora of different sizes and types for frequency listings, concordances and collocations, illustrative sentences and grammatical information. With the help of computer software, retrieving such information has become increasingly easy. However, the quality of the information retrieved for lexicographic purposes depends on what is put in at the stage of corpus construction. If corpora are not representative of the different language usages of a speech community, they may prove to be unreliable sources of lexicographic information. There are, however, issues in African languages which make many African corpora questionable. These issues include a lack of texts of different genres, the unavailability of balanced and representative written texts, a complete absence of spoken texts, and literacy problems in African societies. This article therefore explores the challenges to the construction of reliable corpora in African languages. It argues that African languages face particular challenges and that corpus research on them may require a different treatment from European and American corpus research. It concludes that balance and representativeness appear theoretically impossible to achieve in light of sociolinguistic research on the many existing language varieties, which are difficult to represent accurately in a corpus.
Keywords: AFRICAN LANGUAGES, BALANCE, BANK OF ENGLISH, BORROWING, BRITISH NATIONAL CORPUS, COBUILD, CODE-SWITCHING, COMPUTERS, CORPORA, DIALECT, DICTIONARIES, FREQUENCY, LANGUAGE VARIETY, REPRESENTATIVENESS, SETSWANA, SOCIOLINGUISTICS, SPEECH, TEXT

Summary: Challenges concerning issues of balance and representativeness in African lexicography. Modern dictionaries rely on corpora of different sizes and types for frequency lists, concordances and collocations, example sentences and linguistic information. With the help of computer software, retrieving such information has become increasingly easy. The quality of the information retrieved for lexicographic purposes, however, depends on the information input at the corpus-building stage. If corpora are not representative of the different language usages of a speech community, they may prove to be unreliable sources of lexicographic information. There are, however, issues in African languages that make many African corpora problematic. These issues include the shortage of texts of different genres, the unavailability of balanced and representative written texts, the complete absence of spoken texts, and literacy problems in African communities. This article therefore examines the different challenges concerning the building of reliable African-language corpora. It argues that African languages face particular challenges and that corpus research may require a different treatment compared with European and American corpus research. Finally, it concludes that issues of balance and representativeness appear theoretically impossible in light of the results of sociolinguistic research on the different existing language varieties, which are difficult to represent accurately in a corpus.

* This article is a revised version of a paper presented at the Eighth International Conference of the African Association for Lexicography, organised by the Department of German and Romance Languages, University of Namibia, Windhoek, Namibia, 7-9 July.

Lexikos 16 (AFRILEX-reeks/series 16: 2006)

Introduction

More and more lexicographers realise the inevitability of using a corpus or corpora in the compilation of dictionaries. Leech (1991: 8) defines a corpus as "a sufficiently large body of naturally occurring data of the language to be investigated". Renouf (1987: 1) brings the use of computers for storing and analysing corpora into the definition: "a collection of texts, of written or spoken words, which is stored and processed on computer for the purpose of linguistic research". McEnery and Wilson (1996: 24) similarly mention a reliance on computers in their definition of a corpus as "a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration".

Leech (1991: 5), however, insists that a corpus has to be differentiated from an "archive", the latter being a repository of available language materials and the former a systematic collection of material for given purposes. A corpus draws upon the resources of an archive, and therefore both are important. The systematic compilation of a structured corpus, however, is the primary objective. Leech thus points to the systematic collection of material as an important characteristic of a corpus. In this regard he does not conflate the substance for study with the tools used for its analysis and storage.
However, whether the insistence on systematicity is crucial to the definition of a corpus is open to debate. Perhaps a corpus should simply be seen as textual data collected for linguistic research, usually stored on computers for quick analysis. But the fact that it is machine-readable, although important for its analysis, is not what makes it a corpus: long before the introduction of computers there was much robust corpus research, as exemplified by Käding's 1897 German corpus of millions of words, compiled for collating the frequency distribution of letters and sequences of letters.

For ages, lexicographers contended with ways and means of producing authentic and reliable reflections of the lexicon. Most of them depended on their ability to remember words existing in the languages under study, something that De Schryver and Prinsloo (2000: 219) call the "random approach" and Kilgarriff (2000: 109) the "lexicographer's intuition". Others, in the Oxford tradition, depended on readers, who searched texts for occurrences of words and submitted these for lemmatisation in the dictionary. For many years, these readers' contributions made the Oxford English Dictionary (OED) the unparalleled authority on the English language. More than any other English dictionary existing at the time, it included words from different genres and from stylistic and regional varieties, with reliable etymological information. Later developments in lexicography showed that readers were not very reliable sources of dictionary material: not only was their processing of data too slow, but it was also impossible for them to deliver authoritative information on frequency across texts and genres (see the Longman Dictionary of Contemporary English, the Collins COBUILD English Dictionary or Kilgarriff (1997: 1)).
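To make concrete what "frequency across texts and genres" involves, the following is a minimal Python sketch of the kind of count that software can produce from a machine-readable corpus. The three sample sentences, the genre labels and the function names are invented for illustration; a real corpus would also need language-specific tokenisation, which this sketch does not attempt.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_by_genre(texts):
    """Map each genre to a Counter of word frequencies.

    `texts` is an iterable of (genre, text) pairs.
    """
    table = {}
    for genre, text in texts:
        table.setdefault(genre, Counter()).update(tokenize(text))
    return table


# Hypothetical toy corpus: (genre, text) pairs invented for this example.
sample = [
    ("newspaper", "The minister opened the new school."),
    ("fiction", "The rain fell and the river rose."),
    ("newspaper", "The school choir sang for the minister."),
]

table = frequency_by_genre(sample)

# Overall frequency list, obtained by adding the per-genre counters.
overall = sum(table.values(), Counter())
print(overall.most_common(3))

# Frequency of one word in each genre.
for genre, counts in table.items():
    print(genre, counts["school"])
```

Even on this toy data the per-genre counts differ sharply, which is the kind of information a reader collecting citations by hand cannot supply reliably.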
Over the past 20 years, corpus lexicography has grown rapidly, championed and popularised above all by the COBUILD (Collins Birmingham University International Language Database) group in Birmingham, led by John Sinclair. The earlier Birmingham school of corpus lexicography adhered strictly to the corpus as a source of dictionary evidence (Sinclair 1987). It was argued that corpora were the sole source of lemmatisation, frequency information and word lists. If a word was not in a corpus, it was not recognised as legitimate dictionary material. However, as corpus lexicography develops, there is a greater focus on the composition of corpora. Issues of balance and representativeness continue to engage theoretical and practical lexicographers. Researchers want to know which kinds of texts make up corpora and in what proportions. These questions are not trivial, since they bear on the credibility of any dependence on corpora in lexicography. The greatest challenge therefore lies not so much in what can be obtained from a corpus as in its construction.

Against this background, this article investigates the problems associated with the construction of corpora for dictionary making, particularly in many African contexts. It argues that one of the challenges facing the construction of robust corpora for language research is the poverty of data, that is, the lack of texts from which to construct corpora representative of the different instances of language usage in a specific speech community. High illiteracy levels in African countries also pose great challenges to researchers hoping to collect written texts read by specific populations. Added to this is the fact that, even where levels of literacy have increased, the literate members of a society read and write texts written in English or French and not in their native languages.
Even where such texts can be found in African languages, they mostly belong to certain genres, such as novels, plays and poetry, to the exclusion of others, such as newspapers and academic texts. Even if the use of such data is attempted, the researcher still has to contend with sanitised data, purified by the editorial policies and stylistic dictates of publishing houses and newspaper offices, which calls into question its authenticity as original and credible text. The problem of representing speech still stands as one of the great challenges not only to African lexicographic research but also to research in many Western countries. First, however, balance and representativeness must be examined.

Balance and Representativeness

Most recent corpus-based lexicographic research treats representativeness and balance (Ooi 1998) as standards of authenticity and robustness in corpus construction. A language corpus must be balanced and representative of the language from which it is extracted. By representativeness is meant "the extent to which a sample [text] includes the full range of variability in a population" (Biber 1993: 243), and as Summers (1993: 186) stresses, "unless the corpus is representative, it is ipso facto unreliable as a means of acquiring lexical knowledge". Therefore, for a corpus to be representative, it must reflect the typical cross-spectrum of language use of a defined language community or period (see Ooi 1998: 49). Summers's (1993) claim will be returned to, since it raises considerable difficulties, particularly for corpus building in many African contexts and for certain linguistic theories. A balanced corpus is one that includes a range of different text types in the proportions in which they occur in the language studied. What constitutes a balanced and representative corpus remains controversial.
The selection of language from different genres for inclusion in the language database is largely unresolved. The compilation of texts must ultimately capture language from a specified population from which a sample is taken, and must reflect how that particular language community uses language. This is significant since, as Summers (1993: 186, 190) points out, the results of corpus analysis must be generalised to the language community from which the samples were taken. Kennedy (1998: 94) argues for a pedagogical purpose to corpus research by noting that "high frequency of occurrence as determined by the analysis of texts should be a major determinant of lexical content of language instruction".

It is clear that balance and representativeness are related. A representative corpus must reflect the different genres of language use in a language community, while a balanced corpus should attempt to capture those genres in the ratios in which they occur in that community. This is obviously difficult to achieve, mainly because it is difficult to know precisely all the text types and their proportions of use in a population with its ever-changing dimensions. The difficulties are compounded when the building of a corpus of spoken language is attempted. As Kilgarriff (1997: 137) points out, dialectal varieties stand at different ratios to one another and should be represented within a corpus that attempts to capture accurately the characteristics of the language as a whole. It must also be asked whether spoken texts can be accurately sampled and represented along the same lines as written texts. How many words are being looked for, and what percentage of the spoken language do such words constitute? Whether spoken texts can be sampled in a representative manner is highly questionable.
Although a sample of Sengwaketse, Sekgatla, Sekwena and Sengwato can establish an acceptable representative percentage of the spoken form of these Setswana dialects, speech is a flood that refuses to be adequately accounted for numerically, for even while an attempt is made to quantify it, more of it is produced. Kennedy (1998: 62) casts doubt on whether the representativeness of a corpus can confidently be argued for:

"In light of the perspectives on variation offered by several decades of research in discourse analysis and sociolinguistics, it is not easy to be confident that a sample of texts can be thoroughly representative of all possible genres or even of a particular genre or subject field or topic."

By "perspectives on variation" Kennedy refers to the different speech varieties existing in a speech community. Problems arise in sampling standard against non-standard varieties; the various sociolects covering status, gender, ethnicity, age, occupation and others; different regional varieties, such as Sengwaketse, Sekgatla, Sekwena and Sengwato in the case of Botswana; and different registers, such as casual, formal, technical and others. Such variations are difficult to represent in a corpus. By noting this difficulty, Kennedy does not imply that representativeness should not be attempted, but rather that, theoretically, an attempt at representativeness may not conclusively capture the nuances of existing varieties as outlined by linguistic research. Because of practical constraints, such as a shortage of time and money, the unavailability of machine-readable text, and copyright restrictions, it is not always possible to assemble the representative and balanced corpus ideally wanted. It is precisely these problems that stand out as some of the major stumbling blocks in the African context of corpus construction.
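As a rough illustration of the notion of balance discussed in this section, the following minimal Python sketch compares the share of each text type in a corpus with a target share for the speech community. Every category name, word count and target share here is invented for the example; as argued above, the real proportions of text types in a community are precisely what is hard to establish.

```python
def proportions(word_counts):
    """Convert raw word counts per text type into shares that sum to 1."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("corpus is empty")
    return {text_type: count / total for text_type, count in word_counts.items()}


def balance_gaps(word_counts, target_shares):
    """Return observed share minus target share for every text type.

    A positive gap means the text type is over-represented in the corpus,
    a negative gap that it is under-represented.
    """
    observed = proportions(word_counts)
    text_types = sorted(set(observed) | set(target_shares))
    return {
        t: observed.get(t, 0.0) - target_shares.get(t, 0.0)
        for t in text_types
    }


# Hypothetical corpus composition (words per text type).
corpus_counts = {"fiction": 600_000, "newspapers": 300_000, "speech": 100_000}

# Hypothetical target shares; these are assumptions, not measured values.
target = {"fiction": 0.2, "newspapers": 0.2, "speech": 0.6}

gaps = balance_gaps(corpus_counts, target)
for text_type, gap in gaps.items():
    print(f"{text_type:<12}{gap:+.2f}")
```

The sketch only measures distance from a stated target. It says nothing about where the target should come from, which is the substantive difficulty raised by Kennedy.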
Two English Corpora

This section describes the composition of two influential corpora which many lexicographers and language researchers regard as examples of good corpora. What should particularly be noted is the percentage of spoken text against written text, since it is central to the arguments made later in this article.

In 1991, COBUILD launched the Bank of English (BoE), which currently has over 450 million words and continues to grow as more material is published and deposited into it. It forms the basis for the compilation of the COBUILD dictionaries (Sinclair 1991). The BoE does not claim any balance or representativeness of usage, but it does claim to provide evidence of the way everyday English is used. The spoken word is represented by transcriptions of everyday casual conversation, radio broadcasts, meetings, interviews, discussions, etc. However, even with its seemingly impressive 450 million words, the BoE is only a small sample of the speech humans produce on a daily basis.

The other extensively used corpus is the British National Corpus (BNC), a 100-million-word collection of samples of written and spoken British English of the late twentieth century, drawn from a wide range of sources and designed to represent a wide cross-section of current British English, both written and spoken (BNC website). Ninety per cent of the corpus consists of written texts, including extracts from regional and national newspapers, academic books, popular fiction, essays and letters (75% informative writing, from fields such as applied science, commerce and finance; 25% imaginative, i.e. literary and creative, works). Spoken texts, which include unscripted informal conversation, government meetings and radio shows, constitute only 10%. Of the texts in the corpus, 863 are transcribed from spoken conversation and monologues.
The BNC was developed by Oxford University Press, the Longman Group Ltd, Chambers Harrap, the Unit for Computer Research on the English Language (Lancaster University), the Oxford University Computing Services, and the British Library Research and Development Department. It has been used for a wide variety of research in language, including lexicography, as in the making of the third edition of the Longman Dictionary of Contemporary English.

The Primacy of Speech

It is widely accepted that children speak before they write and that speech is primary to human communication (Aitchison 1998). It is also generally agreed that in a speech community the spoken word is far more abundant than written text. Taking these linguistic arguments as a base and applying them to issues of balance and representativeness, it can be concluded that if corpus construction has to reflect the ratios between spoken and written texts, different text genres and various dialectal varieties, then the percentage of spoken language in a corpus has to be much greater than that of written language. Such a preponderance of spoken over written texts would approximate the ratio of spoken to written language in the real world and would be likely to produce corpora that accurately represent language as used in a speech community. However, in neither of the corpora discussed in the previous section does the percentage of spoken texts exceed that of written texts. Only ten per cent of the data of the BNC consists of spoken texts. Leech et al.
(2001: 1) recognise the inadequate representation of speech in the BNC, which contains about 90 per cent written data and 10 per cent spoken data:

"Although spoken language, as the primary channel of communication, should by rights be given more prominence than this, in practice this has not been possible, since it is a skilled and very time-consuming task to transcribe speech into the computer-readable orthographic text that can be processed to extract linguistic information. In view of this problem, these proportions were chosen as realistic targets which, given the size of the BNC, are also sufficiently large to be broadly representative."

According to Leech et al., the percentage of spoken text in the BNC was determined by what was feasible for the compilers and not by the proportion of speech to written language in a speech community. If the composition of corpora does not reflect the fact that the spoken word is more common in real life than the written text, the power and authority of corpora as sources of evidence for linguistic research are called into question.

A Newspaper versus the Purchase of a Pair of Shoes

While Kennedy (1998: 63) acknowledges the common occurrence of spe
