The African Wordnet Project (AWN) aims to build wordnets for five African languages: Setswana, isiXhosa, isiZulu, Sesotho sa Leboa (also referred to as Sepedi or Northern Sotho) and Tshivenda. Currently, the African Wordnets are developed manually following the so-called expand model, which is based on the structure of the English Princeton WordNet (PWN). This labour-intensive work must be performed by linguistic experts, guided by considerations such as the level of lexicalisation of a term in the African language. Up to now, linguists have been responsible for identifying and translating appropriate synsets without much help from electronic resources because, in the case of African languages, even basic resources such as computer-readable and electronic bilingual wordlists are usually not freely available. Methods to speed up the manual development of synsets and ease the workload of the human language experts were recently investigated. These centred on utilising the minimal amount of information available in bilingual dictionaries to identify synsets in the PWN that should be included in the AWN, transferring information from dictionaries to the wordnet and presenting the potential synsets to linguists for final approval and inclusion in the wordnets. In this article, we describe the methodology developed for building the African Wordnets, a potentially significant resource for natural language processing applications. We investigate the available resources that could be exploited and the resources that had to be developed, and we present initial results and future plans.
A wordnet is an electronic lexical database consisting of words that are grouped into sets of synonyms called synsets and linked by conceptual-semantic and lexical relations (Miller
WordNet is a semantic dictionary that was designed as a network, partly because representing words and concepts as an interrelated system seems to be consistent with evidence for the way speakers organise their mental lexicons.
An example of a very comprehensive synset from the Princeton WordNet (PWN) (Princeton University,
The first sense for the noun synset ‘hand’ in PWN.
Since the 1990s, wordnets have been built for more than 150 languages worldwide, including many that are genetically and typologically unrelated to the original English wordnet. The first step towards creating cross-lingual wordnets was EuroWordNet (Vossen
Development of a wordnet typically follows one of two distinct methods, as discussed by Ordan and Wintner (
Although a wordnet is accessible to human users via a web browser for the study of lexical structure and lexicalisation patterns, wordnets are also an essential resource for natural language processing applications that, for instance, require lexical disambiguation. Semantic relations in a wordnet can be exploited for word sense discrimination (cf. the 2013 shared task for SemEval, reported on by Navigli, Jurgens & Vannella
In light of the foregoing description of wordnets as significant resources for natural language processing applications, the aim of this article is to present various strategies used for building the African Wordnets, as well as explaining the available resources that were taken advantage of and the resources that had to be developed in the case of these under-resourced languages. In the next section, we briefly present the African Wordnets and discuss their status quo. Then, various development strategies are discussed critically and illustrated with concrete examples. We conclude and point to future work in the final section.
The languages in this project are considered resource scarce compared to most other languages listed by the Global WordNet Association (
Words usually incorporate both prefixes and suffixes, and there can be several of each. This makes it hard to identify the root by mechanical means, as the root could be the first, second, third, or even a later morpheme in a word. The complexities involved are exacerbated by the fact that a considerable number of affixes, especially prefixes, have allomorphic forms.
Although prototypes of rule-based morphological analysers have been developed for the two languages mentioned, these are not yet freely available (cf. Bosch & Pretorius
The purpose of the African Wordnet Project (AWN) is the development of aligned wordnets for African languages spoken in South Africa (i.e. languages belonging to the Bantu language family) as multilingual knowledge resources which could be extended to include a wide variety of related languages also from other parts of Africa. Linking such wordnets to one another and to the many global wordnets makes cross-linguistic research and development possible. The first step towards developing such a rich resource for African languages was a training workshop for linguists, lexicographers and computer scientists which took place in 2007. As a direct result, development of wordnet prototypes for five official South African languages commenced as the AWN. Currently, the project includes isiXhosa, isiZulu, Setswana, Sesotho sa Leboa and Tshivenda
An example in DEBVisDic (seatla ~ hand).
Because of the resource scarceness of African languages, it was decided to follow the expand model for the development of the African Wordnet. As indicated by Ordan and Wintner (
Status quo of the African Wordnet Project.
Language | Synsets | Definitions | Usage examples |
---|---|---|---|
Northern Sotho (NSO) | 8412 | 1178 | 5253 |
Tshivenda (VEN) | 4270 | 209 | 4270 |
IsiZulu (ZUL) | 10 782 | 2179 | 5112 |
IsiXhosa (XHO) | 14 715 | 2198 | 7015 |
Setswana (TSN) | 15 803 | 3515 | 7203 |
The only other wordnet covering a South African language is for Afrikaans (Kotzé
Princeton WordNet – 117 659 synsets for English (
FinnWordNet – 120 449 synsets for Finnish (
plWn – 178 000 synsets for Polish (
Chinese WordNet – 150 400 synsets (
MultiWordnet (
Given the data scarceness of the African languages, manual development has been the only feasible option for continued development up to now; however, experiments to exploit the relatively little data available for the five African languages in our project have already begun. The next section describes each of our development strategies in more detail.
When following the expand model in wordnet development, the most important consideration is to decide how to identify the concepts that should be included in the wordnet first. Because manual development is a slow and labour-intensive task, one would aim at including concepts that are used frequently and in a broad spectrum of domains. Such a resource can be found in the Princeton Core Concepts list (Available at
Lindén and Niemi (
During the first development phases, the AWN followed suit and used the extended Common Base Concepts list as well as the Princeton Core Concepts list to extract English synsets for linguists to include and translate into the African languages concerned. However, in the case of the AWN, it soon became clear that a more localised approach was needed. The seed lists described above contain many concepts that are not lexicalised in the African context. Linguists were forced to create new terminology or do time-consuming searches in their own text collections and online to match the foreign concepts. This not only resulted in many lengthy descriptions for unfamiliar terms being included as synonyms but also hampered progress as our inexperienced team were discouraged by the time-consuming work.
Some slight meaning differences between concepts in the African languages from those captured in PWN also came to light. The Setswana word
Developers of wordnets in other languages seemed to experience the same difficulties with non-lexicalised terminology (cf. Vincze & Almási
In
Examples of Princeton Core Concepts and Common Base Concepts not lexicalised in African languages.
Concepts list with ID in PWN | African language translations |
---|---|
NSO: diaparata |
|
ZUL: isiphihli |
|
NSO: mohuta wa kgano: 4 | |
NSO: tikologo ya mohlomphegi: 1 | |
ZUL: isikhundla seyeli: 1 |
SUMO (Suggested Upper Merged Ontology) and MILO (Mid-level Ontology) are used to organise synsets into common categories in a hierarchical manner. The Semantic Domains section discusses these categories in more detail.
Following the example of the HuWN (Vincze & Almási
Some linguists working on the AWN quickly gained confidence and a thorough understanding of the goal of a wordnet. These linguists realised that adhering to the lists discussed in the Base Concepts section would not result in a truly African database but would instead amount to nothing more than a translation of foreign (European) terminology. The Setswana language team in particular soon ventured beyond these lists and began including synsets that the researchers found interesting and applicable to other areas of their work. Linguists would start by including the most prototypical sense of a frequent word and allow this sense to guide them to the next. Because of this organic style, the Setswana wordnet includes many figurative meanings and unexpected relations in the ontology structure. The Setswana team also simultaneously included adjectives and verbs semantically related to the nouns that they were developing. One such example can be seen in the Setswana word
Because this style of expansion did not suit all linguists, we continued to supply lists of seed terms, but tried to extract those synsets in the PWN that had a solid localised base in the African languages. The most rudimentary way of doing this was to provide a continually updated list of terms that other languages in the AWN had already included in their wordnets. Languages such as isiXhosa and isiZulu that belong to the same language group show similar morphological and orthographic patterns as demonstrated by Pretorius and Bosch (
As mentioned previously, Tshivenda was later added as a fifth language when the development for the other four languages had already been under way for some time. To ease and speed up the development, the linguists were provided not only with the localised base concepts list as described in the Corpus Frequencies section, but also with the completed synsets from Sesotho sa Leboa as examples. The choice of language for this fast-tracking was a purely practical consideration: both linguists working on Tshivenda also had a very good knowledge of Sesotho sa Leboa and felt confident using these data when creating new synsets for their language. Tshivenda linguists therefore not only had a list of lexicalised terms in English to start incorporating into their wordnet; the seed list was also enhanced with the lemmas, usage examples and definitions in Sesotho sa Leboa. This approach worked exceptionally well, and the Tshivenda wordnet grew to nearly 5000 synsets within 3 years.
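The enrichment of a seed list with material from an already developed wordnet can be pictured as a simple join on the shared PWN identifier. The following is only a minimal sketch under stated assumptions: the function name `enrich_seed_list`, the field names and the in-memory dictionaries are hypothetical and do not describe the project's actual tooling; the two Sesotho sa Leboa entries are taken from the verification spreadsheet examples in this article, and definitions are left empty because none are quoted here.

```python
def enrich_seed_list(seed, pivot_wordnet):
    """Attach pivot-language data to each seed concept via its PWN identifier.

    seed: list of (eng_id, english_lemma) pairs.
    pivot_wordnet: dict mapping eng_id to a dict with optional keys
    'lemma', 'usage' and 'definition' (hypothetical layout).
    """
    rows = []
    for eng_id, english in seed:
        pivot = pivot_wordnet.get(eng_id, {})  # empty if the pivot lacks this synset
        rows.append({
            "eng_id": eng_id,
            "english": english,
            "pivot_lemma": pivot.get("lemma", ""),
            "pivot_usage": pivot.get("usage", ""),
            "pivot_definition": pivot.get("definition", ""),
        })
    return rows


# Sesotho sa Leboa entries as they appear in the spreadsheet examples.
nso_wordnet = {
    "ENG20-11896052-n": {
        "lemma": "moamandêlê",
        "usage": "Kenyo ya moamandele e na le koko ye e jewago.",
    },
    "ENG20-06096415-n": {
        "lemma": "alefabete",
        "usage": "Ge motho a nyaka go ngwala le go peleta ka nepagalo "
                 "o swanetše go tseba alefabete.",
    },
}

seed_list = [
    ("ENG20-11896052-n", "almond tree"),
    ("ENG20-06096415-n", "alphabet"),
    ("ENG20-12847449-n", "acre"),  # no pivot entry in this toy data
]

enriched = enrich_seed_list(seed_list, nso_wordnet)
```

Concepts without a pivot entry are kept with empty fields, so the linguist still sees the English concept even where no example from the related language is available.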
As the project progressed, more resources became available, for instance, via the RMA (2013). These resources could be used to extract our own lists based on real-world parallel corpora for the languages included in AWN, and therefore allowed us to follow new methods. To test this approach, a multilingual parallel corpus, including all 11 official South African languages, was acquired from the RMA. The English version of the parallel corpus contained 50 000 tokens and was used to compare the African languages’ data with the Princeton Core Concepts. From the multilingual corpus, we extracted a frequency list for Tshivenda and compared the 5000 most frequent terms in the multilingual African wordlist with the list of (English) base and core concepts mentioned above.
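The comparison of a corpus frequency list with a seed list can be sketched as follows. This is an illustrative sketch only, not the procedure used in the project: the function names, the regular-expression tokenisation and the toy English text and seed list are all assumptions made for the example.

```python
from collections import Counter
import re


def frequency_list(text):
    """Count lower-cased word tokens in a plain-text corpus."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def split_by_seed_list(freqs, seed_concepts, top_n):
    """Split the top_n most frequent tokens into those that are on the
    seed list and those that occur only in the corpus."""
    top = [token for token, _ in freqs.most_common(top_n)]
    seeds = {concept.lower() for concept in seed_concepts}
    shared = [t for t in top if t in seeds]
    corpus_only = [t for t in top if t not in seeds]
    return shared, corpus_only


# Toy text and seed list, invented for illustration only.
sample = "the minister said the department will fund the school and the clinic"
seed = ["school", "hand", "department"]

freqs = frequency_list(sample)
shared, corpus_only = split_by_seed_list(freqs, seed, top_n=5)
```

Terms that are frequent in the corpus but absent from the seed list are candidates for localised concepts, while seed concepts that never surface among the frequent terms may be poorly lexicalised in the domain covered by the corpus. In practice a stop-word filter and, for conjunctively written languages, morphological preprocessing would be needed before counting.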
The frequency list extracted from the abovementioned multilingual corpus includes concepts that reflect unique African language usage, but it is also skewed in terms of domain representation. Most of the data in the parallel corpus were sourced from government domain web pages and freely available online newspapers. Because the data were mostly sourced from the web, a platform that is quite new for the African languages, the list also does not reflect older but still acceptable word forms. The domains included also do not provide many figurative interpretations (see Eiselen & Puttkammer
It should, however, be noted that the disjunctive orthography of Tshivenda lends itself to straightforward extraction of frequency lists from corpora, particularly in the case of nouns. Extraction of frequency lists in isiZulu and isiXhosa, with their conjunctive orthography, is more complex and requires preprocessing, such as morphological analysis of text corpora, to identify word roots (cf. Bosch & Pretorius
Development was significantly faster in the teams that focussed on lexicalised terminology and followed the direction in which each term led them to explore further meanings and senses. The organic growth style, supported by seed lists extracted from local corpora, suited the AWN team. At the same time, some teams needed more guidance in the form of seed lists than others, as the sheer quantity of work still to be done for the AWN was overwhelming.
While individual methods and workflows were respected and linguists were encouraged to follow whichever method suited their situation (taking time constraints, research involvement and experience into consideration), the problem remained that we aimed to create an African wordnet where a significant set of synsets would at least be shared across all languages in the project. According to Anderson, Pretorius and Kotzé (
Niles and Pease (
This approach seemed to work especially well for encouraging collaboration among the language groups. Although the number of synsets developed did not increase significantly, constructive discussions dealing with the comparison of linguistic phenomena took place and resulted in various research outputs (cf. Mabusela
For the first 5 years of development, linguists were responsible for identifying and translating appropriate synsets without much help from electronic resources. Over that period, the African Wordnets grew by an average of only 1000 synsets per language per year (see Griesel & Bosch
Recently, research was done to speed up the manual development of synsets in the AWN in order to ease the workload of the human language experts. The investigations centred on utilising the minimal amount of information available in limited bilingual dictionaries to identify, semi-automatically, synsets in the PWN that could be included in the AWN. After appropriate and still missing synsets have been identified, key pieces of information from the dictionary can be transferred to the wordnet and presented to linguists for final approval and inclusion in the wordnets.
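The general shape of such a semi-automatic linking step can be sketched as below. This is a simplified illustration, not the project's implementation: the function names, the in-memory `pwn_index` (English literal mapped to candidate PWN synsets) and the placeholder entry without a match are assumptions, while the lemmas, identifiers and definitions are those shown in the verification spreadsheet examples in this article.

```python
def propose_links(dictionary_entries, pwn_index):
    """Look up each English translation in a PWN index and return
    candidate rows for manual verification, plus unmatched entries."""
    proposals, unmatched = [], []
    for lemma, english in dictionary_entries:
        candidates = pwn_index.get(english.strip().lower(), [])
        if not candidates:
            unmatched.append((lemma, english))
        for eng_id, definition in candidates:
            proposals.append({
                "lemma": lemma,
                "english": english,
                "eng_id": eng_id,
                "definition": definition,
                "match": "",  # to be filled in by a linguist: "yes" or "no"
            })
    return proposals, unmatched


def approved(rows):
    """Keep only the rows that a linguist has marked as a correct link."""
    return [row for row in rows if row["match"].strip().lower() == "yes"]


# Toy PWN index built from entries quoted in this article.
pwn_index = {
    "almond tree": [("ENG20-11896052-n",
                     "any of several small bushy trees having pink or white "
                     "blossoms and usually bearing nuts")],
    "animal": [("ENG20-00012748-n",
                "a living organism characterised by voluntary movement")],
}

entries = [
    ("moamandêlê", "almond tree"),
    ("belekedzo", "animal"),
    ("lemma-x", "gloss not in index"),  # hypothetical entry with no PWN match
]

proposals, unmatched = propose_links(entries, pwn_index)

# Simulate the manual verification recorded in the spreadsheet examples.
proposals[0]["match"] = "yes"
proposals[1]["match"] = "no"
to_add = approved(proposals)
```

Keeping the `match` column empty until a linguist fills it in means that nothing enters the wordnet on the strength of the automatic lookup alone; rejected or unmatched entries remain available for later review.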
For the experiments described here, a few basic bilingual dictionaries that were made available for research purposes were used. These resources ranged in scope from a few hundred terms in a bilingual wordlist with little more than a translated lemma to a more comprehensive bilingual dictionary with at least a part of speech tag and some indication of the meaning. Many of the dictionaries were not in machine-readable format and required extensive proofreading to ensure a usable data source. The dictionaries were also often older manuscripts and therefore did not include newer terminology or word forms (cf. the Setswana-English dictionary by Brown [1925] that is freely available from
Similar studies using bilingual dictionaries have been conducted for a variety of languages. Oliver (
The biggest difference between these languages and the African languages, however, is the wealth and quantity of data contained in dictionaries which are freely accessible, often in machine-readable formats. Most of the entries in the dictionaries used in the above studies had at least searchable definitions and examples for each lemma. In the case of African languages, even basic resources such as computer-readable and electronic dictionaries are not always freely available. Given this resource scarceness, we had to develop a semi-automatic method of extracting possible synsets from the data listed above. It was decided to retain manual verification in the methodology because the available data were either very limited in size or outdated and would, therefore, be more difficult to map to the PWN.
As described in the Introduction section, a basic synset is made up of a literal, a part of speech tag and the different semantic relations deemed necessary by the SUMO and MILO categorisation. It is also linked to the PWN by a unique identification code (ENG ID). By virtue of this ENG ID, the five African language wordnets are then connected to form a multilingual resource. Utilising the minimal amount of information available in the electronic resources listed above (sometimes as little as a lemma and its translation), we identified synsets in the PWN as potential links. One such example from the Sesotho sa Leboa dictionary is ‘almond tree’ with its translation
Examples of the spreadsheets used in the manual verification of the linking technique.
Language | English | English definition and identification | Match? (yes or no) | If yes: NSO usage example | If yes: VEN usage example |
---|---|---|---|---|---|
moamandêlê | almond tree | ENG20-11896052-n: any of several small bushy trees having pink or white blossoms and usually bearing nuts | yes | Kenyo ya moamandele e na le koko ye e jewago. | - |
alefabete | alphabet | ENG20-06096415-n: a character set that includes letters and is used to write a language | yes ( | Ge motho a nyaka go ngwala le go peleta ka nepagalo o swanetše go tseba alefabete. | - |
moagôtlhatlaganô | storey | ENG20-03243815-n: structure consisting of a room or set of rooms comprising a single level of a multilevel building | no ( | - | |
agere | acre | ENG20-12847449-n: a unit of area (4840 square yards) used in English-speaking countries | yes | - | Tsimu ya Vho-Vele ndi khulu, ndi agere mbili |
volenga | arum lily | ENG20-11047703-n: South African plant widely cultivated for its showy pure white spathe and yellow spadix | yes | - | Maluvha a volenga a na muvhala mutshena |
babalasi | hangover | ENG20-13628315-n: disagreeable aftereffects from the use of drugs (especially alcohol) | yes | - | Denga o farwa nga babalasi nge a nwa halwa vhunzhi mulovha |
belekedzo | animal | ENG20-00012748-n: a living organism characterised by voluntary movement | no ( | - | - |
The methodology was kept simple while utilising as much of the dictionaries as possible. Mappings where linguists marked ‘no’ will be evaluated at a later stage and might still be included in the wordnet, but linked to a different PWN counterpart. Although the addition of a manual verification step seems unnecessary given the promising results (see
New synsets added semi-automatically.
Language | Nouns in resource | Linked nouns | Successful links (%) |
---|---|---|---|
Setswana | 905 | 786 | 86.8 |
isiZulu | 382 | 345 | 90.3 |
isiXhosa | 1294 | 1108 | 85.6 |
Tshivenda | 5117 | 3218 | 62.8 |
Sesotho sa Leboa | 316 | 301 | 95.2 |
Total added | 8014 | 5758 | 71.8 |
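The success rates in the table follow directly from the two count columns, as the short check below illustrates. The variable and function names are assumptions made for this sketch; the counts and published percentages are copied from the table.

```python
# (language, nouns in resource, linked nouns, published success rate in %)
rows = [
    ("Setswana", 905, 786, 86.8),
    ("isiZulu", 382, 345, 90.3),
    ("isiXhosa", 1294, 1108, 85.6),
    ("Tshivenda", 5117, 3218, 62.8),
    ("Sesotho sa Leboa", 316, 301, 95.2),
]


def success_rate(nouns, linked):
    """Percentage of dictionary nouns that were successfully linked."""
    return 100.0 * linked / nouns


total_nouns = sum(nouns for _, nouns, _, _ in rows)
total_linked = sum(linked for _, _, linked, _ in rows)
overall = success_rate(total_nouns, total_linked)

# Largest gap between a recomputed rate and the published figure.
max_gap = max(abs(success_rate(n, l) - published) for _, n, l, published in rows)
```

The recomputed totals match the last row of the table, and each recomputed rate agrees with the published figure to within 0.1 of a percentage point. The overall rate is lower than most of the per-language rates because the Tshivenda resource contributes well over half of all nouns.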
In addition to the basic synset, each sense should be further enriched by a usage example in the target language showing the use of the sense in context. In most wordnets of languages which are highly resourced, usage examples are semi-automatically extracted from available (tagged) corpora as demonstrated, for instance, by Broda, Maziarz and Piasecki (
In the case of the AWN, linguists did not use corpora to find usage examples, but either created their own examples or translated the English usage examples if these were available. For example, in the case of:
buffet (ENG20-07108586-n), there is no usage example in the PWN, but in the Setswana synset, a usage example has been provided, viz.
newspaper (ENG20-07573103-n), the English usage example is not translated into isiZulu, but a new usage example is created: standin
A first attempt was then made to fast-track the process for the isiZulu wordnet by using the recently compiled isiZulu Wortschatz corpus (Universität Leipzig
Extraction of usage example from online corpus.
The search on word form (
Much has been written about the resource scarceness of the African languages (cf. the resource audit performed by Grover, Van Huyssteen & Pretorius
The development strategies used for building a first version of the AWN for isiXhosa, isiZulu, Setswana, Sesotho sa Leboa and Tshivenda were described. Despite the varying strategies implemented to build wordnets for the five African languages concerned, the similarities shared on levels such as morphology or grammar and semantics allow the language teams to learn from one another, to share and thus to fast-track the development of the individual wordnets in this way.
The dilemma of under-resourced African languages further called for the implementation of various methods during the early stages of development. For example, much of the initial work was done manually following the expand model from the PWN. As the team gathered more experience and suitable lexical resources became available, more localised guidance could be given in the form of frequency-based seed terms and semi-automatic linking of lemmas from bilingual wordlists and the PWN. Experiments to speed up the collection of usage examples from online corpora also show promising results.
Each of the various development strategies plays a part in creating the unique AWN (
Summary of different components used in creating AWN.
During the development, a few questions also arose as to the best practice for handling conceptual and lexical gaps which exist between English and the African languages. Bentivogli and Pianta (
The AWN is currently undergoing extensive quality assurance. Aspects that need attention include verifying that each African language concept is linked to the correct English PWN synset and sense, as well as spellchecking and normalising the content to ensure uniformity. Smrz (
Because of the limited availability of lexicographic and basic language resources for the African languages, wordnet construction presents a challenging and time-consuming manual task for linguists. As it stands, notwithstanding the different strategies implemented in the development of wordnets for the five African languages concerned, the AWN provides a solid base for future development of new synsets, expansion of the synsets with usage examples and definitions and inclusion of further African languages. It is foreseen that this first version will soon become a useful tool in the creation of more complex applications.
The authors acknowledge the South African National HLT Network, Department of Arts and Culture, and Women in Research Fund (University of South Africa) for providing funding in the various phases of the AWN project; Christiane Fellbaum (Princeton University) for constructive feedback on an earlier draft of this article; and the AWN Development Team for linguistic expertise.
The authors declare that they have no financial or personal relationship(s) that may have
S.E.B. was the project leader, linguistic coordinator and senior researcher in the African Wordnet Project. She conceptualised the idea for the research, extracted language-specific examples, analysed the relevant linguistic data and wrote large parts of the manuscript. M.G. was the project manager, technical coordinator and research assistant in the African Wordnet Project. She co-developed large parts of the manuscript and performed experiments of a technical nature.
The ISO 639-2 codes as found on