On the development of a tagset for Northern Sotho with special reference to the issue of standardisation

E. Taljard; G. Faaß; U. Heid; D.J. Prinsloo

doi:10.4102/lit.v29i1.103

Original Research

About the author(s)

E. Taljard, Department of African Languages, University of Pretoria, South Africa
G. Faaß, Department of African Languages, University of Pretoria, South Africa
U. Heid, Department of African Languages, University of Pretoria, South Africa
D.J. Prinsloo, Department of African Languages, University of Pretoria, South Africa

Abstract

Work with corpora in the South African Bantu languages has until now been limited to the use of raw corpora. Such corpora, however, have limited functionality. Thus the next logical step in any NLP application is the development of software for the automatic tagging of electronic texts. The development of a tagset is one of the first steps in corpus annotation. The authors of this article argue that the design of a tagset cannot be isolated from the purpose of the tagset, or from the place of the tagset and its design within the larger architecture of corpus annotation. Usage-related aspects therefore feature prominently in the design of the tagset for Northern Sotho. The article explains why the proposed tagset is biased towards human readability rather than machine readability, motivates the choice of a stochastic tagger, and discusses the relationship between tokenising, tagging, morphological analysis and parsing. In order to account, at least to some extent, for the morphological complexity of Northern Sotho at the tagging level, a multilevel annotation is opted for: the first level comprises obligatory information, and the second comprises optional and recommended information. Finally, aspects of standardisation are considered against the background of reuse, of sharing of resources, and of possible adaptation for use by other disjunctively written South African Bantu languages. It is not the aim of this article to evaluate the results of any tagging procedure using the proposed tagset; it only describes the design and motivates the choices made with regard to it. However, an evaluation is in progress and results will be published in the near future (cf. Faaß et al., s.a.).
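
The two-level annotation described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the class `Tag`, the helper `parse_tag`, the `_` separator and the labels `N`, `01` and `V` are all hypothetical and are not taken from the tagset proposed in the article. The sketch shows a tag whose first level is obligatory and whose second level may be omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tag:
    """A two-level tag: level 1 is obligatory, level 2 is optional."""
    level1: str                   # obligatory information (hypothetical label)
    level2: Optional[str] = None  # optional, recommended refinement

    def __post_init__(self) -> None:
        # Reject tags that lack the obligatory first level.
        if not self.level1:
            raise ValueError("level 1 is obligatory and must not be empty")

    def label(self, full: bool = True) -> str:
        """Return the tag as a string, with or without the second level."""
        if full and self.level2:
            return f"{self.level1}_{self.level2}"
        return self.level1


def parse_tag(text: str) -> Tag:
    """Split a label such as 'N_01' at the first '_' (assumed separator)."""
    level1, _, level2 = text.partition("_")
    return Tag(level1, level2 or None)


# Hypothetical labels, chosen for illustration only.
noun = parse_tag("N_01")
print(noun.label())            # both levels
print(noun.label(full=False))  # obligatory level only
```

Keeping the second level optional means a consumer that only needs coarse categories can ignore it, while the full label stays readable for a human annotator.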

Keywords

NLP Application; Part-Of-Speech Tagging; POS-tagging; Tagset; Northern Sotho

All articles published in this journal are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, unless otherwise stated.

Literator | ISSN: 0258-2279 (PRINT) | ISSN: 2219-8237 (ONLINE)