Original Research

Orthographic measures of language distances between the official South African languages

P.N. Zulu, G. Botha, E. Barnard
Literator | Vol 29, No 1 | a106 | DOI: https://doi.org/10.4102/lit.v29i1.106 | © 2008 P.N. Zulu, G. Botha, E. Barnard | This work is licensed under CC Attribution 4.0
Submitted: 25 July 2008 | Published: 25 July 2008

About the author(s)

P.N. Zulu, Human Language Technologies Research Group, CSIR & Department of Electrical and Computer Engineering, University of Pretoria, South Africa
G. Botha, Human Language Technologies Research Group, CSIR & Department of Electrical and Computer Engineering, University of Pretoria, South Africa
E. Barnard, Human Language Technologies Research Group, CSIR & Department of Electrical and Computer Engineering, University of Pretoria, South Africa

Abstract

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically.
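The idea of turning classifier confusions into a similarity measure can be sketched as follows. This is only an illustration under stated assumptions, not the authors' system: the abstract does not name the classifier, so a nearest-profile classifier using cosine similarity over character trigram counts is swapped in here, and the function names, the choice of n = 3, and the symmetrised confusion rate are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=3):
    """Count overlapping character n-grams, padding with a space at each end."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(sample, profiles, n=3):
    """Assumed classifier: pick the language whose profile is closest to the sample."""
    feats = ngram_counts(sample, n)
    return max(profiles, key=lambda lang: cosine(feats, profiles[lang]))


def confusion_matrix(validation, profiles, n=3):
    """Tally predicted language per true language over (true_lang, text) pairs."""
    langs = sorted(profiles)
    counts = {t: {p: 0 for p in langs} for t in langs}
    for true_lang, sample in validation:
        counts[true_lang][classify(sample, profiles, n)] += 1
    return counts


def confusion_similarity(counts, a, b):
    """Assumed similarity: mean of the two directed confusion rates between a and b."""
    total_a = sum(counts[a].values())
    total_b = sum(counts[b].values())
    rate_ab = counts[a][b] / total_a if total_a else 0.0
    rate_ba = counts[b][a] / total_b if total_b else 0.0
    return (rate_ab + rate_ba) / 2
```

Here `profiles` would map each language name to the `ngram_counts` of its training text, and `validation` would be a held-out list of labelled text samples. Languages that the classifier often mistakes for one another receive a higher similarity score.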

We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
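A minimal sketch of the Levenshtein-based comparison is given below. It assumes each language is represented by a word list aligned item by item with the other language's list, and that each word distance is normalised by the length of the longer word. Both choices are assumptions made for illustration and are not taken from the abstract, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def language_distance(words_a, words_b):
    """Mean length-normalised Levenshtein distance over aligned word pairs."""
    if len(words_a) != len(words_b):
        raise ValueError("word lists must be aligned and of equal length")
    total = 0.0
    for wa, wb in zip(words_a, words_b):
        longest = max(len(wa), len(wb))
        total += levenshtein(wa, wb) / longest if longest else 0.0
    return total / len(words_a) if words_a else 0.0
```

Computing `language_distance` for every pair of languages would give a symmetric distance matrix, which could then be passed to a hierarchical clustering or multidimensional scaling routine of the reader's choice.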

Keywords

Clustering; Language Distances; Language Identification; Levenshtein Distance; N-Gram
