Orthographic measures of language distances between the official South African languages

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams: the confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.


Introduction
The development of objective metrics to assess the distances between different languages is of great theoretical and practical importance. To date, subjective measures have generally been employed to assess the degree of similarity or dissimilarity between different languages (Gooskens & Heeringa, 2004; Van-Hout & Münstermann, 1981; Van-Bezooijen & Heeringa, 2006); such subjective decisions are, for example, the basis for classifying some language varieties as separate languages and others as dialects of one another. Languages are without doubt complex; they differ in vocabulary, grammar, writing format, syntax and many other characteristics. This complexity makes the construction of objective comparative measures between languages difficult. Even if one knows intuitively, for example, that English is closer to French than it is to Chinese, by how much is it closer? And what are the objective factors that allow one to assess these distances?
These questions bear substantial similarities to the analogous questions that have been asked about the relationships between different species in the science of cladistics. As in cladistics, the most satisfactory answer would be a direct measure of the amount of time that has elapsed since the languages first split from their most recent common ancestor. Also as in cladistics, it is hard to measure this from the available evidence, and various approximate measures have to be employed instead. In the biological case, recent decades have seen tremendous improvements in the accuracy of measurements as it has become possible to compare DNA sequences. In linguistics, the analogue of DNA measurements is historical information on the evolution of languages, and the more easily measured, though indirect, measurements (akin to the biological phenotype) are either the textual or acoustic representations of the languages in question.
In the current article, we focus on distance measures derived from text; we apply two different techniques, namely language confusability based on n-gram statistics and the Levenshtein distance between orthographic word transcriptions, in order to obtain measures of dissimilarity among a set of languages. These methods are used to obtain language groupings, which are represented graphically using two standard statistical techniques (dendrograms and multi-dimensional scaling). This allows us to compare the methods against known linguistic facts and thus assess their relative reliability.
Our evaluation is based on the eleven official languages of South Africa. These languages fall into two distinct groups, namely the Germanic group (represented by English and Afrikaans) and the South African Bantu languages, which belong to the South Eastern Bantu group. The South African Bantu languages can further be classified in terms of different sub-groupings: Nguni (consisting of Zulu, Xhosa, Ndebele and Swati), Sotho (consisting of Southern Sotho, Northern Sotho and Tswana), and a pair that falls outside these sub-families (Tsonga and Venda).
We believe that an understanding of these language distances is of inherent interest, but also of great practical importance. For purposes such as language learning, the selection of target languages for various resources, and the development of human language technologies, reliable knowledge of language distances would be of great value. Consider, for example, the common situation of an organisation that wishes to publish information relevant to a particular multi-lingual community, but with insufficient funding to do so in all the languages of that community. Such an organisation can be guided by knowledge of language distances to make an appropriate choice of publication languages.
The following sections describe n-grams and the Levenshtein distance in more detail. Thereafter we present an evaluation on the eleven official languages of South Africa, highlighting language groupings and proximity patterns. We close with a discussion of the results, interesting directions and a brief summary.

Theoretical background
Orthographic transcriptions are one of the most basic types of annotation used for speech transcription. Orthographic transcriptions of speech are important in most fields of research concerned with spoken language. The orthography of a language refers to the set of symbols used to write a language and includes the writing system of a language. English, for example, has an alphabet of 26 letters for both consonants and vowels. However, each English letter may represent more than one phoneme, and each phoneme may be represented by more than one letter. In the current research, we investigate two different ways to use orthographic distances for the assessment of language similarities.

Language identification using n-grams
Text-based language identification (LID) is of great practical importance, as there is a widespread need to automatically identify the language in which documents are written. A typical application is web searching, where knowledge of the language of a document or web page is valuable information for presentation to a user, or for further processing. The general topic of text-based LID has consequently been studied extensively, and a spectrum of approaches has been proposed, with the most important distinguishing factor being the depth of linguistic processing that is utilised.
Here we attempt to identify the languages by using simple statistical measures of the text under consideration. For example, statistics can be gathered from:
• letter sequences (Murthy & Kumar, 2006);
• presence of certain keywords (Giguet, 1995);
• frequencies of short words (Grefenstette, 1995); or
• unique or highly distinctive letters or short character strings (Souter et al., 1994).
Conventional algorithms from pattern recognition are then used to perform text-based LID based on these statistics.
N-gram statistics are a well-known choice for building statistical models (Cavnar & Trenkle, 1994; Beesley, 1998; Padro & Padro, 2004; Kruengkrai et al., 2005; Dunning, 1994). We have shown elsewhere (Botha & Barnard, 2007) that several factors influence the accuracy of LID using n-gram statistics, and those factors are undoubtedly important in the current application as well. For the current research we have not searched for the optimal configuration to assess the relationships between languages; rather, as we report below, a reasonable configuration was selected and employed consistently.
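To make the feature extraction concrete, a minimal sketch (not the authors' exact implementation) of counting character n-grams in Python is:

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Frequency counts of all character n-grams in a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# A 7-character string yields 7 - 3 + 1 = 5 overlapping 3-grams:
counts = ngram_counts("EXAMPLE")
```

These counts, collected over a text sample, form the feature vector used for classification.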

Levenshtein distance
There are several ways in which phoneticians have tried to measure the distance between two linguistic entities, most of which are based on the description of sounds via various representations. This section introduces one of the more popular sequence-based distance measures, the Levenshtein distance. In 1995 Kessler introduced the use of the Levenshtein distance as a tool for measuring linguistic distances between dialects (Kessler, 1995), and successfully applied the algorithm to the comparison of Irish dialects; in that case the strings were transcriptions of word pronunciations. The basic idea behind the Levenshtein distance is to imagine that one is rewriting or transforming one string into another. The rewriting is effected by basic operations, each of which is associated with a cost, as illustrated in Table 2.1 by the transformation of the string mošemane into the string umfana, which are orthographic transcriptions of the word boy in Northern Sotho and Zulu respectively. The Levenshtein distance between two strings can be defined as the least costly sum of operations needed to transform one string into the other. In Table 2.1 the transformations shown are associated with costs derived from the operations performed on the strings. The operations used were the deletion of a single symbol, the insertion of a single symbol, and the substitution of one symbol for another (Kruskal, 1983). The edit distance method was also taken up by Nerbonne et al. (1996), who applied it to Dutch dialects. Whereas Kruskal (1983) and Nerbonne et al. (1996) applied this method to phonetic transcriptions in which the symbols represented sounds, here the symbols are alphabetic letters. Gooskens and Heeringa (2004) calculated Levenshtein distances between fifteen Norwegian dialects and compared them to the distances as perceived by Norwegian listeners; this comparison showed a high correlation between the Levenshtein distances and the perceptual distances.
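With unit costs for the three operations, the distance can be computed with the standard dynamic-programming recurrence; the following sketch illustrates this (the costs used in the study's Table 2.1 may differ):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    (all at unit cost) needed to transform string a into string b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances between "" and prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# The word pair from Table 2.1: Northern Sotho vs. Zulu for "boy".
d = levenshtein("mošemane", "umfana")
```

Under unit costs the pair mošemane/umfana comes out at distance 6; the weighted costs of Table 2.1 may yield a different total.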

Language grouping
In using the Levenshtein distance measure, the distance between two languages is equal to the average of a sample of Levenshtein distances of corresponding word pairs. When we have n languages, the average Levenshtein distance is calculated for each possible pair of languages. For n languages, n x n pairs can be formed, and the corresponding distances are arranged in an n x n matrix. The distance of each language with respect to itself is found on the diagonal of the matrix, from the upper left to the lower right.
As this is a dissimilarity matrix, these diagonal values are always zero and therefore give no real information, so that only n x (n-1) distances are relevant. Furthermore, the Levenshtein distance is symmetric, implying that the distance between word X and word Y is equal to the distance between word Y and word X. This further implies that the distance between language X and language Y is equal to the distance between language Y and language X, and therefore the distance matrix is symmetric. We need to use only one half of it, which contains the distances of n x (n-1)/2 language pairs. Given the distance matrix, larger groupings can be investigated: hierarchical clustering methods are employed to classify the languages into related language groups using the distance matrix.
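The construction of the symmetric language-distance matrix can be sketched as follows; the word lists and the trivial length-difference "distance" in the example are purely illustrative stand-ins for the parallel word lists and the Levenshtein distance used in the study:

```python
def language_distance_matrix(wordlists, dist):
    """Average pairwise word distance for every language pair.
    wordlists maps each language name to a parallel list of words."""
    langs = sorted(wordlists)
    n = len(langs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # symmetric: only n(n-1)/2 pairs needed
            pairs = zip(wordlists[langs[i]], wordlists[langs[j]])
            d = sum(dist(a, b) for a, b in pairs) / len(wordlists[langs[i]])
            matrix[i][j] = matrix[j][i] = d  # fill both halves
    return langs, matrix

# Toy example with a length-difference "distance" (illustration only):
langs, m = language_distance_matrix(
    {"A": ["aa", "bbb"], "B": ["aaa", "bbb"]},
    dist=lambda x, y: abs(len(x) - len(y)))
```

The diagonal stays zero and the two halves of the matrix mirror each other, as described above.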
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, bioinformatics, image analysis, data mining and pattern recognition. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets, so that the data in each subset share some common trait according to a defined distance measure. The result of this grouping is usually illustrated as a dendrogram, a tree diagram used to illustrate the arrangement of the groups produced by a clustering algorithm (Heeringa & Gooskens, 2003).
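The hierarchical (agglomerative) clustering that underlies a dendrogram can be sketched in a few lines; this single-linkage version is an illustration only, as the study does not specify which linkage criterion was used:

```python
def single_linkage_merges(labels, dmatrix):
    """Repeatedly merge the two closest clusters (single linkage) and
    record each merge -- the information a dendrogram visualises."""
    clusters = [{i} for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance: closest pair of members
                d = min(dmatrix[i][j]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]), d))
        clusters[a] |= clusters[b]
        del clusters[b]
    return merges

# Toy 3-language matrix: X and Y are close, Z is far from both.
merges = single_linkage_merges(["X", "Y", "Z"],
                               [[0, 1, 5], [1, 0, 4], [5, 4, 0]])
```

The merge heights recorded here are exactly what the vertical axis of a dendrogram displays.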

Evaluation
This evaluation aims to present language groups of the eleven official languages of South Africa generated from similarity and dissimilarity matrices of the languages. These matrices are the results of n-gram language identification and Levenshtein distance measurements respectively. The diagrams provide visual representations of the pattern of similarities and dissimilarities between the languages.

LID text data
Texts from various domains in all eleven South African languages were obtained from D.J. Prinsloo of the University of Pretoria and by using a web crawler (Botha & Barnard, 2005). The data included text from various sources (such as newspapers, periodicals, books, the Bible and government documents) and the corpus therefore spans several domains.

Classification features
For either a fixed-length sample or an unbounded amount of text, the frequency counts of all n-grams were calculated. The characters that could be included in n-gram combinations were the space, the 26 letters of the Roman alphabet, the 14 additional special characters found in Afrikaans, Northern Sotho and Tswana, and the unique combination 'n, which functions as a single character in Afrikaans. No distinction was made between upper- and lower-case characters.

Support vector machine
The support vector machine (SVM) is a non-linear discriminant function that is able to generalise well, even in high-dimensional spaces. The classifier maps input vectors to a higher-dimensional space where a separating hyper-plane is constructed; the hyper-plane maximises the margin between the two classes (Burges, 1998). In real-world problems the data can be noisy, and a classifier would usually over-fit such data. For such data, the constraints on the classifier are relaxed by introducing slack variables, which improves overall generalisation (Cristianini & Shawe-Taylor, 2005).
The LIBSVM library (Chang & Lin, 2001) provides a full implementation of several SVMs. The size of the feature space grows exponentially with n, which leads to long training times and extensive resource usage as n becomes large; we therefore limited our classification features to 3-gram combinations only. Thus the feature dimension of the SVM is equal to the number of 3-gram combinations. Two language models were built: one with samples of fifteen characters from a training set of 200 000 characters per language, and the other with samples of 300 characters from the same training set. For the fifteen-character language model, a sample contained the frequency count of each 3-gram combination in the sample string of fifteen characters; for the 300-character model, a sample similarly contained the frequency count of each 3-gram combination in the sample string of 300 characters. Samples of the testing set were created using the same character window (namely fifteen or 300 characters) as was used to build the language model. After training the SVM language model, the test samples can be classified according to language.
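The construction of these fixed-window feature vectors can be sketched as follows (an illustration, not the exact preprocessing pipeline of the study; partial trailing windows are simply discarded here):

```python
from collections import Counter

def sample_features(text, window=15, n=3):
    """Cut text into fixed-length samples and map each sample to the
    frequency counts of its character n-grams."""
    samples = [text[i:i + window]
               for i in range(0, len(text) - window + 1, window)]
    return [Counter(s[j:j + n] for j in range(len(s) - n + 1))
            for s in samples]

# A 20-character string yields one complete 15-character sample,
# containing 15 - 3 + 1 = 13 three-gram observations.
feats = sample_features("die kat kom huis toe", window=15)
```

Each such counter, indexed by the fixed vocabulary of permitted 3-grams, becomes one input vector for the SVM.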
The SVM used an RBF kernel, and overlap penalties (Botha & Barnard, 2005) were employed to allow for non-separable data in the projected high-dimensional feature space. Sensible values for the two free parameters (kernel width h = 1 and margin-overlap trade-off C = 180, a large penalty for outliers) were found on a small set of data; these "reasonable" parameters were employed throughout our experiments. Classification is done in a "one-against-one" approach in which k(k-1)/2 classifiers are constructed (in our case 55 classifiers), each trained on data from two different classes. Classification is then done by a voting strategy: each binary classification is considered a vote for the winning class, all the votes are tallied, and the test sample is assigned to the class with the largest number of votes.
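The one-against-one voting scheme can be sketched independently of the underlying SVMs; the pairwise classifiers below are toy stand-ins (every pair containing "zu" votes for "zu"), not trained models:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, classifiers, sample):
    """Each of the k(k-1)/2 pairwise classifiers casts one vote;
    the class with the most votes wins."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[classifiers[(a, b)](sample)] += 1  # each call returns a or b
    return votes.most_common(1)[0][0]

# Toy pairwise classifiers over three language codes.
classes = ["af", "en", "zu"]
clfs = {(a, b): (lambda s, a=a, b=b: b if b == "zu" else a)
        for a, b in combinations(classes, 2)}
prediction = one_vs_one_predict(classes, clfs, "sample text")
```

With k = 11 languages, combinations(classes, 2) produces the 55 classifier pairs mentioned above.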

Confusion matrix
In the confusion matrix below (Table 3.1), each row represents the correct language of a set of samples, and the columns indicate the languages selected by the classifier. More samples on the diagonal of the matrix thus indicate better overall accuracy of the classifier. The matrix also serves as a similarity matrix: higher off-diagonal values reflect greater confusability, and therefore higher levels of similarity, between the corresponding pair of languages.

A graphical representation of language distances
The confusion matrices provide a clear indication of the ways the languages group into families. These relationships can be represented visually using graphical techniques. Multidimensional scaling (MDS) is a technique used in data visualisation for exploring the properties of data in high-dimensional spaces. The algorithm takes a matrix of similarities between items and assigns each item a location in a low-dimensional space, such that the distances in that space match the original ones as closely as possible. We used the confusion matrix as the similarity measure between languages, using the statistical package XLSTAT (XLSTAT, 2007). The confusion matrix was processed into a matrix of distances using the Pearson correlation coefficients between the rows, and input into the multidimensional scaling algorithm, which mapped the language similarities onto a 2-dimensional space.
Figure 3.1 shows the mapping that was created using the confusion matrix in Table 3.1. We can see that the languages from the same sub-families group together. The mapping using the fifteen-character text fragments shows a more definite grouping of the families than the mapping that uses the 300-character text fragments. In the fifteen-character mapping, the Nguni and Sotho languages are more closely related internally than the pair of Germanic languages, and within the Nguni languages Swati is somewhat distant from the other three. As expected, Venda and Tsonga are consistently separated from the other nine languages. In conjunction with multidimensional scaling, dendrograms also provide a visual representation of the pattern of similarities or dissimilarities among a set of objects. We again used the confusion matrix, processed into a matrix of distances using the Pearson correlation coefficients, as the similarity measure between languages, using the statistical package XLSTAT (XLSTAT, 2007).
Figure 3.2 illustrates the dendrograms derived from clustering the similarities between the languages as depicted by the confusion matrices in Table 3.1. The dendrogram using the fifteen-character text fragments shows four classes representing the previously defined language groupings: Nguni; Sotho; Venda and Tsonga; and English and Afrikaans. This dendrogram closely matches the language groupings described in Heine and Nurse (2000).

Language grouping using Levenshtein distance
Levenshtein distances were calculated using existing parallel orthographic word transcriptions of sets of 50 and 144 words from each of the eleven official languages of South Africa. The data was manually collected from various multilingual dictionaries and online resources. Initially, 200 common English words, mostly common nouns easily translated into the other ten languages, were chosen. From this set, those words having unique translations into each of the other ten languages were selected, resulting in the 144 words (and a subset of 50 of those words) that were used in the evaluations.

Distance matrix
Table 3.2 presents the distance matrices, containing the distances, taken pair-wise, between the different languages as calculated from the summed Levenshtein distances between the 50 and 144 target words. In contrast to the confusion matrices, lower numbers in these matrices reflect less dissimilarity between the selected pair of languages. The distance matrices again contain n x (n-1)/2 independent elements in the light of the symmetry of the distance measure.

Visual representation
As above, the relationships between the languages for the matrices derived from the Levenshtein distance are represented visually in Figures 3.3 and 3.4 using graphical techniques. Again, multidimensional scaling is used. However, in this case the algorithm uses distance matrices of dissimilarities as opposed to the confusion matrices of similarities. The language dissimilarities are mapped onto a 2-dimensional space (Figure 3.3).
Figure 3.3 shows the mappings generated using the distance matrices in Table 3.2. Here also, though in different quadrants, the languages from the same sub-families group together. The relative closeness within the Nguni and Sotho sub-families is not as clearly indicated in Figure 3.3(a) as in Figure 3.3(b) or Figure 3.1(b), and the individual languages appear more spaced out in the quadrants. As before, Venda and Tsonga are consistently separated from the other nine languages.

Conclusions
We have seen that both confusion matrices between languages resulting from text-based language identification and Levenshtein distance matrices can be effectively combined with MDS and dendrograms to represent language relationships. Both methods reflect the known family relationships between the languages being studied. The main conclusion of this research is therefore that statistical methods, based only on orthographic transcriptions, are able to provide useful objective measures of language similarities. It is clear that these methods can be refined further using other inputs such as phonetic transcriptions or acoustic measurements; such refinements are likely to be important when, for example, fine distinctions between dialects are required.
Each approach has its advantages and disadvantages. The Levenshtein distance measure does not require much data to perform a reasonable classification: with as few as 50 words per language, reasonable classification is possible. Also, the process of generating the distance matrix is not computationally taxing. However, this method is seen to be less discriminating in assessing language similarities: from the historical record (Heine & Nurse, 2000) it is clear, for example, that the tighter internal grouping of the Sotho and Nguni languages (as found with the LID-based approach) is more accurate. Similarly, the slightly larger separation of Swati from the other Nguni languages agrees with the anecdotal evidence on mutual intelligibility.
In a text-based LID system, high classification accuracy is a central goal. The size of the text fragment to be identified plays an important role in the accuracy achieved, since a larger text fragment can generally be identified more accurately. Hence, LID systems tend to use the longest text fragments available. However, for measuring language similarities, shorter text fragments may actually be preferable. In our experiments we found that the lower classification accuracy achieved on a smaller text fragment enables us to cluster the languages in a more discriminative fashion.
It would be most interesting to see whether closer agreement between these methods can be achieved by measuring Levenshtein distances between larger text collections, perhaps even parallel corpora rather than translations of word lists. Comparing these distance measures with measures derived from acoustic data is another pressing concern. Finally, it would be very valuable to compare various distance measures against other criteria for language similarity (e.g. historical separation or mutual intelligibility) in a rigorous fashion.

Figure 3.1: Multi-dimensional scale to represent similarities between languages calculated from the confusion matrices in Table 3.1

Figure 3.2: Dendrogram calculated from the confusion matrices of Table 3.1

Figure 3.3: Multi-dimensional scale to represent dissimilarities between languages calculated from the distance matrix in Table 3.2

Figure 3.4: Dendrogram calculated from the distance matrix of Table 3.2

An n-gram is a sequence of n consecutive letters. The n-grams of a string are gathered by extracting adjacent groups of n letters; for instance, the 2-grams of the string "example" are ex, xa, am, mp, pl and le. In n-gram based methods for text-based LID, frequency statistics of n-gram occurrences are used as features in classification. The advantage is that no linguistic knowledge needs to be gathered to construct a classifier. N-grams are also extremely simple to compute for any given text, which allows a straightforward trade-off between accuracy and complexity (through the adjustment of n), and they have been shown to perform well in text-based LID and related tasks in several languages.