
Issues and methods for the creation of corpora in under-resourced languages applied to the classification of texts for learning Burmese


Author: Wong, Jennifer
Under the direction of: Mathieu Valette and San San Hnin Tun
Language: French (text in French)

Keywords: Burmese, Foreign language learning, Under-resourced languages, Corpus creation, Lexical frequency, Readability.


Read the thesis.


Finding reading material suitable for learners of less commonly taught languages is a common problem for both learners and teachers. Natural language processing offers promising methods to facilitate the selection process. Because these methods require language-specific training corpora, and such languages are also less well-resourced, corpus quality matters even more. We have found it necessary to take into account not only the particularities of the language and how its writing system is computerized, but also the context in which the corpus is to be used, considering aspects such as orthography, studies in linguistics and lexicography, cultural aspects and even the teaching tradition, as students are probably more influenced by existing resources when these are scarce. This thesis looks at the application of a text evaluation method for learners of Burmese as a foreign language. We detail the creation of two types of corpora: authentic texts and didactic resources. The second type informs how the authentic texts are segmented into minimal units of analysis, or “words”, a necessary preprocessing step as Burmese does not delimit words with spaces. We also take into account cultural aspects and syllable n-gram frequency in training a dictionary-based segmentation tool. The authentic text corpora are then used to create a general lexical frequency list, using the averaged reduced frequency method to account for dispersion. This list is then used to train a support vector machine that orders texts by increasing difficulty using solely lexical data, a method that is promising for less-resourced languages.
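To illustrate the dispersion-aware frequency measure mentioned in the abstract, here is a minimal sketch of averaged reduced frequency (ARF) as it is commonly defined: for a word occurring f times in a corpus of N tokens, the cyclic distances between consecutive occurrences are each capped at v = N / f, summed, and divided by v. The function names and the toy data are assumptions made for this illustration; this is not the thesis's own code, and it presumes the text has already been segmented into tokens.

```python
from collections import defaultdict


def averaged_reduced_frequency(positions, corpus_size):
    """ARF of one word, given the token positions where it occurs.

    positions: 0-based token indices of the word's occurrences.
    corpus_size: total number of tokens in the corpus (N).
    """
    f = len(positions)
    if f == 0:
        return 0.0
    positions = sorted(positions)
    v = corpus_size / f  # average distance if occurrences were evenly spread
    # Cyclic distances: the first one wraps around from the last occurrence.
    distances = [positions[0] + corpus_size - positions[-1]]
    distances += [b - a for a, b in zip(positions, positions[1:])]
    # Each distance contributes at most v, so clustered occurrences count less.
    return sum(min(d, v) for d in distances) / v


def arf_table(tokens):
    """Compute ARF for every distinct token in an already segmented corpus."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    n = len(tokens)
    return {tok: averaged_reduced_frequency(pos, n) for tok, pos in positions.items()}


# Hypothetical toy data: the same raw frequency (4) in a 100-token corpus.
even = averaged_reduced_frequency([0, 25, 50, 75], 100)     # evenly spread
clumped = averaged_reduced_frequency([10, 11, 12, 13], 100)  # one cluster
```

With these toy positions, the evenly spread word keeps an ARF equal to its raw frequency, while the clustered word falls close to 1. This is why a word concentrated in a single text ranks lower in an ARF-based list than a word of equal raw frequency that appears throughout the corpus.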