Uyghur is an agglutinative language in which words are formed by attaching suffixes to a stem (or root). Because the vocabulary of an agglutinative language grows explosively, several morpheme-based language models are built and evaluated experimentally. A morpheme is the smallest meaning-bearing unit; in this research, the term refers to any prefix, stem, or suffix. A large-vocabulary ASR system is built on the basis of the Julius system, and ASR results for language models based on different units (word, morpheme, and syllable) are compared.

Keywords: Uyghur, morpheme segmenter, language modeling, ASR
I. UYGHUR LANGUAGE AND MORPHOLOGICAL UNITS

Uyghur belongs to the Turkic language family of the Altaic language system. At present, Uyghur is written in a modified Arabic script. There are 32 phonemes in Uyghur, 8 vowels and 24 consonants, and each phoneme is written with one character. Sentences in Uyghur consist of words, which are separated by spaces or punctuation marks. Uyghur words in turn consist of smaller morphological units with no delimiter between them.

(Example 1: morpheme and syllable segmentation)
Müshükning kelginini korgen chashqan hoduqup qachti. (The mouse, seeing the cat coming, was startled and ran away.)
Müshük+ning kelgen+i+ni kor+gen chashqan hoduq+up qach+ti. (morpheme sequence)
Mü+shük+ning kel+gi+ni+ni kor+gen chash+qan ho+du+qup qach+ti. (syllable sequence)

The morpheme structure of an Uyghur word is "prefix + stem + suffix1 + suffix2 + …". A root (or stem) is followed by zero or more suffixes; the longest chains contain about 10 suffixes or more. A few words take a single prefix before the stem, and only 7 prefixes are used in this research, since more are difficult to find. 108 suffix types are defined and collected according to their semantic and syntactic functions, and these expand to 305 surface forms. The surface realizations of the morphological structure are constrained and modified by a number of linguistic phenomena such as insertion, deletion, phonetic harmony, and disharmony (vowel assimilation, vowel weakening [1][2]).

Suffixes that change the meaning of a root are derivational suffixes; suffixes that change its syntactic role are inflectional suffixes. A root combined with derivational suffixes becomes a stem, and a bare root also counts as a stem, so the root set is included in the stem set. The words "stem" and "root" are sometimes used interchangeably.
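The three views of Example 1 can be turned into unit sequences with a few lines of Python. This is only a minimal sketch: it assumes "+" marks unit boundaries, as in the example, and renders the vowel in "kelginini" and "korgen" as a plain "e", which is an assumption about the transliteration. The helper names `split_units` and `flatten` are hypothetical and not part of the system described here.

```python
# Example 1 from the text. Writing the vowel as "e" is an assumed transliteration.
SURFACE = "Müshükning kelginini korgen chashqan hoduqup qachti"
MORPHEMES = "Müshük+ning kelgen+i+ni kor+gen chashqan hoduq+up qach+ti"
SYLLABLES = "Mü+shük+ning kel+gi+ni+ni kor+gen chash+qan ho+du+qup qach+ti"


def split_units(annotated: str, boundary: str = "+") -> list[list[str]]:
    """Split an annotated sentence into one list of sub-word units per word."""
    return [word.split(boundary) for word in annotated.split()]


def flatten(per_word: list[list[str]]) -> list[str]:
    """Concatenate the per-word unit lists into one token sequence."""
    return [unit for word in per_word for unit in word]


words = SURFACE.split()
morphemes = split_units(MORPHEMES)
syllables = split_units(SYLLABLES)

# Words whose morphemes do not concatenate back to the written form,
# i.e. where the surface realization differs from the morpheme sequence.
changed = [w for w, m in zip(words, morphemes) if "".join(m) != w]

print(len(words), len(flatten(morphemes)), len(flatten(syllables)))
print(changed)
```

For this sentence the sketch counts 6 words, 12 morphemes, and 16 syllables. The syllables concatenate back to every written word, while the morphemes of "kelginini" (kelgen+i+ni) do not, which illustrates the kind of surface modification described above.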
To preserve the versatile nature of the language, we keep different segmentation forms of the same word in our training corpus.

(Example 2: different morpheme segmentations of the same word)
oqutquchi (teacher) {stem} = oqut (teach) {root} + quchi (-er) {suffix}
yazghuchi = yaz (write) + ghuchi (-er)
hesablinidu = hesab+la+n+idu, hesab+lan+idu
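One way to hold several segmentations per word is a mapping from the written word to a list of morpheme tuples. The sketch below is illustrative only: the `PAIRS` data are taken from Example 2, while `build_lexicon` and the pair-based input format are assumptions, not the storage format used for the actual training corpus. The written word is stored explicitly rather than rebuilt from its morphemes, because the surface form can differ from the plain concatenation (hesab+la+n+idu is written "hesablinidu").

```python
from collections import defaultdict

# (written word, "+"-delimited segmentation) pairs from Example 2.
PAIRS = [
    ("oqutquchi", "oqut+quchi"),
    ("yazghuchi", "yaz+ghuchi"),
    ("hesablinidu", "hesab+la+n+idu"),
    ("hesablinidu", "hesab+lan+idu"),
]


def build_lexicon(pairs, boundary="+"):
    """Map each written word to all distinct segmentations seen for it."""
    lexicon = defaultdict(list)
    for surface, segmentation in pairs:
        morphemes = tuple(segmentation.split(boundary))
        if morphemes not in lexicon[surface]:  # skip exact duplicates
            lexicon[surface].append(morphemes)
    return dict(lexicon)


lexicon = build_lexicon(PAIRS)

# Words that keep more than one segmentation.
ambiguous = {word: segs for word, segs in lexicon.items() if len(segs) > 1}
print(ambiguous)
```

Here only "hesablinidu" ends up with two entries, one splitting "la+n" and one keeping "lan" as a single suffix.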