This study focuses on a method for sequential data augmentation to alleviate data sparseness problems. Specifically, we present corpus expansion techniques for enhancing the coverage of a language model. Recent recurrent neural network studies show that a seq2seq model can be applied to language generation tasks: it can generate new sentences from given input sentences. We present a method of corpus expansion using a sentence-chain-based seq2seq model. To train the seq2seq model, sentence chains are formed as triples of consecutive sentences. The first two sentences in a triple are fed to the encoder of the seq2seq model, while the last sentence becomes the target sequence for the decoder. Using only internal resources, evaluation results show an improvement of approximately 7.6% in relative perplexity over a baseline language model of Korean text. Additionally, in a comparison with a previous study, the sentence-chain approach reduces the size of the training data by 38.4% while generating 1.4 times the number of n-grams, with superior performance on English text.
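The triple construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the whitespace-joining of the two encoder sentences are assumptions for clarity.

```python
# Hypothetical sketch of sentence-chain triple construction for seq2seq training.
# A window of three consecutive sentences slides over a document: the first two
# sentences form the encoder input, the third is the decoder target.

def make_sentence_chain_triples(sentences):
    """Return (encoder_input, decoder_target) pairs from consecutive sentence triples."""
    pairs = []
    for i in range(len(sentences) - 2):
        # Joining the two context sentences with a space is an assumption;
        # the actual tokenization/concatenation scheme may differ.
        encoder_input = sentences[i] + " " + sentences[i + 1]
        decoder_target = sentences[i + 2]
        pairs.append((encoder_input, decoder_target))
    return pairs

doc = ["The model reads two sentences.", "It encodes their context.", "Then it generates a new sentence.", "The output expands the corpus."]
for enc, dec in make_sentence_chain_triples(doc):
    print(enc, "->", dec)
```

A document of n sentences thus yields n-2 training pairs, and the decoder's generated outputs become candidate sentences for expanding the corpus.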
Owing to the rising demand for second-language learning and advances in machine learning, there has been an increase in the need for spoken computer-assisted language learning (CALL) applications [1,2]. Moreover, with the spread of Korean popular culture overseas [3], the need for Korean language learning has prompted the development of such CALL applications for non-native Korean learners. Among spoken Korean CALL applications, this paper focuses on automatic speech recognition (ASR)-based proficiency assessment of non-native Korean speech. Non-native speech significantly degrades the performance of the ASR used in a spoken CALL system owing to the pronunciation variabilities in non-native speech [4,5]. Consequently, numerous research results have been reported on automatic proficiency assessment methods for non-native speech that is read aloud [6-13] and for spontaneous speech [14-17]. However, there has been limited research on proficiency assessment of non-native Korean speech [18]. Moreover, most research has focused on the analysis of pronunciation variabilities in non-native Korean speech. For instance, [19,20] analyze the pronunciation variabilities of Korean spoken by Japanese and Chinese learners using contrastive and