Although deep neural networks have recently led to great achievements in machine translation (MT), various challenges are still encountered during the development of Korean-Vietnamese MT systems. Because Korean is a morphologically rich language and Vietnamese is an analytic language, neither have clear word boundaries. The high rate of homographs in Korean causes word ambiguities, which causes problems in neural MT (NMT). In addition, as a low-resource language pair, there is no freely available, adequate Korean-Vietnamese parallel corpus that can be used to train translation models. In this paper, we manually established a lexical semantic network for the special characteristics of Korean as a knowledge base that was used for developing our Korean morphological analysis and word-sense disambiguation system: UTagger. We also constructed a large Korean-Vietnamese parallel corpus, in which we applied the state-of-the-art Vietnamese word segmentation method RDRsegmenter to Vietnamese texts and UTagger to Korean texts. Finally, we built a bi-directional Korean-Vietnamese NMT system based on the attentionbased encoder-decoder architecture. The experimental results indicated that UTagger and RDRsegmenter could significantly improve the performance of the Korean-Vietnamese NMT system, achieving remarkable results by 27.79 BLEU points and 58.77 TER points in Korean-to-Vietnamese direction and 25.44 BLEU points and 58.72 TER points in the reverse direction. INDEX TERMS Korean-Vietnamese machine translation, Korean-Vietnamese parallel corpus, lexical semantic network, morphological analysis, neural machine translation, word sense disambiguation. I. INTRODUCTION Neural machine translation based on the attention-based encoder-decoder model [1], [2] has emerged as the dominant paradigm in MT. It has achieved state-of-the-art performance in the translation of language pairs that have large amounts of training parallel corpora, such as English-French [3] and English-German [4]. However, it has shown poor translation quality in low-resource language pairs where training parallel corpora are scarce [5], [6]. Korean-Vietnamese is a low-resource language pair, and Korean-Vietnamese MT systems need to be built to serve The associate editor coordinating the review of this manuscript and approving it for publication was Yang Zhen.
Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, the research on it involving low-resource languages such as Korean often suffer from the lack of openly available bilingual language resources. In this research, we built the open extensive parallel corpora for training MT models, named Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs of Korean causes an ambiguous word issue in MT. To address this problem, we developed a powerful word-sense annotation system based on a combination of sub-word conditional probability and knowledge-based methods, named UTagger. We applied UTagger to UPC and used these corpora to train both statistical-based and deep learning-based neural MT systems. The experimental results demonstrated that using UPC, high-quality MT systems (in terms of the Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) score) can be built. Both UPC and UTagger are available for free download and usage.
Since deep learning was introduced, a series of achievements has been published in the field of automatic machine translation (MT). However, Korean-Vietnamese MT systems face many challenges because of a lack of data, multiple meanings of individual words, and grammatical diversity that depends on context. Therefore, the quality of Korean-Vietnamese MT systems is still sub-optimal. This paper discusses a method for applying Named Entity Recognition (NER) and Part-of-Speech (POS) tagging to Vietnamese sentences to improve the performance of Korean-Vietnamese MT systems. In terms of implementation, we used a tool to tag NER and POS in Vietnamese sentences. In addition, we had access to a Korean-Vietnamese parallel corpus with more than 450K paired sentences from our previous research paper. The experimental results indicate that tagging NER and POS in Vietnamese sentences can improve the quality of Korean-Vietnamese Neural MT (NMT) in terms of the BiLingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) score. On average, our MT system improved by 1.21 BLEU points or 2.33 TER scores after applying both NER and POS tagging to the Vietnamese corpus. Due to the structural features of language, the MT systems in the Korean to Vietnamese direction always give better BLEU and TER results than translation machines in the reverse direction.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.