Joon-Choul Shin scite author profile

et al. 2019

IEEE Access

Although deep neural networks have recently led to great achievements in machine translation (MT), various challenges are still encountered during the development of Korean-Vietnamese MT systems. Because Korean is a morphologically rich language and Vietnamese is an analytic language, neither has clear word boundaries. The high rate of homographs in Korean causes word ambiguities, which cause problems in neural MT (NMT). In addition, because Korean-Vietnamese is a low-resource language pair, there is no freely available, adequate Korean-Vietnamese parallel corpus that can be used to train translation models. In this paper, we manually established a lexical semantic network for the special characteristics of Korean as a knowledge base that was used for developing our Korean morphological analysis and word-sense disambiguation system: UTagger. We also constructed a large Korean-Vietnamese parallel corpus, in which we applied the state-of-the-art Vietnamese word segmentation method RDRsegmenter to Vietnamese texts and UTagger to Korean texts. Finally, we built a bi-directional Korean-Vietnamese NMT system based on the attention-based encoder-decoder architecture. The experimental results indicated that UTagger and RDRsegmenter could significantly improve the performance of the Korean-Vietnamese NMT system, achieving 27.79 BLEU points and 58.77 TER points in the Korean-to-Vietnamese direction and 25.44 BLEU points and 58.72 TER points in the reverse direction.

INDEX TERMS Korean-Vietnamese machine translation, Korean-Vietnamese parallel corpus, lexical semantic network, morphological analysis, neural machine translation, word sense disambiguation.

I. INTRODUCTION

Neural machine translation based on the attention-based encoder-decoder model [1], [2] has emerged as the dominant paradigm in MT. It has achieved state-of-the-art performance in the translation of language pairs that have large amounts of training parallel corpora, such as English-French [3] and English-German [4]. However, it has shown poor translation quality in low-resource language pairs where training parallel corpora are scarce [5], [6]. Korean-Vietnamese is a low-resource language pair, and Korean-Vietnamese MT systems need to be built to serve

The associate editor coordinating the review of this manuscript and approving it for publication was Yang Zhen.

Effect of Word Sense Disambiguation on Neural Machine Translation: A Case Study in Korean

et al. 2018

IEEE Access

Neural Machine Translation Enhancements through Lexical Semantic Network

et al. 2018

UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study

et al. 2020

Applied Sciences

Machine translation (MT) has recently attracted much research on various advanced techniques (e.g., statistical and deep learning-based) and has achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from the lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English dataset and a Korean-Vietnamese dataset. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word ambiguity in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical-based and deep learning-based neural MT systems. The experimental results demonstrated that high-quality MT systems (in terms of Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are available for free download and usage.
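
To illustrate why word-sense annotation helps with homographs, the following is a minimal Python sketch of the general idea: each ambiguous surface form is rewritten as a sense-specific token before the corpus is passed to an MT system. It is not UTagger's method or output format. The lexicon `SENSE_LEXICON`, the cue words, the sense codes, and the `word_NN` tag format are all hypothetical and chosen only for this example, and the cue-overlap rule is a deliberately simple stand-in for the sub-word conditional probability and knowledge-based methods named in the abstract.

```python
# Toy illustration only: the lexicon, cue words, sense codes, and the
# "word_NN" tag format are assumptions for this sketch, not UTagger's
# real data, algorithm, or output format.
SENSE_LEXICON = {
    "눈": {
        "01": {"보다", "감다"},    # hypothetical sense "eye": cues "see", "close"
        "04": {"내리다", "오다"},  # hypothetical sense "snow": cues "fall", "come"
    },
}


def annotate(tokens):
    """Append a sense code to each token that appears in the toy lexicon."""
    annotated = []
    for i, token in enumerate(tokens):
        senses = SENSE_LEXICON.get(token)
        if senses is None:
            # Not a known homograph: keep the token unchanged.
            annotated.append(token)
            continue
        # Every other token in the sentence counts as context.
        context = set(tokens[:i] + tokens[i + 1:])
        # Choose the sense whose cue words overlap the context most;
        # iterating in sorted order makes ties resolve to the lowest code.
        best = max(sorted(senses), key=lambda code: len(senses[code] & context))
        annotated.append(f"{token}_{best}")
    return annotated


if __name__ == "__main__":
    print(annotate(["눈", "내리다"]))
    print(annotate(["눈", "감다"]))
```

After annotation, the two senses of the same surface form become distinct vocabulary items, so a translation model can learn a separate target-side translation for each.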

Korean-Vietnamese Neural Machine Translation with Named Entity Recognition and Part-of-Speech Tags

IEICE Trans. Inf. & Syst.

et al. 2020