Few Japanese–Chinese bilingual corpora are large enough to serve as training data for neural machine translation (NMT), and fewer still include spoken language such as daily conversation. In this research, we attempt to construct a Japanese–Chinese bilingual corpus of a certain scale by crawling subtitle data for movies and TV series from websites. We calculated BLEU scores for the constructed corpus, WCC-JC (Web Crawled Corpus of Japanese and Chinese), and for the other corpora used for comparison. We also manually evaluated the translations produced by a model trained on WCC-JC to confirm the quality and effectiveness of the corpus.
Movie and TV subtitles are frequently employed in natural language processing (NLP) applications, but there are limited Japanese-Chinese bilingual corpora accessible as a dataset to train neural machine translation (NMT) models. In our previous study, we effectively constructed a corpus of a considerable size containing bilingual text data in both Japanese and Chinese by collecting subtitle text data from websites that host movies and television series. The unsatisfactory translation performance of the initial corpus, Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by the limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus that includes about 1.4 million sentence pairs, an 87% increase compared with WCC-JC 1.0. As a result, WCC-JC 2.0 is now among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess the performance of WCC-JC 2.0, we calculated the BLEU scores relative to other comparative corpora and performed manual evaluations of the translation results generated by translation models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.