Jiannan Mao scite author profile

Jiannan Mao

2 Publications

4 Citation Statements Received

25 Citation Statements Given

Affiliations

Gifu University

Publications

WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation

Zhang, Tian², Mao, et al. 2022

Applied Sciences

Currently, there are only a limited number of Japanese–Chinese bilingual corpora of a sufficient amount that can be used as training data for neural machine translation (NMT). In particular, there are few corpora that include spoken language such as daily conversation. In this research, we attempt to construct a Japanese–Chinese bilingual corpus of a certain scale by crawling the subtitle data of movies and TV series from the websites. We calculated the BLEU scores of the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and the other compared corpora. We also manually evaluated the translation results using the translation model trained on the WCC-JC to confirm the quality and effectiveness.

WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

Tian¹, Mao, Han, et al. 2023

Electronics

Movie and TV subtitles are frequently employed in natural language processing (NLP) applications, but there are limited Japanese-Chinese bilingual corpora accessible as a dataset to train neural machine translation (NMT) models. In our previous study, we effectively constructed a corpus of a considerable size containing bilingual text data in both Japanese and Chinese by collecting subtitle text data from websites that host movies and television series. The unsatisfactory translation performance of the initial corpus, Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by the limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus that includes about 1.4 million sentence pairs, an 87% increase compared with WCC-JC 1.0. As a result, WCC-JC 2.0 is now among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess the performance of WCC-JC 2.0, we calculated the BLEU scores relative to other comparative corpora and performed manual evaluations of the translation results generated by translation models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.
