Ryokan Ri scite author profile

While machine translation of written text has come far in the past several years thanks to the increasing availability of parallel corpora and corpus-based training technologies, automatic translation of spoken text and dialogues remains challenging even for modern systems. In this paper, we aim to boost the machine translation quality of conversational texts by introducing a newly constructed Japanese-English business conversation parallel corpus. A detailed analysis of the corpus is provided along with examples that are challenging for automatic translation. We also experiment with adding the corpus to machine translation training data and show how the resulting system benefits from its use.


Zero-pronoun Data Augmentation for Japanese-to-English Translation

Ri, Nakazawa, Tsuruoka (2021)


For Japanese-to-English translation, zero pronouns in Japanese pose a challenge, since the model needs to infer and produce the corresponding pronoun in the English target sentence. Although fully resolving zero pronouns often requires discourse context, in some cases the local context within a sentence gives clues for inferring the zero pronoun. In this study, we propose a data augmentation method that provides additional training signals for the translation model to learn correlations between local context and zero pronouns. In machine translation experiments in the conversational domain, we show that the proposed method significantly improves the accuracy of zero pronoun translation.


Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models

Ri, Tsuruoka (2022)


Revisiting the Context Window for Cross-lingual Word Embeddings

Ri, Tsuruoka (2020)


Existing approaches to mapping-based cross-lingual word embeddings are based on the assumption that the source and target embedding spaces are structurally similar. The structures of embedding spaces largely depend on the co-occurrence statistics of each word, which the choice of context window determines. Despite this obvious connection between the context window and mapping-based cross-lingual embeddings, their relationship has been underexplored in prior work. In this work, we provide a thorough evaluation, in various languages, domains, and tasks, of bilingual embeddings trained with different context windows. The highlight of our findings is that increasing both the source and target window sizes improves the performance of bilingual lexicon induction, especially on frequent nouns.


Ryokan Ri

mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models

Designing the Business Conversation Corpus

Zero-pronoun Data Augmentation for Japanese-to-English Translation

Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models

Revisiting the Context Window for Cross-lingual Word Embeddings
