Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach

Sánchez-Cartagena, Víctor M.; Esplà-Gomis, Miquel; Pérez-Ortiz, Juan Antonio; Sánchez-Martínez, Felipe

doi:10.18653/v1/2021.emnlp-main.669

Cited by 14 publications

(12 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our framework -which extends preliminary work 1 reported in a conference paper by the same authors [13]does not require elaborate preprocessing steps, training additional systems, or data besides the available training parallel corpora. Experiments with ten low-resource translation tasks show that it systematically outperforms state-of-the art methods aimed at extending the support of the empirical data distribution.…”

Section: Introductionmentioning

confidence: 61%

“…4. This differs from previous work[13] in which transformations were applied to the original training samples during the pre-processing of the corpus, and therefore before training.…”

mentioning

confidence: 87%

“…For SeqMix, we also integrated the modifications in data processing and training loss published by their authors as a fairseq task. 13 Systems trained with MaTiLDA were fine-tuned on the original training samples after being trained on the combination of original and synthetic training samples. The results reported are those obtained with the model that maximizes BLEU on the development set.…”

Section: Trainingmentioning

confidence: 99%

“…1. The additional contributions of this paper with respect to the conference paper [13] are as follows: (i) more sophisticated approach for generating the synthetic samples during training; (ii) additional experiments on both low-resource and high-resource translation tasks; (iii) more exhaustive comparison to other DA methods; (iv) improved evaluation of the contribution of the source representations generated by the encoder during inference; and (v) better grounded analysis of the tendency to hallucinate of the models evaluated.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Sánchez-Cartagena,

Esplà-Gomis,

Pérez-Ortiz

et al. 2024

IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite

View full text Add to dashboard Cite

When the amount of parallel sentences available to train a neural machine translation is scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the parallel data available. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found out that this simple approach consistently improves translation performance as compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, the resulting systems are much more robust against domain shift and produce less hallucinations.

show abstract

Section: Introductionmentioning

confidence: 61%

“…4. This differs from previous work[13] in which transformations were applied to the original training samples during the pre-processing of the corpus, and therefore before training.…”

mentioning

confidence: 87%

Section: Trainingmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Sánchez-Cartagena,

Esplà-Gomis,

Pérez-Ortiz

et al. 2024

IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite

View full text Add to dashboard Cite

show abstract

“…The highest BLEU of 45.98 was obtained for word-level tokenization, showing that selecting the right pre-processing methods may improve the performance of the MT models. [25] created an NMT between English and Swahili using news data. Authors obtained BLEU of 27.42 for English-to-Swahili translation when using the data that included part-of-speech tags outperforming Google translate.…”

Section: Related Workmentioning

confidence: 99%

Masakhane Web: A Machine Translation Platform for African Languages

Gitau

Kabongo

Modupe

et al. 2023

Preprint

View full text Add to dashboard Cite

Low-resource languages pose a particularly difficult challenge to neu-ral machine translation (NMT), and there appears to be insufficient machine translation (MT) systems to support African language accessibility. Masakhane Web, an NMT system for African languages, is proposed in this paper. Our approach is an open-source platform that is free, flexible, and produces reasonably accurate translations for African languages. The platform makes use of Masakhane community-trained MT models. It enables users to generate new data by providing feedback on translations, which is then used to retrain the models to improve them. Ultimately, our goal is to create a platform that can provide accurate translations for African languages and make the process of creating MT models easier for those who lack the technical expertise. Furthermore, we include strategies for domain experts to evaluate the system and explain how the platform can be used as a data collection source to improve MT for African languages.

show abstract

Effective Data Augmentation Methods for CCMT 2022

Wang¹,

Yang²

2022

Communications in Computer and Information Science

View full text Add to dashboard Cite

Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach

Cited by 14 publications

References 36 publications

Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Masakhane Web: A Machine Translation Platform for African Languages

Effective Data Augmentation Methods for CCMT 2022

Contact Info

Product

Resources

About