Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.528

Prediction Difference Regularization against Perturbation for Neural Machine Translation

Abstract: Regularization methods applying input perturbation have drawn considerable attention and have been frequently explored for NMT tasks in recent years. Despite their simplicity and effectiveness, we argue that these methods are limited by the under-fitting of training data. In this paper, we utilize prediction difference for ground-truth tokens to analyze the fitting of token-level samples and find that under-fitting is almost as common as over-fitting. We introduce prediction difference regularization (PD-R), a …
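The abstract is truncated here. As a rough illustration of the idea it describes, a penalty on how much the model's predictions for ground-truth tokens change when the input is perturbed, a minimal PyTorch-style sketch follows. The function name, the perturbation, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a prediction-difference-style regularizer (illustrative only).
# Idea: run the model on the original and a perturbed input, then penalize
# the gap between the probabilities the two passes assign to the reference
# (ground-truth) tokens, on top of the usual cross-entropy loss.
import torch
import torch.nn.functional as F

def pd_regularized_loss(model, src, tgt_in, tgt_out, perturb, alpha=1.0):
    """src/tgt_in/tgt_out: token-id tensors; perturb: a function returning a
    perturbed copy of src (e.g. word dropout); alpha: regularizer weight."""
    logits_clean = model(src, tgt_in)            # (batch, len, vocab)
    logits_pert = model(perturb(src), tgt_in)    # same shape, perturbed view

    # Standard token-level cross-entropy on both views.
    ce = F.cross_entropy(logits_clean.transpose(1, 2), tgt_out) + \
         F.cross_entropy(logits_pert.transpose(1, 2), tgt_out)

    # Prediction difference on the ground-truth tokens.
    p_clean = logits_clean.softmax(-1)
    p_pert = logits_pert.softmax(-1)
    gt_clean = p_clean.gather(-1, tgt_out.unsqueeze(-1)).squeeze(-1)
    gt_pert = p_pert.gather(-1, tgt_out.unsqueeze(-1)).squeeze(-1)
    pd_penalty = (gt_clean - gt_pert).abs().mean()

    return ce + alpha * pd_penalty
```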

Cited by 6 publications (3 citation statements) · References 23 publications
“…Since it is difficult to train an end-to-end ST model directly, some training techniques like pretraining (Weiss et al., 2017; Berard et al., 2018; Bansal et al., 2019; Stoian et al., 2020; Wang et al., 2020b; Dong et al., 2021a; Alinejad and Sarkar, 2020; Zheng et al., 2021b), multi-task learning (Le et al., 2020; Vydana et al., 2021; Tang et al., 2021b; Ye et al., 2021; Tang et al., 2021a), curriculum learning (Kano et al., 2017; Wang et al., 2020c), and meta-learning (Indurthi et al., 2020) have been applied. Recent work has introduced mixup on machine translation (Zhang et al., 2019b; Guo et al., 2022; Fang and Feng, 2022), sentence classification (Chen et al., 2020; Jindal et al., 2020; Sun et al., 2020), multilingual understanding, and speech recognition (Medennikov et al., 2018; Sun et al., 2021; Lam et al., 2021a; Meng et al., 2021), and obtained enhancements.…”
Section: Can the Final Model Still Perform MT Task? (mentioning)
confidence: 99%
“…Neural machine translation (NMT) (Bahdanau et al., 2014) has made great progress in recent years (Barrault et al., 2020; Guo et al., 2022). However, when the input text exceeds a single sentence, sentence-level NMT methods fail to capture discourse phenomena such as pronominal anaphora, lexical consistency, and document coherence.…”
Section: Introduction (mentioning)
confidence: 99%
“…The encoder "understands" the sentence in the source language and forms a fixed-dimensional floating-point vector from which the decoder generates a word-by-word translation in the target language. In its infancy, RNN [5], LSTM [6], GRU [7], and other structures were widely used as encoder and decoder networks in NMT [8]. In 2017, the Transformer [9] came out, which not only dramatically surpasses RNN-based neural networks in translation quality but also achieves higher training efficiency through parallelized training.…”
Section: Introduction (mentioning)
confidence: 99%
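For readers unfamiliar with the paradigm the quoted passage describes, a minimal sketch of a Transformer-based encoder-decoder is shown below, using PyTorch's stock nn.Transformer. The model size and vocabulary sizes are placeholder assumptions, not taken from any of the cited papers.

```python
# Minimal encoder-decoder sketch (illustrative only): the encoder reads the
# source sentence, the decoder generates target tokens one position at a time.
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Causal mask keeps the decoder from peeking at future target tokens.
        mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_emb(src_ids),
                                  self.tgt_emb(tgt_ids),
                                  tgt_mask=mask)
        return self.out(hidden)  # (batch, tgt_len, tgt_vocab) logits
```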