Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.40

Multiscale Collaborative Deep Models for Neural Machine Translation

Abstract: Recent evidence reveals that Neural Machine Translation (NMT) models with deeper neural networks can be more effective but are difficult to train. In this paper, we present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously. We explicitly boost the gradient backpropagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models. Then, instead of forcing the whole encoder stack directl…
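To make the abstract's block-scale collaboration idea more concrete, here is a minimal PyTorch sketch of one plausible reading: encoder layers are grouped into blocks, and the block outputs are fused with learned weights so that gradients from the top of the stack reach lower blocks through short paths. This is not the paper's actual MSC architecture; the names (BlockwiseEncoder, block_size, block_weights) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BlockwiseEncoder(nn.Module):
    """Toy encoder: layers grouped into blocks whose outputs are fused."""

    def __init__(self, d_model=512, nhead=8, num_layers=12, block_size=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.block_size = block_size
        num_blocks = num_layers // block_size
        # Learned fusion weights over block-level outputs (illustrative).
        self.block_weights = nn.Parameter(torch.zeros(num_blocks))

    def forward(self, x):
        block_outputs = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if (i + 1) % self.block_size == 0:
                # Expose the intermediate representation of each block.
                block_outputs.append(x)
        # Weighted fusion: the final output depends on every block directly,
        # so gradients from the top reach lower blocks through short paths.
        weights = torch.softmax(self.block_weights, dim=0).view(-1, 1, 1, 1)
        return (weights * torch.stack(block_outputs, dim=0)).sum(dim=0)


# Usage: a batch of 2 sentences, length 7, 512-dim embeddings.
encoder = BlockwiseEncoder()
out = encoder(torch.randn(2, 7, 512))
print(out.shape)  # torch.Size([2, 7, 512])
```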


Cited by 17 publications (12 citation statements)
References 29 publications
“…Wide models can also benefit from the enlarging layer depth (Wei et al, 2020;. The RK-2 ODE Transformer achieves BLEU score of 30.76 and 44.11 on the En-De and the En-Fr tasks, significantly surpassing the standard Big model by 1.32 and 0.70 BLEU points.…”
Section: Results
confidence: 99%
“…Deep Transformer models: Recently, deep Transformer has witnessed tremendous success in machine translation. A straightforward way is to shorten the path from upper-level layers to lower-level layers thus to alleviate the gradient vanishing or exploding problems (Bapna et al, 2018; Wu et al, 2019; Wei et al, 2020). For deeper models, the training cost is non-negligible.…”
Section: Related Work
confidence: 99%
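The "shorten the path from upper-level layers to lower-level layers" idea in the excerpt above can be pictured with a densely connected encoder in which each layer reads a learned mixture of all earlier outputs. The sketch below is a hypothetical illustration in that spirit, not code from any of the cited papers; DenselyConnectedEncoder and combine are made-up names.

```python
import torch
import torch.nn as nn


class DenselyConnectedEncoder(nn.Module):
    """Toy encoder: each layer reads a learned mix of all earlier outputs."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # combine[i, :i+1] weights the embedding plus outputs of layers < i.
        self.combine = nn.Parameter(torch.zeros(num_layers, num_layers + 1))

    def forward(self, x):
        outputs = [x]  # index 0 holds the embedding-level input
        for i, layer in enumerate(self.layers):
            # Direct connections to all earlier representations shorten the
            # gradient path from upper-level layers to lower-level layers.
            w = torch.softmax(self.combine[i, : i + 1], dim=0)
            mixed = sum(wj * oj for wj, oj in zip(w, outputs))
            outputs.append(layer(mixed))
        return outputs[-1]


encoder = DenselyConnectedEncoder()
print(encoder(torch.randn(2, 5, 512)).shape)  # torch.Size([2, 5, 512])
```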
“…In terms of domain adaptation, as models become larger (Wei et al, 2020b) and as high quality personalization becomes more important to users (Buj et al, 2020), there may be a growing focus on approaches which do not adapt the entire model. Instead, we predict that memory efficiency for tuning and storing models will continue to grow in importance.…”
Section: Discussion
confidence: 99%
“…More stacked layers lead to a stronger model of representing the sentence. This particularly makes sense in the deep NMT scenario because it has been proven that deep models can benefit from an enriched representation (Wu et al, 2019b; Wei et al, 2020).…”
Section: Introduction
confidence: 99%