Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.40

Multiscale Collaborative Deep Models for Neural Machine Translation

Abstract: Recent evidence reveals that Neural Machine Translation (NMT) models with deeper neural networks can be more effective but are difficult to train. In this paper, we present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously. We explicitly boost the gradient backpropagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models. Then, instead of forcing the whole encoder stack directl…
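To make the abstract's block-scale collaboration idea more concrete, here is a minimal PyTorch sketch of one plausible reading: encoder layers are grouped into blocks, and the block outputs are fused with learned weights so that gradients from the top of the stack reach lower blocks through short paths. This is not the paper's actual MSC architecture; the names (BlockwiseEncoder, block_size, block_weights) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BlockwiseEncoder(nn.Module):
    """Toy encoder: layers grouped into blocks whose outputs are fused."""

    def __init__(self, d_model=512, nhead=8, num_layers=12, block_size=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.block_size = block_size
        num_blocks = num_layers // block_size
        # Learned fusion weights over block-level outputs (illustrative).
        self.block_weights = nn.Parameter(torch.zeros(num_blocks))

    def forward(self, x):
        block_outputs = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if (i + 1) % self.block_size == 0:
                # Expose the intermediate representation of each block.
                block_outputs.append(x)
        # Weighted fusion: the final output depends on every block directly,
        # so gradients from the top reach lower blocks through short paths.
        weights = torch.softmax(self.block_weights, dim=0).view(-1, 1, 1, 1)
        return (weights * torch.stack(block_outputs, dim=0)).sum(dim=0)


# Usage: a batch of 2 sentences, length 7, 512-dim embeddings.
encoder = BlockwiseEncoder()
out = encoder(torch.randn(2, 7, 512))
print(out.shape)  # torch.Size([2, 7, 512])
```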


Cited by 17 publications (12 citation statements)
References 29 publications
“…Wide models can also benefit from the enlarging layer depth (Wei et al, 2020;. The RK-2 ODE Transformer achieves BLEU score of 30.76 and 44.11 on the En-De and the En-Fr tasks, significantly surpassing the standard Big model by 1.32 and 0.70 BLEU points.…”
Section: Results
confidence: 99%
“…Deep Transformer models: Recently, deep Transformer has witnessed tremendous success in machine translation. A straightforward way is to shorten the path from upper-level layers to lower-level layers thus to alleviate the gradient vanishing or exploding problems (Bapna et al, 2018; Wu et al, 2019; Wei et al, 2020). For deeper models, the training cost is non-negligible.…”
Section: Related Work
confidence: 99%
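The "shorten the path from upper-level layers to lower-level layers" idea in the excerpt above can be pictured with a densely connected encoder in which each layer reads a learned mixture of all earlier outputs. The sketch below is a hypothetical illustration in that spirit, not code from any of the cited papers; DenselyConnectedEncoder and combine are made-up names.

```python
import torch
import torch.nn as nn


class DenselyConnectedEncoder(nn.Module):
    """Toy encoder: each layer reads a learned mix of all earlier outputs."""

    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # combine[i, :i+1] weights the embedding plus outputs of layers < i.
        self.combine = nn.Parameter(torch.zeros(num_layers, num_layers + 1))

    def forward(self, x):
        outputs = [x]  # index 0 holds the embedding-level input
        for i, layer in enumerate(self.layers):
            # Direct connections to all earlier representations shorten the
            # gradient path from upper-level layers to lower-level layers.
            w = torch.softmax(self.combine[i, : i + 1], dim=0)
            mixed = sum(wj * oj for wj, oj in zip(w, outputs))
            outputs.append(layer(mixed))
        return outputs[-1]


encoder = DenselyConnectedEncoder()
print(encoder(torch.randn(2, 5, 512)).shape)  # torch.Size([2, 5, 512])
```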
“…In terms of domain adaptation, as models become larger (Wei et al, 2020b) and as high quality personalization becomes more important to users (Buj et al, 2020), there may be a growing focus on approaches which do not adapt the entire model. Instead, we predict that memory efficiency for tuning and storing models will continue to grow in importance.…”
Section: Discussion
confidence: 99%
“…More stacked layers lead to a stronger model of representing the sentence. This particularly makes sense in the deep NMT scenario because it has been proven that deep models can benefit from an enriched representation (Wu et al, 2019b; Wei et al, 2020).…”
Section: Introduction
confidence: 99%