2019
DOI: 10.1609/aaai.v33i01.330186
Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement

Abstract: With the promising progress of deep neural networks, layer aggregation has been used to fuse information across layers in various fields, such as computer vision and machine translation. However, most of the previous methods combine layers in a static fashion in that their aggregation strategy is independent of specific hidden states. Inspired by recent progress on capsule networks, in this paper we propose to use routing-by-agreement strategies to aggregate layers dynamically. Specifically, the algorithm lear…
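The abstract describes aggregating the hidden states of multiple encoder layers with iterative routing-by-agreement, in the style of capsule networks: coupling weights over layers are refined so that layers agreeing with the fused representation receive more weight. The following is a minimal NumPy sketch of that idea, not the paper's actual implementation; the function names (`squash`, `routing_aggregate`) and the iteration count are illustrative assumptions.

```python
import numpy as np

def squash(s):
    """Capsule-style nonlinearity: scales the vector so its norm lies in [0, 1)."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def routing_aggregate(layer_states, num_iters=3):
    """Fuse per-layer hidden states into one vector by routing-by-agreement.

    layer_states: array of shape (L, d), one d-dim state per encoder layer.
    Returns a single d-dim aggregated representation.
    """
    L, _ = layer_states.shape
    b = np.zeros(L)                          # routing logits, one per layer
    v = np.zeros(layer_states.shape[1])
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum()      # softmax coupling coefficients
        s = c @ layer_states                 # weighted sum across layers
        v = squash(s)                        # candidate fused representation
        b += layer_states @ v                # agreement update: dot(u_i, v)
    return v

# Toy usage: fuse 6 layer states of dimension 8 for one position.
states = np.random.RandomState(0).randn(6, 8)
fused = routing_aggregate(states)
print(fused.shape)  # (8,)
```

Because the coupling coefficients depend on the hidden states themselves through the agreement term, the aggregation is dynamic: different inputs yield different layer weightings, unlike a fixed (static) weighted sum of layers.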

Cited by 40 publications (23 citation statements). References 18 publications.
“…A number of recent efforts have explored ways to improve multi-head SAN by encouraging individual attention heads to extract distinct information (Strubell et al., 2018). Concerning the multi-layer SAN encoder, Dou et al. (2018, 2019) propose to aggregate the multi-layer representations, and Dehghani et al. (2019) recurrently refine these representations. Our approach is complementary to theirs, since they focus on improving the representation power of the SAN encoder, while we aim to complement the SAN encoder with an additional recurrence encoder.…”
Section: Short-cut Effect
confidence: 99%
“…Exploiting deep representations has been studied to strengthen feature propagation and encourage feature reuse in NMT (Shen et al., 2018; Dou et al., 2018, 2019; Wang et al., 2019b). All of these works mainly attend the decoder to the final output of the encoder stack; we instead coordinate the encoder and the decoder at an earlier stage.…”
Section: Related Work
confidence: 99%
“…Recent studies show that different encoder layers capture linguistic properties of different levels (Peters et al., 2018), and aggregating layers is of profound value to better fuse semantic information (Shen et al., 2018; Dou et al., 2018; Dou et al., 2019). We assume that different decoder layers may value different levels of information, i.e.…”
Section: Input
confidence: 99%