Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1457

Exploiting Deep Representations for Neural Machine Translation

Abstract: Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanism…
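The abstract's core idea is to combine the hidden states of all layers rather than passing only the topmost one to the subsequent process. Below is a minimal, illustrative PyTorch sketch of such a layer-aggregation step, assuming learned softmax-normalized weights over layers; the class and variable names are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LayerAggregation(nn.Module):
    """Illustrative aggregation of per-layer hidden states.

    Instead of forwarding only the top layer, all layer outputs are
    combined with learned, softmax-normalized weights.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: list of per-layer tensors, each [batch, seq_len, d_model]
        stacked = torch.stack(layer_states, dim=0)           # [L, B, T, D]
        weights = torch.softmax(self.layer_weights, dim=0)   # [L]
        # Weighted sum over the layer dimension
        return torch.einsum('l,lbtd->btd', weights, stacked)
```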

Cited by 74 publications (74 citation statements)
References 25 publications (38 reference statements)
“…For example, the standard residual network is a special case of DLCL, where W^{l+1}_l = 1 and W^{l+1}_k = 0 for k < l. Figure 2 compares different methods of connecting a 3-layer network. We see that the densely residual network is a fully-connected network with a uniform weighting schema (Britz et al., 2017; Dou et al., 2018). Multi-layer representation fusion (Wang et al., 2018b) and transparent attention (call it TA) (Bapna et al., 2018) methods can learn a weighted model to fuse layers, but they are applied to the topmost layer only.…”
Section: Dynamic Linear Combination of Layers
confidence: 99%
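The DLCL idea quoted above can be made concrete with a short sketch: the input to layer l+1 is a learned linear combination of the outputs of all earlier layers, and fixing the weights recovers the plain residual wiring. This is an illustrative PyTorch sketch under assumed names and shapes, not the cited implementation.

```python
import torch
import torch.nn as nn

class LinearLayerCombination(nn.Module):
    """Illustrative DLCL-style combination: the input to layer l+1 is a
    learned linear combination of the outputs of all earlier layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        # W[l+1, k] weighs the contribution of layer k to the input of layer l+1.
        self.W = nn.Parameter(torch.zeros(num_layers + 1, num_layers + 1))

    def forward(self, prev_outputs):
        # prev_outputs: outputs y_0 .. y_l, each of shape [batch, seq_len, d_model]
        l = len(prev_outputs) - 1
        stacked = torch.stack(prev_outputs, dim=0)   # [l+1, B, T, D]
        w = self.W[l + 1, : l + 1]                   # weights for layers 0 .. l
        return torch.einsum('k,kbtd->btd', w, stacked)

# Fixing W[l+1, l] = 1 and W[l+1, k] = 0 for k < l recovers the standard
# residual wiring the excerpt describes, where only the immediately
# preceding layer feeds the next one.
```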
“…A number of recent efforts have explored ways to improve multi-head SAN by encouraging individual attention heads to extract distinct information (Strubell et al., 2018). Concerning the multi-layer SAN encoder, Dou et al. (2018, 2019) propose to aggregate the multi-layer representations, and Dehghani et al. (2019) recurrently refine these representations. Our approach is complementary to theirs, since they focus on improving the representation power of the SAN encoder, while we aim to complement the SAN encoder with an additional recurrence encoder.…”
Section: Short-cut Effect
confidence: 99%
“…Within recent literature, several strategies for altering the flow of information within the transformer have been proposed, including adaptive model depth (Dehghani et al., 2018), layer-wise transparent attention, and dense inter-layer connections (Dou et al., 2018). Our investigation bears strongest resemblance to the latter work, by introducing additional connectivity to the model.…”
Section: Related Work
confidence: 76%
“…While adding shortcuts improves translation quality, it is not obvious whether this is predominantly due to improved accessibility of lexical content, rather than increased connectivity between network layers, as suggested in (Dou et al., 2018). To isolate the importance of lexical information, we equip the transformer with non-lexical shortcuts connecting each layer n to layer n − 2, e.g.…”
Section: Shortcut Variants
confidence: 99%
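The non-lexical shortcut described in this excerpt skips the embeddings and connects each layer n to layer n − 2. A rough sketch of one way to wire this is below; the helper function is hypothetical, and the cited work's exact combination (e.g. gating or normalization) may differ.

```python
def encode_with_nonlexical_shortcuts(layers, x):
    """Hypothetical sketch: the input of layer n also receives the output
    of layer n-2, while the embeddings themselves are never shortcut."""
    outputs = [x]  # outputs[k+1] holds the output of layer k
    for n, layer in enumerate(layers):
        inp = outputs[-1]            # output of layer n-1 (or the embeddings)
        if n >= 2:
            inp = inp + outputs[-2]  # shortcut from layer n-2
        outputs.append(layer(inp))
    return outputs[-1]
```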