Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1457

Exploiting Deep Representations for Neural Machine Translation

Abstract: Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanism…
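The abstract's core idea is to combine the hidden states of all layers rather than passing only the topmost one to the subsequent process. Below is a minimal, illustrative PyTorch sketch of such a layer-aggregation step, assuming learned softmax-normalized weights over layers; the class and variable names are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LayerAggregation(nn.Module):
    """Illustrative aggregation of per-layer hidden states.

    Instead of forwarding only the top layer, all layer outputs are
    combined with learned, softmax-normalized weights.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: list of per-layer tensors, each [batch, seq_len, d_model]
        stacked = torch.stack(layer_states, dim=0)           # [L, B, T, D]
        weights = torch.softmax(self.layer_weights, dim=0)   # [L]
        # Weighted sum over the layer dimension
        return torch.einsum('l,lbtd->btd', weights, stacked)
```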

Cited by 74 publications (74 citation statements)
References 25 publications (38 reference statements)
“…For example, the standard residual network is a special case of DLCL, where W^{l+1}_l = 1 and W^{l+1}_k = 0 for k < l. Figure 2 compares different methods of connecting a 3-layer network. We see that the densely residual network is a fully-connected network with a uniform weighting schema (Britz et al., 2017; Dou et al., 2018). Multi-layer representation fusion (Wang et al., 2018b) and transparent attention (call it TA) (Bapna et al., 2018) methods can learn a weighted model to fuse layers, but they are applied to the topmost layer only.…”
Section: Dynamic Linear Combination of Layers
confidence: 99%
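The DLCL idea quoted above can be made concrete with a short sketch: the input to layer l+1 is a learned linear combination of the outputs of all earlier layers, and fixing the weights recovers the plain residual wiring. This is an illustrative PyTorch sketch under assumed names and shapes, not the cited implementation.

```python
import torch
import torch.nn as nn

class LinearLayerCombination(nn.Module):
    """Illustrative DLCL-style combination: the input to layer l+1 is a
    learned linear combination of the outputs of all earlier layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        # W[l+1, k] weighs the contribution of layer k to the input of layer l+1.
        self.W = nn.Parameter(torch.zeros(num_layers + 1, num_layers + 1))

    def forward(self, prev_outputs):
        # prev_outputs: outputs y_0 .. y_l, each of shape [batch, seq_len, d_model]
        l = len(prev_outputs) - 1
        stacked = torch.stack(prev_outputs, dim=0)   # [l+1, B, T, D]
        w = self.W[l + 1, : l + 1]                   # weights for layers 0 .. l
        return torch.einsum('k,kbtd->btd', w, stacked)

# Fixing W[l+1, l] = 1 and W[l+1, k] = 0 for k < l recovers the standard
# residual wiring the excerpt describes, where only the immediately
# preceding layer feeds the next one.
```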
“…A number of recent efforts have explored ways to improve multi-head SAN by encouraging individual attention heads to extract distinct information (Strubell et al., 2018). Concerning the multi-layer SAN encoder, Dou et al. (2018, 2019) propose to aggregate the multi-layer representations, and Dehghani et al. (2019) recurrently refine these representations. Our approach is complementary to theirs, since they focus on improving the representation power of the SAN encoder, while we aim to complement the SAN encoder with an additional recurrence encoder.…”
Section: Short-cut Effect
confidence: 99%
“…Within recent literature, several strategies for altering the flow of information within the transformer have been proposed, including adaptive model depth (Dehghani et al., 2018), layer-wise transparent attention, and dense inter-layer connections (Dou et al., 2018). Our investigation bears strongest resemblance to the latter work, by introducing additional connectivity to the model.…”
Section: Related Work
confidence: 76%
“…While adding shortcuts improves translation quality, it is not obvious whether this is predominantly due to improved accessibility of lexical content, rather than increased connectivity between network layers, as suggested in (Dou et al., 2018). To isolate the importance of lexical information, we equip the transformer with non-lexical shortcuts connecting each layer n to layer n − 2, e.g.…”
Section: Shortcut Variants
confidence: 99%
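The non-lexical shortcut described in this excerpt skips the embeddings and connects each layer n to layer n − 2. A rough sketch of one way to wire this is below; the helper function is hypothetical, and the cited work's exact combination (e.g. gating or normalization) may differ.

```python
def encode_with_nonlexical_shortcuts(layers, x):
    """Hypothetical sketch: the input of layer n also receives the output
    of layer n-2, while the embeddings themselves are never shortcut."""
    outputs = [x]  # outputs[k+1] holds the output of layer k
    for n, layer in enumerate(layers):
        inp = outputs[-1]            # output of layer n-1 (or the embeddings)
        if n >= 2:
            inp = inp + outputs[-2]  # shortcut from layer n-2
        outputs.append(layer(inp))
    return outputs[-1]
```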