Proceedings of the Third Conference on Machine Translation: Research Papers 2018
DOI: 10.18653/v1/w18-6308

Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation

Abstract: Tying the weights of the target word embeddings with the target word classifiers of neural machine translation models leads to faster training and often to better translation quality. Given the success of this parameter sharing, we investigate other forms of sharing in between no sharing and hard equality of parameters. In particular, we propose a structure-aware output layer which captures the semantic structure of the output space of words within a joint input-output embedding. The model is a generalized for…

Cited by 8 publications (5 citation statements)
References 23 publications

“…A linear transformation layer with a softmax activation is employed to convert the decoder's output representations H^L_d into output probabilities over the target vocabulary. To further improve the model's performance, recent work (Inan et al. 2016; Pappas et al. 2018) has proposed a linear transformation layer sharing the same weights with the word embedding layers of the decoder and encoder subnetworks. Furthermore, this strategy reduces the size of the model in terms of the number of trainable parameters.…”
Section: The Transformer Model
confidence: 99%
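The weight-sharing scheme quoted above is straightforward to express in code. Below is a minimal PyTorch sketch, not the paper's implementation: the class name, dimensions, and the log-softmax output are illustrative assumptions. It shows a decoder output layer whose projection matrix is tied to the target embedding table, so no separate vocabulary-by-dimension classifier matrix is learned.

```python
import torch
import torch.nn as nn

class TiedOutputLayer(nn.Module):
    """Illustrative sketch of hard weight tying: the decoder's output
    projection reuses the target embedding matrix instead of learning
    a separate classifier matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # The projection shares its weight with the embedding table, so no
        # extra (vocab_size x d_model) parameter matrix is added to the model.
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embedding.weight  # hard parameter sharing

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, tgt_len, d_model) -> log-probs over the vocab
        logits = self.proj(decoder_states)
        return torch.log_softmax(logits, dim=-1)

# Usage: project decoder output representations H^L_d onto the target vocabulary.
layer = TiedOutputLayer(vocab_size=32000, d_model=512)
h = torch.randn(2, 7, 512)   # dummy decoder outputs
log_probs = layer(h)         # shape (2, 7, 32000)
```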
“…The improvement is thought to arise because otherwise only one input embedding is updated each step and the gradient has to propagate a long way through the model to reach it. Subsequent work has explored more advanced forms of tying, recognising that the roles of the input and output matrices are not exactly the same (Pappas et al., 2018). This asymmetry has been found in the actual embedding spaces learned and shown to have a negative effect on performance (Gao et al., 2019; Demeter et al., 2020).…”
Section: Related Work
confidence: 99%
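To make the gradient argument in the quoted passage concrete, the toy check below (hypothetical sizes and random data, PyTorch) shows that a tied matrix receives gradient on every vocabulary row through the softmax, whereas an untied input embedding table would only be updated for the tokens that actually appear in the batch.

```python
import torch
import torch.nn as nn

vocab, d = 10, 4
emb = nn.Embedding(vocab, d)
out = nn.Linear(d, vocab, bias=False)
out.weight = emb.weight                      # tie input and output matrices

tokens = torch.tensor([[1, 3]])              # only tokens 1 and 3 occur
logits = out(emb(tokens))                    # shape (1, 2, vocab)
loss = nn.functional.cross_entropy(logits.view(-1, vocab),
                                   torch.tensor([3, 1]))
loss.backward()

# The shared matrix gets a nonzero gradient on every row (a dense update via
# the softmax); an untied embedding table would only see rows 1 and 3.
rows_updated = (emb.weight.grad.abs().sum(dim=1) > 0).sum().item()
print(rows_updated)  # 10
```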
“…Takase et al. (2018) extend this approach by adding what they call a Direct Output Connection, which computes the probability distribution at all layers of the NNLM. Other work has focused on weight tying, such as the structure-aware output layer (Pappas et al., 2018; Pappas and Henderson, 2019). Despite their importance, there is limited work which attempts to further analyse these output embeddings beyond the work of Press and Wolf (2017), who show that these representations outperform the input embeddings on word similarity benchmarks.…”
Section: Related Work
confidence: 99%