Proceedings of the Third Conference on Machine Translation: Research Papers 2018
DOI: 10.18653/v1/w18-6308

Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation

Abstract: Tying the weights of the target word embeddings with the target word classifiers of neural machine translation models leads to faster training and often to better translation quality. Given the success of this parameter sharing, we investigate other forms of sharing in between no sharing and hard equality of parameters. In particular, we propose a structure-aware output layer which captures the semantic structure of the output space of words within a joint input-output embedding. The model is a generalized for…

Cited by 8 publications (5 citation statements)
References 23 publications

“…A linear transformation layer with a softmax activation is employed to convert the decoder's output representations H^L_d into output probabilities over the target vocabulary. To further improve the model's performance, recent work (Inan et al. 2016; Pappas et al. 2018) has proposed a linear transformation layer sharing the same weights with the word embedding layers of the decoder and encoder subnetworks. Furthermore, this strategy reduces the size of the model in terms of the number of trainable parameters.…”
Section: The Transformer Model
confidence: 99%
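The weight-sharing scheme quoted above is straightforward to express in code. Below is a minimal PyTorch sketch, not the paper's implementation: the class name, dimensions, and the log-softmax output are illustrative assumptions. It shows a decoder output layer whose projection matrix is tied to the target embedding table, so no separate vocabulary-by-dimension classifier matrix is learned.

```python
import torch
import torch.nn as nn

class TiedOutputLayer(nn.Module):
    """Illustrative sketch of hard weight tying: the decoder's output
    projection reuses the target embedding matrix instead of learning
    a separate classifier matrix."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # The projection shares its weight with the embedding table, so no
        # extra (vocab_size x d_model) parameter matrix is added to the model.
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embedding.weight  # hard parameter sharing

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, tgt_len, d_model) -> log-probs over the vocab
        logits = self.proj(decoder_states)
        return torch.log_softmax(logits, dim=-1)

# Usage: project decoder output representations H^L_d onto the target vocabulary.
layer = TiedOutputLayer(vocab_size=32000, d_model=512)
h = torch.randn(2, 7, 512)   # dummy decoder outputs
log_probs = layer(h)         # shape (2, 7, 32000)
```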
“…The improvement is thought to arise because otherwise only one input embedding is updated each step and the gradient has to propagate a long way through the model to reach it. Subsequent work has explored more advanced forms of tying, recognising that the roles of the input and output matrices are not exactly the same (Pappas et al., 2018). This asymmetry has been found in the actual embedding spaces learned and shown to have a negative effect on performance (Gao et al., 2019; Demeter et al., 2020).…”
Section: Related Work
confidence: 99%
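To make the gradient argument in the quoted passage concrete, the toy check below (hypothetical sizes and random data, PyTorch) shows that a tied matrix receives gradient on every vocabulary row through the softmax, whereas an untied input embedding table would only be updated for the tokens that actually appear in the batch.

```python
import torch
import torch.nn as nn

vocab, d = 10, 4
emb = nn.Embedding(vocab, d)
out = nn.Linear(d, vocab, bias=False)
out.weight = emb.weight                      # tie input and output matrices

tokens = torch.tensor([[1, 3]])              # only tokens 1 and 3 occur
logits = out(emb(tokens))                    # shape (1, 2, vocab)
loss = nn.functional.cross_entropy(logits.view(-1, vocab),
                                   torch.tensor([3, 1]))
loss.backward()

# The shared matrix gets a nonzero gradient on every row (a dense update via
# the softmax); an untied embedding table would only see rows 1 and 3.
rows_updated = (emb.weight.grad.abs().sum(dim=1) > 0).sum().item()
print(rows_updated)  # 10
```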
“…Takase et al. (2018) extend this approach by adding what they call a Direct Output Connection, which computes the probability distribution at all layers of the NNLM. Other work has focused on weight tying, such as the structure-aware output layer (Pappas et al., 2018; Pappas and Henderson, 2019). Despite their importance, there is limited work which attempts to further analyse these output embeddings beyond the work of Press and Wolf (2017), who show that these representations outperform the input embeddings on word similarity benchmarks.…”
Section: Related Work
confidence: 99%