2019
DOI: 10.1609/aaai.v33i01.33015466

Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder

Abstract: Sharing source and target side vocabularies and word embeddings has been a popular practice in neural machine translation (briefly, NMT) for similar languages (e.g., English to French or German translation). The success of such word-level sharing motivates us to move one step further: we consider model-level sharing and tie the whole parts of the encoder and decoder of an NMT model. We share the encoder and decoder of Transformer (Vaswani et al. 2017), the state-of-the-art NMT model, and obtain a compact model …
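
To make the model-level sharing concrete, here is a minimal sketch of one way to tie encoder and decoder parameters, assuming PyTorch; the class names, dimensions, and layer layout are illustrative assumptions, not the authors' released code. The decoder block reuses the encoder block's self-attention and feed-forward weights and adds only its own cross-attention.

```python
import torch
import torch.nn as nn

d_model, nhead, dim_ff = 512, 8, 2048  # illustrative dimensions

class SharedLayer(nn.Module):
    """Self-attention + feed-forward block whose weights serve both sides."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(),
                                 nn.Linear(dim_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))

class TiedDecoderLayer(nn.Module):
    """Decoder block that borrows the SharedLayer's parameters and adds only
    its own cross-attention over the encoder output."""
    def __init__(self, shared: SharedLayer):
        super().__init__()
        self.shared = shared  # same module object => shared parameters
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, y, memory, tgt_mask=None):
        a, _ = self.shared.self_attn(y, y, y, attn_mask=tgt_mask)
        y = self.shared.norm1(y + a)
        c, _ = self.cross_attn(y, memory, memory)
        y = self.norm(y + c)
        return self.shared.norm2(y + self.shared.ffn(y))

shared = SharedLayer()
memory = shared(torch.randn(2, 7, d_model))                       # encoder side
out = TiedDecoderLayer(shared)(torch.randn(2, 5, d_model), memory)
print(out.shape)  # torch.Size([2, 5, 512])
```

Because the decoder holds a reference to the same SharedLayer object, both sides train one set of weights, which is what makes the tied model compact.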

Cited by 50 publications (19 citation statements) | References 17 publications

“…Limits of sharing parameters: Concurrent studies have shown that sharing the self-attention and feed-forward layer parameters between the encoder and decoder is possible without a great loss in performance (Xia et al., 2019). However, its combination with RS performs badly.…”
Section: Discussion
confidence: 99%
“…Eventually, the RS models have the same size as that of a 1-layer model. Another approach is to share the parameters between the encoder and the decoder (Xia et al., 2019; Dabre and Fujita, 2019). We consider that this approach is orthogonal to our RS and will examine their combination in our future work.…”
Section: Related Work
confidence: 99%
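
For context on what recurrent stacking (RS) refers to above, the following is a minimal sketch, assuming PyTorch; the layer type, depth, and dimensions are illustrative assumptions and the code is not taken from the cited papers. A single layer is reapplied at every depth, so a deep stack keeps the parameter count of a one-layer model.

```python
import torch
import torch.nn as nn

# One Transformer encoder layer whose parameters are reused at every depth.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

def recurrent_stack(x, layer, depth=6):
    # Apply the same layer (hence the same parameters) `depth` times.
    for _ in range(depth):
        x = layer(x)
    return x

x = torch.randn(2, 7, 512)
out = recurrent_stack(x, layer)
print(out.shape)                                   # torch.Size([2, 7, 512])
print(sum(p.numel() for p in layer.parameters()))  # one layer's parameters
```
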
“…Two tasks were conducted to evaluate the proposed dualformer for machine translation in terms of BLEU [25,26,27,28,29]. The standard Transformer and other dual learning methods were implemented for comparison.…”
Section: Methods
confidence: 99%
“…Our method ties the parameters of multiple models, which is orthogonal to the work that ties parameters between layers (Dabre and Fujita, 2019) and/or between the encoder and decoder within a single model (Xia et al., 2019; Dabre and Fujita, 2019). Parameter tying leads to compact models, but they usually suffer from drops in inference quality.…”
Section: Related Work
confidence: 99%