2020
DOI: 10.1162/tacl_a_00313

Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

Abstract: Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence m…
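
The paper's central recipe is to initialize both sides of a Transformer encoder-decoder from publicly released checkpoints and then fine-tune on the generation task. As a rough illustration only, the sketch below shows one way to do this with the Hugging Face transformers library in PyTorch; the checkpoint name and the library choice are assumptions for the example, not the authors' original TensorFlow setup.

# Minimal sketch (not the authors' code): warm-starting an encoder-decoder
# from public BERT checkpoints, in the spirit of the BERT2BERT setup.
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoder and decoder are both initialized from the same public checkpoint;
# the decoder's cross-attention weights are newly (randomly) initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# The warm-started model is then fine-tuned on the target generation task
# (e.g. summarization); here we only show that it can already run generation.
inputs = tokenizer("A long source document to be summarized.", return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))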

Cited by 326 publications (293 citation statements)
References 29 publications
“…Devlin et al. [2019] proposed BERT based on masked language modeling and next sentence prediction, and achieved state-of-the-art results on multiple NLP tasks. There are also some works on pre-training encoder-decoder models for language generation [Rothe et al., 2019; Edunov et al., 2019; Liu and Lapata, 2019]. The main difference between our generation model and others is that our model uses a pre-trained BERT model on the encoder side and a non-pre-trained Transformer on the decoder side, and we fine-tune the encoder and train the decoder using two separate optimizers.…”
Section: Quanzhi Li and Qiong Zhang
confidence: 99%
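
The setup this statement describes — a pre-trained BERT encoder feeding a randomly initialized Transformer decoder, with the two parts updated by separate optimizers — can be sketched roughly as follows. This is an assumed PyTorch illustration, not the cited authors' code; the layer sizes, learning rates, and the training_step helper are placeholders for the example.

# Sketch under assumptions: pre-trained encoder, fresh decoder, two optimizers.
import torch
from torch import nn
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")      # pre-trained
decoder = nn.TransformerDecoder(                              # randomly initialized
    nn.TransformerDecoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=6,
)
out_proj = nn.Linear(768, encoder.config.vocab_size)

# Two separate optimizers: a small learning rate to fine-tune the pre-trained
# encoder, a larger one to train the fresh decoder (values are illustrative).
enc_opt = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
dec_opt = torch.optim.AdamW(
    list(decoder.parameters()) + list(out_proj.parameters()), lr=1e-4
)

def training_step(src_ids, src_mask, tgt_embeds, labels):
    # Encoder hidden states for the whole source; the decoder cross-attends to them.
    memory = encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
    hidden = decoder(tgt=tgt_embeds, memory=memory)  # causal masking omitted for brevity
    logits = out_proj(hidden)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
    enc_opt.zero_grad(); dec_opt.zero_grad()
    loss.backward()
    enc_opt.step(); dec_opt.step()
    return loss.item()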
“…The loss of the joint learning model is the sum of the losses of the three tasks. There are previous studies on pre-training encoder-decoder models for language generation [Rothe et al., 2019; Edunov et al., 2019; Liu and Lapata, 2019], and some of them also use different optimizers for different components.…”
Section: Model Training and Inference
confidence: 99%
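
A hedged sketch of the joint-training idea in this statement: the total loss is the plain sum of three task losses, and a single backward pass feeds two different optimizers for different components. The task_a_loss/task_b_loss/task_c_loss methods and both optimizers are hypothetical placeholders, not the cited paper's API.

import torch

def joint_training_step(model, batch, encoder_opt, head_opt):
    # Hypothetical task-specific loss methods; the joint loss is simply
    # the sum of the three task losses computed on the same batch.
    loss = model.task_a_loss(batch) + model.task_b_loss(batch) + model.task_c_loss(batch)

    encoder_opt.zero_grad()
    head_opt.zero_grad()
    loss.backward()        # one backward pass through the shared model
    encoder_opt.step()     # e.g. gentle fine-tuning of the pre-trained component
    head_opt.step()        # e.g. faster updates for the task-specific components
    return loss.item()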
“…The hidden states generated by the encoder for the entire input sequence are passed to the decoder, thus allowing the decoder to attend over the entire input sequence during each decoding step. This model serves as our primary baseline, as it is identical to the BERT2RND model in Rothe et al (2019). We use the same hyperparameters as Rothe et al (2019), which were selected after extensive tuning.…”
Section: Models
confidence: 99%
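
The mechanism this statement relies on — every decoding step attending over the hidden states of the entire input sequence — is encoder-decoder cross-attention. Below is a minimal single-head sketch in PyTorch; the projection matrices and tensor shapes are illustrative assumptions, not the BERT2RND implementation from Rothe et al. (2019).

import torch
import torch.nn.functional as F

def cross_attention(dec_states, enc_states, w_q, w_k, w_v):
    # dec_states: (T_dec, d) decoder positions; enc_states: (T_enc, d) encoder
    # hidden states for the entire input sequence.
    q = dec_states @ w_q
    k = enc_states @ w_k
    v = enc_states @ w_v
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)   # each decoder step weighs every input position
    return weights @ v

d = 768
w_q, w_k, w_v = (torch.randn(d, d) * 0.02 for _ in range(3))
enc = torch.randn(128, d)   # hidden states for a 128-token source sequence
dec = torch.randn(10, d)    # states for the 10 tokens decoded so far
context = cross_attention(dec, enc, w_q, w_k, w_v)   # -> shape (10, 768)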