2018
DOI: 10.48550/arxiv.1811.01088
Preprint
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Cited by 115 publications (188 citation statements)
References 0 publications
“…Within the context of transfer learning, intermediate-task training refers to fine-tuning a pre-trained model on an intermediate task before fine-tuning it on a final target task. This has been found to provide an additional improvement to target task performance compared to using the pre-trained model alone (Vu et al., 2020; Phang et al., 2018). We provide an analog of this in our merging framework by merging a model fine-tuned on the target task with a model fine-tuned on the intermediate task.…”
Section: Intermediate-task Training
Citation type: mentioning (confidence: 98%)
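The merging mentioned in that statement can be pictured with a simple parameter-interpolation sketch. This is a minimal illustration, not the cited paper's actual merging method: the checkpoint paths, the `average_state_dicts` helper, and the interpolation weight `alpha` are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the cited paper's method): interpolate the
# parameters of a target-task model with those of an intermediate-task model.
from transformers import AutoModelForSequenceClassification

def average_state_dicts(model_a, model_b, alpha=0.5):
    """Linearly interpolate parameters shared by the two checkpoints."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged = {}
    for name, p_a in sd_a.items():
        if name in sd_b and sd_b[name].shape == p_a.shape:
            merged[name] = alpha * p_a + (1.0 - alpha) * sd_b[name]
        else:
            merged[name] = p_a  # e.g. task-specific heads stay as in the target model
    return merged

# Hypothetical checkpoint paths: one model fine-tuned on the target task,
# one fine-tuned on the intermediate task.
target = AutoModelForSequenceClassification.from_pretrained("ckpts/target-finetuned")
intermediate = AutoModelForSequenceClassification.from_pretrained("ckpts/intermediate-finetuned")

target.load_state_dict(average_state_dicts(target, intermediate), strict=False)
```

The fixed uniform weight here is purely for illustration; how the merging coefficients are chosen is exactly what a real merging framework would specify.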
“…In computer vision, pre-training is typically done on a large labeled dataset like ImageNet (Deng et al., 2009; Russakovsky et al., 2015), whereas applications of transfer learning to natural language processing typically pre-train through self-supervised training on a large unlabeled text corpus. Recently, it has been shown that training on an "intermediate" task between pre-training and fine-tuning can further boost performance (Phang et al., 2018; Vu et al., 2020; Pruksachatkun et al., 2020; Phang et al., 2020). Alternatively, continued self-supervised training on unlabelled domain-specialized data can serve as a form of domain adaptation (Gururangan et al., 2020).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
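The continued self-supervised training mentioned at the end of that statement can be sketched as a short masked-language-modeling run on in-domain text. This is a generic illustration, not Gururangan et al.'s recipe; the model name, corpus file, and hyperparameters are placeholders.

```python
# Sketch (assumed corpus file, model choice, hyperparameters) of continued
# masked-LM training on unlabeled domain text before any task fine-tuning.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Assumed: one in-domain document per line in a plain-text file.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain_adapted_lm", num_train_epochs=1),
    train_dataset=corpus,
    # The collator masks ~15% of tokens and builds the MLM labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("domain_adapted_lm")  # later fine-tuning starts from this checkpoint
```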
“…We are interested in whether such a similarity between QA tasks and sequence-pair text classification tasks can make a difference. In terms of training procedure, we follow previous work (Phang et al., 2018; Vu et al., 2020). Specifically, we first fine-tune a pre-trained LM on SQuAD-2.0 (intermediate training stage) and then fine-tune it on each text classification task.…”
Section: Methods
Citation type: mentioning (confidence: 99%)
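That two-stage procedure, intermediate fine-tuning on extractive QA followed by target fine-tuning on a classification task, could look roughly like the following. This is a hedged sketch, not the cited authors' code: `squad_v2_train` and `target_cls_train` are assumed to be already tokenized datasets, and the hyperparameters are placeholders.

```python
# Two-stage sketch (assumed datasets and hyperparameters): QA intermediate task,
# then a sequence-pair classification target task on the same encoder.
from transformers import (AutoModelForQuestionAnswering,
                          AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

base = "bert-base-uncased"

# Stage 1: intermediate fine-tuning on an extractive QA dataset (e.g. SQuAD 2.0).
qa_model = AutoModelForQuestionAnswering.from_pretrained(base)
stage1 = Trainer(
    model=qa_model,
    args=TrainingArguments(output_dir="stage1_qa", num_train_epochs=2),
    train_dataset=squad_v2_train,  # assumed: pre-tokenized QA examples
)
stage1.train()
qa_model.save_pretrained("stage1_qa")

# Stage 2: load the intermediate checkpoint as a classifier; the QA head is
# discarded and a fresh classification head is initialized on top of the encoder.
cls_model = AutoModelForSequenceClassification.from_pretrained("stage1_qa", num_labels=2)
stage2 = Trainer(
    model=cls_model,
    args=TrainingArguments(output_dir="stage2_cls", num_train_epochs=3),
    train_dataset=target_cls_train,  # assumed: tokenized sequence-pair examples
)
stage2.train()
```

Repeating stage 2 independently per target task, starting each time from the saved intermediate checkpoint, matches the "each text classification task" phrasing above.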
“…Another effective transfer learning approach, intermediate training, first trains an LM on an intermediate task in a supervised manner and then fine-tunes it on target tasks. This also leads to promising results across various NLP tasks, including text classification, QA, and sequence labeling (Phang et al., 2018; Vu et al., 2020; Pruksachatkun et al., 2020).…”
Section: Introduction
Citation type: mentioning (confidence: 93%)
“…Transfer Learning. A large body of work has attempted to leverage multi-task learning to endow a model with an inductive bias that improves generalization on a main task of interest (Caruana, 1998; Bakker & Heskes, 2003; Raffel et al., 2020), with recent work in NLP sharing our focus on neural networks (Sogaard & Goldberg, 2016; Hashimoto et al., 2016; Swayamdipta et al., 2018; for a review, see Ruder, 2017). Intermediate training of pre-trained sentence encoders on a task or a set of tasks that are related to the task of interest has been advocated, among others, by Phang et al. (2018) and Aghajanyan et al. (2021). Gururangan et al. (2020) craft a training pipeline where a pre-trained language model is adapted to domain-specific and then task-specific corpora.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)