Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.467

Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?

Abstract: While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate al…
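The recipe described in the abstract is a two-stage fine-tuning pipeline: fine-tune a pretrained encoder on a data-rich intermediate task, then fine-tune the resulting checkpoint on the target task. The sketch below illustrates this idea with Hugging Face Transformers and toy data; the checkpoint name roberta-intermediate, the tiny example sentences, and the label counts are placeholders for illustration, not the paper's actual experimental setup.

```python
# Minimal sketch of intermediate-task transfer (not the authors' code):
# Stage 1 fine-tunes RoBERTa on a data-rich intermediate task; Stage 2
# fine-tunes the resulting checkpoint on the target task.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def make_loader(texts, labels, batch_size=2):
    # Toy stand-in for a real intermediate/target dataset such as MNLI.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
    return DataLoader(ds, batch_size=batch_size, shuffle=True)

def finetune(model, loader, epochs=1, lr=2e-5):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# Stage 1: intermediate-task training (toy 3-way data).
inter_loader = make_loader(["a dog runs", "a cat sleeps", "birds fly", "fish swim"], [0, 1, 2, 0])
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
model = finetune(model, inter_loader)
model.save_pretrained("roberta-intermediate")

# Stage 2: target-task fine-tuning; a fresh 2-way classifier head replaces
# the intermediate head (ignore_mismatched_sizes handles the size change).
target_loader = make_loader(["great movie", "terrible movie"], [1, 0])
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-intermediate", num_labels=2, ignore_mismatched_sizes=True)
model = finetune(model, target_loader)
```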

Cited by 138 publications (99 citation statements)
References 39 publications
“…task/transfer learning, often improves over standard single-task learning (Ruder, 2017). Within multitask learning, several works (e.g., Luong et al., 2016; Liu et al., 2019b; Raffel et al., 2020) (Pruksachatkun et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…As an example, Phang et al (2018) show that downstream accuracy can benefit from an intermediate fine-tuning task, but leave the investigation of why certain tasks benefit from intermediate task training to future work. Recently, Pruksachatkun et al (2020) extended this approach using eleven diverse intermediate fine-tuning tasks. They view probing task performance after finetuning as an indicator of the acquisition of a particular language skill during intermediate task finetuning.…”
Section: Related Work (mentioning)
confidence: 99%
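The probing protocol mentioned in this snippet is commonly implemented by freezing the intermediate-task-trained encoder and training only a lightweight classifier on its representations, so that probe accuracy reflects what the encoder itself has learned. Below is a minimal linear-probe sketch under that assumption; the roberta-intermediate checkpoint, the toy word-order examples, and the binary label set are illustrative placeholders, not the setup used in the cited papers.

```python
# Illustrative linear-probe sketch: freeze an intermediate-task-trained
# encoder and train only a linear classifier on its sentence representation;
# probe accuracy is then read as a proxy for the skill being measured.
import torch
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("roberta-intermediate")  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
for p in encoder.parameters():   # keep the encoder frozen
    p.requires_grad = False
encoder.eval()

probe = torch.nn.Linear(encoder.config.hidden_size, 2)  # toy binary probing task
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Toy word-order probing examples (placeholders).
texts = ["she gave him the book", "him gave she book the"]
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    reps = encoder(**enc).last_hidden_state[:, 0]  # <s> token representation

for _ in range(10):  # only the probe's parameters are updated
    loss = loss_fn(probe(reps), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```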
“…Note that none of the individual tasks in XTREME covers all 40 languages, but much smaller language subsets. We leave an even more general analysis that combines transfer both across tasks (Pruksachatkun et al., 2020; Glavaš and Vulić, 2020) and across languages for future work.…”
mentioning
confidence: 99%