Towards a Unified View of Parameter-Efficient Transfer Learning
Preprint, 2021
DOI: 10.48550/arxiv.2110.04366

Abstract: Fine-tuning large pretrained language models on downstream tasks has become the de facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pretrained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connect…

Cited by 31 publications (55 citation statements)
References 26 publications
“…Although each of these three approaches has its own focus, the central idea is to keep the pre-trained parameters constant while training lightweight alternatives to achieve adaptation for downstream tasks. There have also been some recent attempts to grasp the internal connection of these strategies and build a unified parameter-efficient tuning framework [333,334].…”
Section: Parameter-efficient Tuning (mentioning)
confidence: 99%
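The central idea quoted above lends itself to a short sketch: keep every pretrained weight frozen and train only a lightweight residual module. The following is a minimal, illustrative PyTorch sketch rather than code from the cited papers; the `BottleneckAdapter` class, the bottleneck width, and the name-based freezing convention are assumptions made here for exposition.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style residual adapter: down-project, nonlinearity,
    up-project, plus a residual connection back to the input."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

def freeze_backbone_train_adapters(model: nn.Module) -> None:
    """Keep pretrained parameters constant; only adapter parameters get gradients.
    Assumes (for this sketch) that adapter modules carry 'adapter' in their name."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```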
“…Moreover, we show that our method can be used in tandem with several parameter-efficient methods (He et al., 2021) in order to make the increase in time and space complexity due to skill-specific parameters negligible. In particular, we explore sparse adaptation with Lottery-Ticket Sparse Fine-Tuning (LT-SFT; Ansell et al., 2022) and low-rank adaptation with Low-Rank Adapters (LoRA; Hu et al., 2021).…”
Section: Fine-grained Skill Selection (mentioning)
confidence: 99%
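As a rough companion to the low-rank adaptation mentioned in this excerpt (LoRA; Hu et al., 2021), here is a hedged PyTorch sketch of wrapping a frozen linear layer with trainable low-rank factors; the class name, default rank, and scaling are illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: the frozen base output is
    augmented by a trainable low-rank update scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weight stays fixed
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```

Because the up-projection starts at zero, the wrapped layer initially reproduces the frozen layer exactly, so training begins from the pretrained model's behavior.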
“…Adapter: Freeze the pre-trained model and train a residual Adapter (Houlsby et al., 2019). ParallelAdapter: a variant that transfers the parallel insertion of prefix tuning into adapters (He et al., 2021). Prompt-tuning (CLS/VER): only tunes soft prompts with a frozen language model (Lester et al., 2021), with the prompt applied at the transformer's first layer.…”
Section: Baseline Models (mentioning)
confidence: 99%
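The "parallel insertion" that distinguishes ParallelAdapter (He et al., 2021) from the sequential adapter of Houlsby et al. (2019) can be sketched as follows, again only as an illustration under assumed names and defaults: the adapter branch reads the sublayer's input, and its scaled output is added to the frozen sublayer's output rather than being applied afterwards.

```python
import torch
import torch.nn as nn

class ParallelAdapterBlock(nn.Module):
    """Parallel adapter sketch: a bottleneck branch runs alongside a frozen
    pretrained sublayer (e.g. an FFN block) and its output is added to the
    sublayer's output, scaled by a constant factor."""
    def __init__(self, sublayer: nn.Module, d_model: int,
                 bottleneck: int = 64, scale: float = 4.0):
        super().__init__()
        self.sublayer = sublayer              # frozen pretrained sublayer
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sublayer(x) + self.scale * self.up(self.act(self.down(x)))
```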
“…The level of catastrophic forgetting in EANN (Wang et al., 2018) is somewhat reduced compared to Fine-tuning but is still severe. Prompt-tuning and p-tuning v2 are somewhat related to the adapter method in the form of parameter tuning (He et al., 2021), but their performance in CL differs. The prompt-based model is better than the adapter on both datasets.…”
Section: Main Experiments (mentioning)
confidence: 99%
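For completeness, here is a minimal sketch of the prompt-tuning setup referenced in the last two excerpts (Lester et al., 2021), assuming a PyTorch embedding layer; the wrapper name, prompt length, and initialization scale are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prompt tuning sketch: a small matrix of learnable prompt embeddings is
    prepended to the token embeddings, and only this matrix is trained while
    the language model (and its embedding table) stays frozen."""
    def __init__(self, embed: nn.Embedding, num_prompt_tokens: int = 20):
        super().__init__()
        self.embed = embed
        for p in self.embed.parameters():
            p.requires_grad = False
        d_model = embed.embedding_dim
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                               # (batch, seq, d_model)
        prompt = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)                    # fed to the frozen LM
```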