Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2022
DOI: 10.18653/v1/2022.acl-short.1

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

Abstract: We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset of them) are being modified. We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning…
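BitFit's recipe is simple enough to state in a few lines of code: load a pre-trained model, freeze everything except the bias terms (plus the randomly initialized task head), and train as usual. Below is a minimal sketch assuming PyTorch and the HuggingFace Transformers library; the checkpoint name, task head, and learning rate are illustrative choices, not the paper's exact experimental setup.

```python
# Bias-only (BitFit-style) fine-tuning sketch with PyTorch + HuggingFace Transformers.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # illustrative checkpoint and task
)

# Freeze everything except bias terms and the (newly initialized) classifier head.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable: {sum(p.numel() for p in trainable):,} "
      f"of {sum(p.numel() for p in model.parameters()):,} parameters")

# Only the unfrozen parameters go to the optimizer; the training loop itself
# is the standard full fine-tuning loop.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```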

Cited by 197 publications (169 citation statements). References 17 publications.
“…Fine-tuning of large-scale language models (LMs) to get specialized models for specific tasks is known to be the best practice for optimizing task performance (Devlin et al., 2019; Aribandi et al., 2022) but is achieved at the significant cost of training and serving specialized models for many tasks. This motivates recent research on parameter-efficient tuning (Houlsby et al., 2019; Li and Liang, 2021; Ben Zaken et al., 2022), which focuses on tuning specialized models by updating a small number of their parameters. Yet, those specialized models fail to benefit from knowledge transfer across many tasks and leverage rich cross-task data (Liu…”
Section: Introduction (mentioning)
confidence: 92%
“…Parameter-efficient transfer learning. In addition to the approaches discussed in the previous sections (Houlsby et al., 2019; Ben Zaken et al., 2022; Li and Liang, 2021; Lester et al., 2021; Vu et al., 2022), many parameter-efficient transfer approaches have been introduced recently. Adapter-Fusion…”
Section: Additional Related Work (mentioning)
confidence: 99%
“…Structured pruning for finetuning specifically has seen various new findings. Ben-Zaken et al. (2021) propose Bias-terms Fine-tuning (BitFit), which freezes all pre-trained weights aside from bias terms for finetuning, resulting in diff masks with less than 0.1% of the original parameters. Since it does not introduce any new parameters or stochastic gates, this method is very simple to implement while almost reaching the performance of DiffPruning with BERT large on the GLUE benchmark.…”
Section: Parameter-efficient Learning (mentioning)
confidence: 99%
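As a rough sanity check on the "less than 0.1% of original parameters" figure quoted above, one can simply count bias parameters in a public BERT checkpoint. The sketch below assumes the HuggingFace Transformers library; bert-base-uncased is an illustrative choice, and the exact fraction varies with model size.

```python
# Count bias parameters relative to the full model (illustrative checkpoint).
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
total = sum(p.numel() for p in model.parameters())
biases = sum(p.numel() for n, p in model.named_parameters() if n.endswith(".bias"))
print(f"bias parameters: {biases:,} / {total:,} = {100 * biases / total:.3f}%")
```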
“…Recent works on parameter-efficient (PE) finetuning address this issue by introducing methods that alternatively rely on only changing a tiny set of extra parameters (Houlsby et al., 2019; Li and Liang, 2021; Hambardzumyan et al., 2021; Lester et al., 2021; Hu et al., 2022; He et al., 2022) or a small fraction of the existing model's parameters (Zaken et al., 2021; Gheini et al., 2021). These methods have been shown to be competitive with full fine-tuning despite modifying only as little as 0.01% of all the parameters (Liu et al., 2022).…”
Section: Introduction (mentioning)
confidence: 99%