2021 · Preprint
DOI: 10.48550/arxiv.2111.09832

Merging Models with Fisher-Weighted Averaging

Abstract: Transfer learning provides a way of leveraging knowledge from one task when learning another task. Performing transfer learning typically involves iteratively updating a model's parameters through gradient descent on a training dataset. In this paper, we introduce a fundamentally different method for transferring knowledge across models that amounts to "merging" multiple models into one. Our approach effectively involves computing a weighted average of the models' parameters. We show that this averaging is equ…
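The abstract describes merging as computing a weighted average of the models' parameters, with the weights (per the title) coming from Fisher information. Below is a minimal sketch of that idea, assuming each fine-tuned model is available as a PyTorch state dict together with a diagonal Fisher estimate of the same layout; the function name and the eps smoothing term are illustrative and not taken from the paper's released code:

```python
import torch

def fisher_weighted_average(param_dicts, fisher_dicts, eps=1e-8):
    """Merge several same-architecture models into one set of weights.

    param_dicts:  list of state dicts (same keys and shapes, float tensors)
    fisher_dicts: list of per-parameter diagonal Fisher estimates, same layout
    Each merged parameter is sum_i F_i * theta_i / sum_i F_i.
    """
    merged = {}
    for name in param_dicts[0]:
        num = torch.zeros_like(param_dicts[0][name])
        den = torch.zeros_like(param_dicts[0][name])
        for params, fisher in zip(param_dicts, fisher_dicts):
            num += fisher[name] * params[name]
            den += fisher[name]
        merged[name] = num / (den + eps)  # eps guards against zero Fisher mass
    return merged
```

Setting every Fisher entry to 1 reduces this to a plain unweighted average of the parameters.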

Cited by 7 publications (12 citation statements) · References 40 publications

Citation statements (ordered by relevance):
“…As an alternative to ensembling, we can also use parameter averaging (Izmailov et al., 2018; Wortsman et al., 2022a; Matena and Raffel, 2021) to collapse the ELMFOREST into a single LM. This operation keeps inference cost constant regardless of how many ELMs are added to the set.…”
Section: Averaging ELM Parameters (mentioning)
confidence: 99%
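The quoted passage contrasts ensembling, whose inference cost grows with the number of models, with parameter averaging, which collapses the set into a single network of the original size. A minimal sketch of the uniform averaging it refers to, assuming all checkpoints share one architecture and are loaded as state dicts (the function name is illustrative):

```python
import torch

def average_checkpoints(state_dicts):
    """Average N same-architecture checkpoints into one set of weights.

    The result is a single network, so inference cost stays constant
    no matter how many checkpoints are averaged in.
    """
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged
```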
“…On the first iteration of BTM, E = ∅; we have no ELMs in the set to branch from. Instead of initializing the first ELMs of the set randomly, we hypothesize that ELM performance is boosted by branching from pretrained LM parameters, since multi-phase adaptive pretraining is an effective way to develop domain-specific language models (Gururangan et al., 2020), and parameter interpolation techniques work best with models that have a shared initialization (Izmailov et al., 2018; Frankle et al., 2020; Wortsman et al., 2022b; Matena and Raffel, 2021; Wortsman et al., 2022a). Specifically, we perform a seed phase, training a seed LM θ_seed on some data corpus d_seed, which can be used to initialize the first batch of ELMs in the set.…”
Section: Step 0 (Initialization): Seeding the ELMFOREST (mentioning)
confidence: 99%
“…The results are reported on their development set following . MPQA (Wiebe et al., 2005) and Subj (Pang & Lee, 2004) are used for polarity and subjectivity detection, where we follow Matena and Raffel. Matena & Raffel (2021) propose to merge pre-trained language models which are fine-tuned on various text classification tasks. Wortsman et al. (2022) explores averaging model weights from various independent runs on the same task with different hyper-parameter configurations.…”
Section: Few-shot Performance (mentioning)
confidence: 99%
“…Our adapter merging is inspired by recent works on model weight averaging like model soups (Wortsman et al., 2022) and multi BERTs (Devlin et al., 2019). Such weight averaging of models with different random initialization has been shown to improve model performance in recent works (Matena & Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020) that show the optimized models to lie in the same basin of error landscape.…”
Section: Introduction (mentioning)
confidence: 99%