2021 · Preprint
DOI: 10.48550/arxiv.2111.09832

Merging Models with Fisher-Weighted Averaging

Abstract: Transfer learning provides a way of leveraging knowledge from one task when learning another task. Performing transfer learning typically involves iteratively updating a model's parameters through gradient descent on a training dataset. In this paper, we introduce a fundamentally different method for transferring knowledge across models that amounts to "merging" multiple models into one. Our approach effectively involves computing a weighted average of the models' parameters. We show that this averaging is equ…
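The abstract describes merging as computing a weighted average of the models' parameters, with the weights (per the title) coming from Fisher information. Below is a minimal sketch of that idea, assuming each fine-tuned model is available as a PyTorch state dict together with a diagonal Fisher estimate of the same layout; the function name and the eps smoothing term are illustrative and not taken from the paper's released code:

```python
import torch

def fisher_weighted_average(param_dicts, fisher_dicts, eps=1e-8):
    """Merge several same-architecture models into one set of weights.

    param_dicts:  list of state dicts (same keys and shapes, float tensors)
    fisher_dicts: list of per-parameter diagonal Fisher estimates, same layout
    Each merged parameter is sum_i F_i * theta_i / sum_i F_i.
    """
    merged = {}
    for name in param_dicts[0]:
        num = torch.zeros_like(param_dicts[0][name])
        den = torch.zeros_like(param_dicts[0][name])
        for params, fisher in zip(param_dicts, fisher_dicts):
            num += fisher[name] * params[name]
            den += fisher[name]
        merged[name] = num / (den + eps)  # eps guards against zero Fisher mass
    return merged
```

Setting every Fisher entry to 1 reduces this to a plain unweighted average of the parameters.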

Cited by 7 publications (12 citation statements) · References 40 publications

Citation statements (ordered by relevance):
“…As an alternative to ensembling, we can also use parameter averaging (Izmailov et al., 2018; Wortsman et al., 2022a; Matena and Raffel, 2021) to collapse the ELMFOREST into a single LM. This operation keeps inference cost constant regardless of how many ELMs are added to the set.…”
Section: Averaging ELM Parameters (mentioning)
confidence: 99%
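The quoted passage contrasts ensembling, whose inference cost grows with the number of models, with parameter averaging, which collapses the set into a single network of the original size. A minimal sketch of the uniform averaging it refers to, assuming all checkpoints share one architecture and are loaded as state dicts (the function name is illustrative):

```python
import torch

def average_checkpoints(state_dicts):
    """Average N same-architecture checkpoints into one set of weights.

    The result is a single network, so inference cost stays constant
    no matter how many checkpoints are averaged in.
    """
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged
```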
“…On the first iteration of BTM, E = ∅; we have no ELMs in the set to branch from. Instead of initializing the first ELMs of the set randomly, we hypothesize that ELM performance is boosted by branching from pretrained LM parameters, since multi-phase adaptive pretraining is an effective way to develop domain-specific language models (Gururangan et al., 2020), and parameter interpolation techniques work best with models that have a shared initialization (Izmailov et al., 2018; Frankle et al., 2020; Wortsman et al., 2022b; Matena and Raffel, 2021; Wortsman et al., 2022a). Specifically, we perform a seed phase, training a seed LM θ_seed on some data corpus d_seed, which can be used to initialize the first batch of ELMs in the set.…”
Section: Step 0 (Initialization): Seeding the ELMFOREST (mentioning)
confidence: 99%
“…The results are reported on their development set following . MPQA (Wiebe et al., 2005) and Subj (Pang & Lee, 2004) are used for polarity and subjectivity detection, where we follow Matena and Raffel. Matena & Raffel (2021) propose to merge pre-trained language models which are fine-tuned on various text classification tasks. Wortsman et al. (2022) explores averaging model weights from various independent runs on the same task with different hyper-parameter configurations.…”
Section: Few-shot Performance (mentioning)
confidence: 99%
“…Our adapter merging is inspired by recent works on model weight averaging like model soups (Wortsman et al., 2022) and multi BERTs (Devlin et al., 2019). Such weight averaging of models with different random initialization has been shown to improve model performance in recent works (Matena & Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020) that show the optimized models to lie in the same basin of error landscape.…”
Section: Introduction (mentioning)
confidence: 99%