Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.407

DEMix Layers: Disentangling Domains for Modular Language Modeling

Abstract: We introduce a new domain expert mixture (DEMIX) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMIX layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMIX layers reduce test-time perplexity (especially for out-of-domain data) and increase training efficiency […]
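As a rough illustration of the mechanism the abstract describes, below is a minimal sketch of a DEMix-style feedforward layer: one expert FFN per domain in place of a transformer block's single FFN, with experts selectable (or mixed) at inference and new experts addable after initial training. The class and parameter names here are ours for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a DEMix-style feedforward layer (names are ours,
# not the paper's code). Each domain gets its own expert FFN; a one-hot
# weight vector conditions hard on a known domain, while a soft
# distribution mixes experts for unseen or heterogeneous text.
import torch
import torch.nn as nn


def _expert_ffn(d_model: int, d_hidden: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(d_model, d_hidden),
        nn.GELU(),
        nn.Linear(d_hidden, d_model),
    )


class DEMixFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_domains: int):
        super().__init__()
        self.d_model, self.d_hidden = d_model, d_hidden
        # One expert feedforward network per training domain.
        self.experts = nn.ModuleList(
            _expert_ffn(d_model, d_hidden) for _ in range(num_domains)
        )

    def add_expert(self) -> None:
        # Modularity: append a new domain expert after initial training
        # without touching the existing experts.
        self.experts.append(_expert_ffn(self.d_model, self.d_hidden))

    def forward(self, x: torch.Tensor, domain_weights: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); domain_weights: (num_experts,)
        outputs = torch.stack([expert(x) for expert in self.experts], dim=0)
        # Weighted mixture over experts (hard selection if one-hot).
        return torch.einsum("e,ebsd->bsd", domain_weights, outputs)
```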

Cited by 26 publications (29 citation statements)
References 39 publications
“…However, our hierarchical adapter model permits adding modular components and we believe that it could potentially be used to detoxify language generation, following Liu et al. (2021). This is in line with recent work on sparse models (Gururangan et al., 2021; Artetxe et al., 2021). […] does not correspond exactly to the cluster obtained by an unsupervised, data-driven approach.…”
Section: Limitations and Risks (supporting)
confidence: 80%
“…Thus, using the GMM clusters and the hierarchical structure, without training more parameters, we are able to evaluate out-of-domain data using the adapters that were trained on the most related domains. This is similar to the "cached" setting in Gururangan et al. (2021), and it does require a held-out set of N sequences that are only used for finding the best path through the tree (and not for computing perplexity). This is a realistic setting when one has a significant amount of data from a single source, and we leave other approaches (e.g., finding the best path for every input sequence individually) to future work.…”
Section: Out-of-domain Results (mentioning)
confidence: 99%
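For intuition, the held-out selection step described in this excerpt can be sketched as follows: score each expert on a small held-out set and derive mixture weights (or a hard choice) from the per-expert likelihoods. The `model(..., expert_id=...)` interface and the softmax-over-negative-NLL weighting are assumptions for illustration, not the actual procedure of either paper.

```python
# Hypothetical sketch: pick/weight domain experts using a held-out set
# of sequences, before computing test perplexity on new data.
import torch


@torch.no_grad()
def estimate_expert_weights(model, held_out_batches, num_experts: int) -> torch.Tensor:
    # Accumulate total negative log-likelihood per expert on the held-out set.
    nll = torch.zeros(num_experts)
    for expert_id in range(num_experts):
        for input_ids, labels in held_out_batches:
            out = model(input_ids, expert_id=expert_id, labels=labels)  # assumed API
            nll[expert_id] += out.loss.item() * labels.numel()
    # Lower held-out NLL -> higher weight; a one-hot argmin recovers hard
    # selection of the single best expert.
    return torch.softmax(-nll, dim=0)
```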
“…They have recently re-gained interest for transformer-based models, where mixture-of-experts (MoE; Shazeer et al., 2017) approaches have enabled training trillion-parameter models in a distributed fashion (Fedus et al., 2021). More recently, modular MoE approaches have been shown to improve domain-specific pretraining of LMs (Gururangan et al., 2021). In a similar trend, 'expert' modules have been added to (non-modular) pre-trained LMs post-hoc, predominantly referred to as adapters (Rebuffi et al., 2017, 2018; Houlsby et al., 2019).…”
Section: Modular Language Models (mentioning)
confidence: 99%
“…Lifelong learning is also a hot topic for PLMs. Some target domain adaptation through continual pre-training (Gururangan et al., 2020), parameter-efficient adapters (He et al., 2021), and sparse expert models (Gururangan et al., 2021). Others focus on the incremental acquisition of factual knowledge that changes over time (Dhingra et al., 2021; Jang et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%