Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.407

DEMix Layers: Disentangling Domains for Modular Language Modeling

Abstract: We introduce a new domain expert mixture (DEMIX) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMIX layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMIX layers reduce test-time perplexity (especially for out-of-domain data) and increase training efficiency […]
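As a rough illustration of the mechanism the abstract describes, below is a minimal sketch of a DEMix-style feedforward layer: one expert FFN per domain in place of a transformer block's single FFN, with experts selectable (or mixed) at inference and new experts addable after initial training. The class and parameter names here are ours for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a DEMix-style feedforward layer (names are ours,
# not the paper's code). Each domain gets its own expert FFN; a one-hot
# weight vector conditions hard on a known domain, while a soft
# distribution mixes experts for unseen or heterogeneous text.
import torch
import torch.nn as nn


def _expert_ffn(d_model: int, d_hidden: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(d_model, d_hidden),
        nn.GELU(),
        nn.Linear(d_hidden, d_model),
    )


class DEMixFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_domains: int):
        super().__init__()
        self.d_model, self.d_hidden = d_model, d_hidden
        # One expert feedforward network per training domain.
        self.experts = nn.ModuleList(
            _expert_ffn(d_model, d_hidden) for _ in range(num_domains)
        )

    def add_expert(self) -> None:
        # Modularity: append a new domain expert after initial training
        # without touching the existing experts.
        self.experts.append(_expert_ffn(self.d_model, self.d_hidden))

    def forward(self, x: torch.Tensor, domain_weights: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); domain_weights: (num_experts,)
        outputs = torch.stack([expert(x) for expert in self.experts], dim=0)
        # Weighted mixture over experts (hard selection if one-hot).
        return torch.einsum("e,ebsd->bsd", domain_weights, outputs)
```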

Cited by 26 publications (29 citation statements)
References 39 publications
“…However, our hierarchical adapter model permits adding modular components and we believe that it could potentially be used to detoxify language generation, following Liu et al. (2021). This is in line with recent work on sparse models (Gururangan et al., 2021; Artetxe et al., 2021). […] does not correspond exactly to the cluster obtained by an unsupervised, data-driven approach.…”
Section: Limitations and Risks (supporting)
confidence: 80%
“…Thus, using the GMM clusters and the hierarchical structure, without training more parameters, we are able to evaluate out-of-domain data using the adapters that were trained on the most related domains. This is similar to the "cached" setting in Gururangan et al. (2021), and it does require a held-out set of N sequences that are only used for finding the best path through the tree (and not for computing perplexity). This is a realistic setting when one has a significant amount of data from a single source, and we leave other approaches (e.g., finding the best path for every input sequence individually) to future work.…”
Section: Out-of-domain Results (mentioning)
confidence: 99%
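For intuition, the held-out selection step described in this excerpt can be sketched as follows: score each expert on a small held-out set and derive mixture weights (or a hard choice) from the per-expert likelihoods. The `model(..., expert_id=...)` interface and the softmax-over-negative-NLL weighting are assumptions for illustration, not the actual procedure of either paper.

```python
# Hypothetical sketch: pick/weight domain experts using a held-out set
# of sequences, before computing test perplexity on new data.
import torch


@torch.no_grad()
def estimate_expert_weights(model, held_out_batches, num_experts: int) -> torch.Tensor:
    # Accumulate total negative log-likelihood per expert on the held-out set.
    nll = torch.zeros(num_experts)
    for expert_id in range(num_experts):
        for input_ids, labels in held_out_batches:
            out = model(input_ids, expert_id=expert_id, labels=labels)  # assumed API
            nll[expert_id] += out.loss.item() * labels.numel()
    # Lower held-out NLL -> higher weight; a one-hot argmin recovers hard
    # selection of the single best expert.
    return torch.softmax(-nll, dim=0)
```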
“…They have recently re-gained interest for transformer-based models, where mixture-of-experts (MoE; Shazeer et al., 2017) approaches have enabled training trillion-parameter models in a distributed fashion (Fedus et al., 2021). More recently, modular MoE approaches have been shown to improve domain-specific pretraining of LMs (Gururangan et al., 2021). In a similar trend, 'expert' modules have been added to (non-modular) pre-trained LMs post-hoc, predominantly referred to as adapters (Rebuffi et al., 2017, 2018; Houlsby et al., 2019).…”
Section: Modular Language Models (mentioning)
confidence: 99%
“…Lifelong learning is also a hot topic for PLMs. Some target domain adaptation through continual pre-training (Gururangan et al., 2020), parameter-efficient adapters (He et al., 2021), and sparse expert models (Gururangan et al., 2021). Others focus on the incremental acquisition of factual knowledge that changes over time (Dhingra et al., 2021; Jang et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%