Findings of the Association for Computational Linguistics: ACL 2022
DOI: 10.18653/v1/2022.findings-acl.220

ELLE: Efficient Lifelong Pre-training for Emerging Data

Abstract: Current pre-trained language models (PLM) are typically trained with static data, ignoring that in real-world scenarios, streaming data of various sources may continuously grow. This requires PLMs to integrate the information from all the sources in a lifelong manner. Although this goal could be achieved by exhaustive pre-training on all the existing data, such a process is known to be computationally expensive. To this end, we propose ELLE, aiming at efficient lifelong pre-training for emerging data. Specifica…

Cited by 6 publications (2 citation statements)
References 14 publications (18 reference statements)
“…However, larger models require greater computational demands (Patterson et al., 2021). To this end, researchers propose to accelerate pre-training by mixed-precision training (Shoeybi et al., 2019), distributed training (Shoeybi et al., 2019), large batch optimization (You et al., 2020), etc. Another line of methods (Gong et al., 2019; Chen et al., 2022; Qin et al., 2022) proposes to pre-train larger PLMs progressively. They first train a small PLM, and then gradually increase the depth or width of the network based on parameter recycling (PR).…”
Section: Related Work
mentioning confidence: 99%
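The "grow the width based on parameter recycling" idea mentioned in the statement above can be illustrated with a Net2Net-style expansion. The sketch below is a minimal, hypothetical PyTorch example (the helper name `widen_pair` and its details are ours, not code from ELLE or the citing papers): it widens the hidden dimension between two linear layers while approximately preserving the function the small model computes, which is the core of reusing a small PLM's parameters to initialize a larger one.

```python
import torch
import torch.nn as nn

def widen_pair(fc1: nn.Linear, fc2: nn.Linear, new_width: int):
    """Widen the hidden dimension between two linear layers (Net2Net-style sketch).

    New hidden units are copies of existing ones, and fc2's incoming weights
    are rescaled so duplicated units are not double-counted, approximately
    preserving the function of the smaller model.
    """
    old_width = fc1.out_features
    assert new_width >= old_width
    # Map every new unit to a source unit: identity for the first old_width
    # units, randomly chosen existing units for the newly added ones.
    mapping = torch.cat([
        torch.arange(old_width),
        torch.randint(0, old_width, (new_width - old_width,)),
    ])
    counts = torch.bincount(mapping, minlength=old_width).float()

    wide_fc1 = nn.Linear(fc1.in_features, new_width)
    wide_fc2 = nn.Linear(new_width, fc2.out_features)
    with torch.no_grad():
        wide_fc1.weight.copy_(fc1.weight[mapping])   # copy rows of the small layer
        wide_fc1.bias.copy_(fc1.bias[mapping])
        # Divide each copied column by how often its source unit was reused,
        # so summing over duplicates reproduces the original pre-activation.
        wide_fc2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])
        wide_fc2.bias.copy_(fc2.bias)
    return wide_fc1, wide_fc2
```

Depth growth in progressive methods follows a similar spirit, typically by stacking or duplicating whole layers of the already-trained small model rather than individual units.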
“…CPT is closely related to ELLE (Qin et al., 2022), which does continual pre-training. The key difference is that ELLE starts from random initialization, while our CPT starts from a pre-trained LM.…”
Section: Introduction
mentioning confidence: 99%
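The distinction drawn in the statement above, continuing pre-training from an already pre-trained LM versus from random initialization, can be sketched with the Hugging Face transformers API. This is an illustrative assumption about the setup, not code from either paper; the checkpoint name "roberta-base" is a placeholder, not necessarily the backbone either work uses.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Continual pre-training starting from an already pre-trained LM
# (the CPT setting described in the quote above).
cpt_model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Starting from random initialization (the ELLE setting described above):
# same architecture, freshly initialized weights.
config = AutoConfig.from_pretrained("roberta-base")
scratch_model = AutoModelForMaskedLM.from_config(config)
```

In both cases the resulting model would then be trained further on the newly arriving data stream; only the starting point differs.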