Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.225

CDLM: Cross-Document Language Modeling

Abstract: We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM …
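The two ideas in the abstract (pretraining over a set of related documents, and giving masked tokens dynamic global attention over the entire input) can be illustrated with a minimal sketch. The sketch below assumes the Hugging Face `transformers` Longformer implementation and the `allenai/longformer-base-4096` checkpoint; the document-set construction, separator choice, and 15% masking rate are illustrative assumptions, not the released CDLM training recipe.

```python
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# A "document set": several related documents concatenated into one long input,
# separated by the tokenizer's sep token (separator choice is an assumption here).
related_docs = [
    "First document about the shared event ...",
    "Second document about the same event ...",
    "Third document about the same event ...",
]
text = f" {tokenizer.sep_token} ".join(related_docs)
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Mask ~15% of non-special tokens (illustrative rate, not the paper's exact setting).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
).unsqueeze(0)
mask = (torch.rand(input_ids.shape) < 0.15) & ~special
input_ids[mask] = tokenizer.mask_token_id
labels[~mask] = -100  # compute the MLM loss only on masked positions

# Dynamic global attention: every masked position gets global attention, so its
# prediction can condition on the entire multi-document input, not just a local window.
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[mask] = 1

out = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    global_attention_mask=global_attention_mask,
    labels=labels,
)
out.loss.backward()  # one cross-document MLM training step (optimizer omitted)
```

Assigning global attention dynamically to the masked positions is what lets each prediction draw on all documents in the set rather than only on the local attention window of a long-range transformer.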

Cited by 27 publications (28 citation statements) | References 21 publications
“…We first pretrain the LED-base model on the masked language modeling (MLM) task (Devlin et al., 2019) using related work sections from S2ORC papers in the computer science domain, as well as on the cross-document language modeling (CDLM) task (Caciularu et al., 2021), which aligns masked citation sentences with their context sentences and the full text of their cited papers. We further pretrain the LED encoder with the three CORWA sub-tasks (Supplementary Table 6).…”
Section: Experimental Setting (mentioning)
confidence: 99%
“…IE is a key component in supporting knowledge acquisition and it impacts a wide spectrum of knowledge-driven AI applications. We will conclude the tutorial by presenting further challenges and potential research topics in identifying trustworthiness of extracted content (Zhang et al., 2020b), IE with quantitative reasoning (Elazar et al., 2019), cross-document IE (Caciularu et al., 2021), incorporating domain-specific knowledge (Zhang et al., 2021c), extension to knowledge reasoning and prediction, modeling of label semantics (Mueller et al., 2022; Ma et al., 2022; Chen et al., 2020a), and challenges for acquiring implicit but essential information from corpora that potentially involve reporting bias (Sap et al., 2020).…”
Section: Future Research Directions [30min] (mentioning)
confidence: 99%
“…For completeness, we next show the relative advantage of our denoising method also when applied to several sentence-level downstream benchmarks. While contextualized embeddings dominate a wide range of sentence- and document-level NLP tasks (Peters et al., 2018; Devlin et al., 2019; Caciularu et al., 2021), we assessed the relative advantage of our denoising method when utilizing (non-contextualized) word embeddings in sentence- and document-level settings. We applied the exact procedure proposed in Li et al. (2017) and Rogers et al. (2018), as an effective benchmark for the quality of static embedding models.…”
Section: Evaluations On Downstream Tasks (mentioning)
confidence: 99%