Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.225

CDLM: Cross-Document Language Modeling

Abstract: We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM …
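The two ideas in the abstract (pretraining over a set of related documents, and giving masked tokens dynamic global attention over the entire input) can be illustrated with a minimal sketch. The sketch below assumes the Hugging Face `transformers` Longformer implementation and the `allenai/longformer-base-4096` checkpoint; the document-set construction, separator choice, and 15% masking rate are illustrative assumptions, not the released CDLM training recipe.

```python
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# A "document set": several related documents concatenated into one long input,
# separated by the tokenizer's sep token (separator choice is an assumption here).
related_docs = [
    "First document about the shared event ...",
    "Second document about the same event ...",
    "Third document about the same event ...",
]
text = f" {tokenizer.sep_token} ".join(related_docs)
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Mask ~15% of non-special tokens (illustrative rate, not the paper's exact setting).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
).unsqueeze(0)
mask = (torch.rand(input_ids.shape) < 0.15) & ~special
input_ids[mask] = tokenizer.mask_token_id
labels[~mask] = -100  # compute the MLM loss only on masked positions

# Dynamic global attention: every masked position gets global attention, so its
# prediction can condition on the entire multi-document input, not just a local window.
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[mask] = 1

out = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    global_attention_mask=global_attention_mask,
    labels=labels,
)
out.loss.backward()  # one cross-document MLM training step (optimizer omitted)
```

Assigning global attention dynamically to the masked positions is what lets each prediction draw on all documents in the set rather than only on the local attention window of a long-range transformer.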

Cited by 27 publications (28 citation statements) | References 21 publications
“…We first pretrain the LED-base model on the masked language modeling (MLM) task (Devlin et al., 2019) using related work sections from S2ORC papers in the computer science domain, as well as on the cross-document language modeling (CDLM) task (Caciularu et al., 2021), which aligns masked citation sentences with their context sentences and the full text of their cited papers. We further pretrain the LED encoder with the three CORWA sub-tasks (Supplementary Table 6).…”
Section: Experimental Setting (mentioning)
confidence: 99%
“…IE is a key component in supporting knowledge acquisition and it impacts a wide spectrum of knowledge-driven AI applications. We will conclude the tutorial by presenting further challenges and potential research topics in identifying trustworthiness of extracted content (Zhang et al., 2020b), IE with quantitative reasoning (Elazar et al., 2019), cross-document IE (Caciularu et al., 2021), incorporating domain-specific knowledge (Zhang et al., 2021c), extension to knowledge reasoning and prediction, modeling of label semantics (Mueller et al., 2022; Ma et al., 2022; Chen et al., 2020a), and challenges for acquiring implicit but essential information from corpora that potentially involve reporting bias (Sap et al., 2020).…”
Section: Future Research Directions [30min] (mentioning)
confidence: 99%
“…For completeness, we next show the relative advantage of our denoising method also when applied to several sentence-level downstream benchmarks. While contextualized embeddings dominate a wide range of sentence- and document-level NLP tasks (Peters et al., 2018; Devlin et al., 2019; Caciularu et al., 2021), we assessed the relative advantage of our denoising method when utilizing (non-contextualized) word embeddings in sentence- and document-level settings. We applied the exact procedure proposed in Li et al. (2017) and Rogers et al. (2018), as an effective benchmark for the quality of static embedding models.…”
Section: Evaluations On Downstream Tasks (mentioning)
confidence: 99%