Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-short.137
Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents

Abstract: Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, …
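The structure the abstract describes can be sketched as a data record in which each document is paired with several facet-specific summaries rather than one flat summary. The sketch below is illustrative only: the facet names follow the Emerald structured-abstract sections the benchmark is built on, but the field and class names are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FacetedSummary:
    """One summary per facet, mirroring Emerald structured abstracts."""
    purpose: str   # why the study was conducted
    method: str    # how the study was carried out
    findings: str  # what the study discovered
    value: str     # why the findings matter

@dataclass
class Example:
    """A single document-summary pair in a faceted dataset."""
    full_text: str
    summary: FacetedSummary

ex = Example(
    full_text="<body of a long scientific article>",
    summary=FacetedSummary(
        purpose="Why the study was conducted.",
        method="How the study was carried out.",
        findings="What the study discovered.",
        value="Why the findings matter.",
    ),
)

# Each facet is a separate target, so a model can be trained or
# evaluated per facet instead of on one undifferentiated summary.
for facet in ("purpose", "method", "findings", "value"):
    print(facet, getattr(ex.summary, facet))
```

Keeping the facets as distinct fields, rather than concatenating them, is what lets a system answer a reader's request for one perspective (say, only the findings) without generating the whole summary.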

Cited by 15 publications (20 citation statements); references 24 publications.
“…Each article consists of a background paragraph about the issue, along with a set of questions about the issue and short answers to those questions. FacetSum (Meng et al., 2021) is a found dataset consisting of a corpus of scientific papers paired with author-written summaries focusing on different aspects of the paper. WikiAsp (Hayashi et al., 2021) and AQuaMuSe (Kulkarni et al., 2020) are two heuristically created, multi-document QFS datasets derived from Wikipedia.…”
Section: Question-focused Summarization
Confidence: 99%
See 1 more Smart Citation
“…Each article consists of a background paragraph about the issue, along with a set of questions about the issue and short answers to those questions. FacetSum (Meng et al, 2021) is a found dataset consisting of a corpus of scientific papers paired with author-written summaries focusing on different aspects of the paper. WikiAsp (Hayashi et al, 2021) and AQuaMuSe (Kulkarni et al, 2020) are two heuristically created, multidocument QFS datasets derived from Wikipedia.…”
Section: Question-focused Summarizationmentioning
confidence: 99%
“…For example, many researchers and organizations are unwilling to host or distribute the CNN/DailyMail dataset, despite it being one of the most popular summarization datasets to experiment on. Similarly, several recent summarization datasets built on data such as scientific journal papers (Meng et al., 2021) or SparkNotes book summaries (Ladhak et al., 2020) have never been made available to researchers, with the dataset creators instead asking potential data users to rescrape them individually, which can be a serious obstacle to reproducibility.…”
Section: Introduction
Confidence: 99%
“…Yasunaga et al. [57] efficiently create a dataset for the computational linguistics domain by manually exploiting the structure of papers. Meng et al. [38] present a dataset that contains four aspect-specific summaries for each paper, making it possible to provide summaries tailored to user requests. Lu et al. [35] present a large-scale dataset for multi-document summarization of scientific papers, for which models must summarize multiple documents at once.…”
Section: Related Work
Confidence: 99%
“…While hierarchical encoding has been investigated (Zhang et al., 2019; Balachandran et al., 2021), the need to train large numbers of additional parameters increases the memory footprint and thus limits the allowed input length. As for the output, the structure of single-document summaries remains largely "flat", such as a list of aspects (Meng et al., 2021). We argue that it is imperative to develop systems that can output summaries with rich structures to support knowledge acquisition, which is especially critical for long documents that cover numerous subjects with varying details (Huang et al., 2021; Kryściński et al., 2021).…”
Section: Introduction
Confidence: 99%