2021
DOI: 10.48550/arxiv.2106.11388
Preprint

How well do you know your summarization datasets?

Abstract: State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics (data noise, summarization complexity, etc.) of these datasets, and how these affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyse 600 samples from three popular summarization datasets. Our study is driven by a six-class typology which captures different noise typ…
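The abstract's concern about the reliability of ROUGE hinges on how ROUGE compares a system summary against a (possibly noisy) reference via n-gram overlap. As a point of reference, here is a minimal sketch assuming the `rouge_score` Python package; the example strings are hypothetical and not drawn from the paper's data.

```python
# Minimal ROUGE sketch, assuming the google-research `rouge_score` package.
# The reference/system strings below are illustrative, not from the studied datasets.
from rouge_score import rouge_scorer

reference = "The committee approved the new climate policy on Tuesday."
system = "A new climate policy was approved by the committee."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, system)  # maps metric name -> Score(precision, recall, fmeasure)

for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```

If the reference itself contains noise (boilerplate, truncation, unrelated text), the overlap is computed against that noise, which is one way dataset quality feeds directly into metric reliability.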

Cited by 1 publication (4 citation statements)
References 46 publications
“…Glaringly, in our experiment, the problem of noisy data affects more than 60% of the annotated ground-truth summaries in the randomly sampled arXiv test set. This is greater than the 54% detected in the XSUM dataset [111]. While many of the errors and noises are minor, more than 15% of the reference summaries have significant errors where at least half of the summary contains errors, rendering the summaries unreadable.…”
Section: Fine-grained Analysis on arXiv
confidence: 85%
“…For example, Beltagy et al. [1] fine-tuned BERT on large-scale scientific paper datasets and found its performance in scientific domains to improve compared to the BERT-base model. As evidenced by Tejaswin et al. [111]'s experimental result, where BERTScore is found not to discriminate well between summaries with and without errors, this questions the use of the BERT-base model as the "independent evaluator" of candidate summaries across all domains.…”
Section: Hard Lexical Overlap
confidence: 96%
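To make the BERTScore observation concrete, the following is a minimal sketch of the kind of comparison implied above, assuming the `bert-score` Python package. The sentences and the expectation that the two F1 values come out close are illustrative assumptions, not results from either paper.

```python
# Minimal BERTScore sketch, assuming the `bert-score` package.
# Scores a faithful and an error-laden candidate against the same reference.
from bert_score import score

reference = ["The company reported record profits in the third quarter."]
clean_candidate = ["Record third-quarter profits were reported by the company."]
noisy_candidate = ["The company reported record losses in the third quarter."]

# score() returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
_, _, f1_clean = score(clean_candidate, reference, lang="en", verbose=False)
_, _, f1_noisy = score(noisy_candidate, reference, lang="en", verbose=False)

print(f"F1 (faithful candidate): {f1_clean.item():.3f}")
print(f"F1 (erroneous candidate): {f1_noisy.item():.3f}")  # may remain high despite the factual error
```

Because BERTScore rewards token-level semantic similarity, a candidate that shares most of its wording with the reference can score highly even when it contradicts it, which is consistent with the discrimination problem described above.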