“…In text summarization, manual evaluation, as exemplified by the Pyramid method (Nenkova and Passonneau, 2004), is the gold standard. However, due to the time required and the relatively high cost of annotation, the great majority of research papers on summarization rely exclusively on automatic evaluation metrics such as ROUGE (Lin, 2004), JS-2 (Louis and Nenkova, 2013), S3 (Peyrard et al., 2017), BERTScore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). Among these metrics, ROUGE is by far the most popular, yet there is relatively little discussion of how ROUGE may deviate from human judgment, or of the potential for this deviation to change conclusions drawn about the relative merits of baseline and proposed methods.…”
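To make the notion of metric–human deviation concrete, the sketch below scores a few summaries with ROUGE-2 (using Google's `rouge-score` package) and measures rank correlation against human ratings with Kendall's tau. The reference, the system summaries, the ratings, and the choice of correlation statistic are all illustrative assumptions for this sketch, not data or a protocol from the paper.

```python
# A minimal sketch of comparing ROUGE against human judgments.
# Requires `pip install rouge-score scipy`. All texts and ratings
# below are hypothetical placeholders, not data from the paper.
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

reference = "the cat sat on the mat and watched the birds outside"
system_summaries = [
    "the cat sat on the mat watching birds",            # near-extractive
    "a cat observed birds while sitting on a mat",      # good paraphrase
    "the dog barked at the mailman all afternoon",      # off-topic
]
# Hypothetical human quality ratings for the same summaries (higher = better).
human_ratings = [4.5, 4.0, 1.0]

# Score each summary against the reference with ROUGE-2 F1.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
rouge_f1 = [
    scorer.score(reference, summary)["rouge2"].fmeasure
    for summary in system_summaries
]

# Rank correlation between the metric and human judgment: values well
# below 1.0 indicate the kind of deviation discussed above.
tau, p_value = kendalltau(rouge_f1, human_ratings)
print(f"ROUGE-2 F1 scores: {rouge_f1}")
print(f"Kendall's tau vs. human ratings: {tau:.3f} (p={p_value:.3f})")
```

Note that the deliberately chosen paraphrase receives a ROUGE-2 score near zero despite its high human rating, since n-gram overlap rewards lexical matching rather than meaning; this is one simple mechanism by which ROUGE can diverge from human judgment and reorder system comparisons.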