Proceedings of the Workshop on New Frontiers in Summarization 2017
DOI: 10.18653/v1/w17-4510
Learning to Score System Summaries for Better Content Selection Evaluation.

Abstract: The evaluation of summaries is a challenging but crucial task of the summarization field. In this work, we propose to learn an automatic scoring metric based on the human judgements available as part of classical summarization datasets like TAC-2008 and TAC-2009. Any existing automatic scoring metric can be included as a feature; the model learns the combination exhibiting the best correlation with human judgments. The reliability of the new metric is tested in a further manual evaluation where we ask humans …
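To make the approach described in the abstract concrete, below is a minimal sketch: fit a regression model over the scores of existing automatic metrics and use its prediction as the learned metric, then meta-evaluate it by its correlation with human scores. The data, the choice of features, and the linear model are hypothetical placeholders for illustration, not the authors' actual setup.

```python
# Hedged sketch (not the authors' exact method): learn a combination of
# existing automatic metric scores that best predicts human judgements
# such as Pyramid or Responsiveness.
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 200 summaries, each described by 4 automatic metric
# scores (the "features"), plus one human score per summary.
metric_scores = rng.random((200, 4))  # e.g. ROUGE-1, ROUGE-2, ROUGE-L, JS divergence
human_scores = metric_scores @ np.array([0.5, 1.0, 0.3, -0.2]) + rng.normal(0, 0.1, 200)

# Split to mimic training on one TAC year and testing on another.
X_train, X_test = metric_scores[:150], metric_scores[150:]
y_train, y_test = human_scores[:150], human_scores[150:]

# The learned metric is simply the regressor's prediction for a summary.
model = LinearRegression().fit(X_train, y_train)
learned_metric = model.predict(X_test)

# Meta-evaluation: correlation with human judgements, compared against
# one of the input metrics used on its own.
tau_learned, _ = kendalltau(learned_metric, y_test)
tau_single, _ = kendalltau(X_test[:, 1], y_test)  # a single metric alone (hypothetical)
print(f"Kendall tau, learned combination: {tau_learned:.3f}")
print(f"Kendall tau, single metric:       {tau_single:.3f}")
```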

Cited by 64 publications (46 citation statements)
References 18 publications
“…(3) Previous meta-evaluation studies (Novikova et al., 2017; Peyrard et al., 2017; Chaganty et al., 2018) conclude that automatic metrics tend to correlate well with humans at the system level but have poor correlations at the instance (here summary) level. We find this observation only holds on TAC-2008.…”
Section: Exp-IV: Evaluating Summaries
Mentioning confidence: 99%
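The system-level vs. summary-level distinction made in the statement above can be illustrated with a short sketch. All scores below are hypothetical placeholders; the point is only where the correlation is computed: over per-system averages versus within each topic across systems.

```python
# Hedged sketch of system-level vs. summary-level (instance-level)
# correlation in summarization meta-evaluation; scores are synthetic.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_systems, n_topics = 10, 40

# metric_scores[s, t] / human_scores[s, t]: score of system s on topic t.
metric_scores = rng.random((n_systems, n_topics))
human_scores = 0.6 * metric_scores + 0.4 * rng.random((n_systems, n_topics))

# System level: one point per system, using its average score over topics.
r_system, _ = pearsonr(metric_scores.mean(axis=1), human_scores.mean(axis=1))

# Summary level: correlate across systems within each topic, then average.
per_topic = [pearsonr(metric_scores[:, t], human_scores[:, t])[0]
             for t in range(n_topics)]
r_summary = float(np.mean(per_topic))

print(f"system-level Pearson r:  {r_system:.3f}")
print(f"summary-level Pearson r: {r_summary:.3f}")
```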
“…For example, MoverScore is the best performing metric for evaluating summaries on dataset TAC, but it is significantly worse than ROUGE-2 on our collected CNNDM set. Additionally, many previous works (Novikova et al., 2017; Peyrard et al., 2017; Chaganty et al., 2018) show that metrics have much lower correlations at comparing summaries than systems. For extractive summaries on CNNDM, however, most metrics are better at comparing summaries than systems.…”
Section: Introduction
Mentioning confidence: 97%
“…Additionally, both rewards require reference summaries. Louis and Nenkova (2013), Peyrard et al. (2017) and build feature-rich regression models to learn a summary evaluation metric directly from the human judgement scores (Pyramid and Responsiveness) provided in the TAC'08 and '09 datasets. Some features they use require reference summaries (e.g.…”
Section: Related Work
Mentioning confidence: 99%
“…ROUGE variants are based on word sequence overlap between a system summary and a reference summary, where each variant measures a different aspect of text comparison. Despite its pitfalls, ROUGE has shown reasonable correlation of its system scores to those obtained by manual evaluation methods (Lin, 2004; Over and James, 2004; Over et al., 2007; Nenkova et al., 2007; Louis and Nenkova, 2013; Peyrard et al., 2017), such as SEE (Lin, 2001), responsiveness (NIST, 2006) and Pyramid (Nenkova et al., 2007).…”
Section: Case Study Analysis
Mentioning confidence: 99%
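The word-sequence overlap idea behind the ROUGE variants mentioned above can be sketched in a few lines. This simplified recall-only version omits stemming, stopword options, and multi-reference aggregation handled by the official toolkit, so it is an illustration rather than a replacement for it.

```python
# Hedged sketch of ROUGE-N (recall form): clipped n-gram overlap between a
# system summary and a reference summary, divided by reference n-gram count.
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_summary, reference_summary, n=2):
    sys_ngrams = ngrams(system_summary.lower().split(), n)
    ref_ngrams = ngrams(reference_summary.lower().split(), n)
    if not ref_ngrams:
        return 0.0
    overlap = sum((sys_ngrams & ref_ngrams).values())  # min counts, i.e. clipped
    return overlap / sum(ref_ngrams.values())

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))
```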