Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
DOI: 10.18653/v1/P19-1502

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract: In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate.
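To make the paper's central claim concrete, the following is a minimal sketch, not the paper's actual code, of correlating a metric with human judgments separately in the average-scoring and high-scoring ranges. The function name, the 0.75 quantile split, and the use of Kendall's tau are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def correlation_by_range(human, metric, quantile=0.75):
    """Correlate metric scores with human judgments separately in the
    average-scoring and high-scoring ranges, split at a quantile of the
    human scores. The split point and names are illustrative assumptions."""
    human, metric = np.asarray(human, float), np.asarray(metric, float)
    cut = np.quantile(human, quantile)
    high = human >= cut                      # range where modern systems operate
    tau_avg, _ = kendalltau(human[~high], metric[~high])
    tau_high, _ = kendalltau(human[high], metric[high])
    return tau_avg, tau_high
```

Passing a second metric's scores in place of `human` measures inter-metric agreement the same way, which is how disagreement between metrics in the high-scoring range can be exposed.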

Cited by 46 publications (47 citation statements) · References 15 publications
Citation types: 4 supporting, 38 mentioning, 0 contrasting

Citation statements, ordered by relevance:
“…Results reveal that the lexical metric SENTBLEU can correctly assign lower scores to system translations of low quality, while it struggles in judging system translations of high quality by assigning them lower scores. Our finding agrees with the observations found in Chaganty et al. (2018) and Novikova et al. (2017): lexical metrics correlate better with human judgments on texts of low quality than of high quality. Peyrard (2019b) further shows that lexical metrics cannot be trusted because…”
supporting
confidence: 91%
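For readers unfamiliar with SENTBLEU, the sentence-level BLEU variant discussed above, here is a hedged sketch using NLTK. The toy sentences are invented for illustration, and smoothing is applied because short hypotheses often have zero higher-order n-gram overlap; this is not the citing paper's evaluation setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented toy data, for illustration only.
reference = "the cat sat on the mat".split()
low_quality = "mat a the dog".split()
high_quality = "the cat is sitting on the mat".split()

smooth = SmoothingFunction().method1
for name, hyp in [("low", low_quality), ("high", high_quality)]:
    score = sentence_bleu([reference], hyp, smoothing_function=smooth)
    print(f"{name}-quality hypothesis: BLEU = {score:.3f}")
```

The lexical overlap captured here is exactly what becomes uninformative once hypotheses are fluent and near-correct, which is the failure mode the excerpt describes.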
“…Our finding agrees with the observations found in Chaganty et al. (2018) and Novikova et al. (2017): lexical metrics correlate better with human judgments on texts of low quality than of high quality. Peyrard (2019b) further shows that lexical metrics cannot be trusted because they strongly disagree on high-scoring system outputs.…”
[Figure 3 of the citing paper: correlation for the similar (de-en) and distant (zh-en) language pairs, where the bordered area shows correlations between human assessment and metrics, the rest shows inter-correlations across metrics, and DA is direct assessment rated by language experts.]
Section: Further Analysis
mentioning
confidence: 99%
“…One can repeat our bias study on evaluation metrics. Peyrard (2019b) showed that widely used evaluation metrics (e.g., ROUGE, Jensen-Shannon divergence) are strongly mismatched in scoring summary results. One can compare different measures (e.g., n-gram recall, sentence overlaps, embedding similarities, word connectedness, centrality, importance reflected by discourse structures), and study the bias of each with respect to systems and corpora.…”
Section: Discussion
mentioning
confidence: 99%
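Because the excerpt names Jensen-Shannon divergence alongside ROUGE, here is a minimal sketch of a JS-divergence score over unigram distributions. Whitespace tokenization and additive smoothing are assumptions for illustration, not the exact formulation used by Peyrard (2019b).

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(summary_text, reference_text, eps=1e-12):
    """Jensen-Shannon divergence between the unigram distributions of a
    summary and a reference text (lower = more similar). Tokenization and
    smoothing here are illustrative assumptions."""
    s, r = Counter(summary_text.split()), Counter(reference_text.split())
    vocab = sorted(set(s) | set(r))
    p = np.array([s[w] for w in vocab], float) + eps
    q = np.array([r[w] for w in vocab], float) + eps
    p, q = p / p.sum(), q / q.sum()
    # scipy returns the JS distance, i.e. the square root of the divergence
    return jensenshannon(p, q) ** 2
```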
“…However, the classic TAC meta-evaluation datasets are now 6-12 years old and it is not clear whether conclusions found there will hold with modern systems and summarization tasks. Two earlier works exemplify this disconnect: (1) Peyrard (2019) observed that the human-annotated summaries in the TAC dataset are mostly of lower quality than those produced by modern systems, and that various automated evaluation metrics strongly disagree in the higher-scoring range in which current systems now operate. (2) Rankel et al. (2013) observed that the correlation between ROUGE and human judgments in the TAC dataset decreases when looking at the best systems only, even for systems from eight years ago, which are far from today's state-of-the-art.…”
Section: Introduction
mentioning
confidence: 99%
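The Rankel et al. (2013) observation quoted above, that metric-human correlation drops when only the best systems are compared, can be checked with a short sketch along these lines; the one-score-per-system layout and the choice of Kendall's tau are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def topk_correlation(human, metric, ks=(None, 10, 5)):
    """Kendall's tau between system-level human and metric scores,
    restricted to the k systems ranked best by humans (None = all).
    One score per system is an illustrative assumption."""
    human, metric = np.asarray(human, float), np.asarray(metric, float)
    order = np.argsort(human)[::-1]          # best systems first
    taus = {}
    for k in ks:
        idx = order[:k] if k is not None else order
        tau, _ = kendalltau(human[idx], metric[idx])
        taus[k if k is not None else len(order)] = tau
    return taus
```

Under the quoted finding, the tau for the top-5 slice would come out noticeably lower than the tau over all systems.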