Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
DOI: 10.18653/v1/P19-1502

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract: In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate.
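To make the paper's central claim concrete, the following is a minimal sketch, not the paper's actual code, of correlating a metric with human judgments separately in the average-scoring and high-scoring ranges. The function name, the 0.75 quantile split, and the use of Kendall's tau are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def correlation_by_range(human, metric, quantile=0.75):
    """Correlate metric scores with human judgments separately in the
    average-scoring and high-scoring ranges, split at a quantile of the
    human scores. The split point and names are illustrative assumptions."""
    human, metric = np.asarray(human, float), np.asarray(metric, float)
    cut = np.quantile(human, quantile)
    high = human >= cut                      # range where modern systems operate
    tau_avg, _ = kendalltau(human[~high], metric[~high])
    tau_high, _ = kendalltau(human[high], metric[high])
    return tau_avg, tau_high
```

Passing a second metric's scores in place of `human` measures inter-metric agreement the same way, which is how disagreement between metrics in the high-scoring range can be exposed.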

Cited by 46 publications (47 citation statements) · References 15 publications
Citation types: 4 supporting, 38 mentioning, 0 contrasting

Citation statements, ordered by relevance:
“…Results reveal that the lexical metric SENTBLEU can correctly assign lower scores to system translations of low quality, while it struggles in judging system translations of high quality by assigning them lower scores. Our finding agrees with the observations found in Chaganty et al. (2018) and Novikova et al. (2017): lexical metrics correlate better with human judgments on texts of low quality than of high quality. Peyrard (2019b) further shows that lexical metrics cannot be trusted because…”
supporting
confidence: 91%
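For readers unfamiliar with SENTBLEU, the sentence-level BLEU variant discussed above, here is a hedged sketch using NLTK. The toy sentences are invented for illustration, and smoothing is applied because short hypotheses often have zero higher-order n-gram overlap; this is not the citing paper's evaluation setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented toy data, for illustration only.
reference = "the cat sat on the mat".split()
low_quality = "mat a the dog".split()
high_quality = "the cat is sitting on the mat".split()

smooth = SmoothingFunction().method1
for name, hyp in [("low", low_quality), ("high", high_quality)]:
    score = sentence_bleu([reference], hyp, smoothing_function=smooth)
    print(f"{name}-quality hypothesis: BLEU = {score:.3f}")
```

The lexical overlap captured here is exactly what becomes uninformative once hypotheses are fluent and near-correct, which is the failure mode the excerpt describes.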
“…Our finding agrees with the observations found in Chaganty et al. (2018) and Novikova et al. (2017): lexical metrics correlate better with human judgments on texts of low quality than of high quality. Peyrard (2019b) further shows that lexical metrics cannot be trusted because they strongly disagree on high-scoring system outputs.…”
[Figure 3 of the citing paper: correlation for the similar (de-en) and distant (zh-en) language pairs, where the bordered area shows correlations between human assessment and metrics, the rest shows inter-correlations across metrics, and DA is direct assessment rated by language experts.]
Section: Further Analysis
mentioning
confidence: 99%
“…One can repeat our bias study on evaluation metrics. Peyrard (2019b) showed that widely used evaluation metrics (e.g., ROUGE, Jensen-Shannon divergence) are strongly mismatched in scoring summary results. One can compare different measures (e.g., n-gram recall, sentence overlaps, embedding similarities, word connectedness, centrality, importance reflected by discourse structures), and study the bias of each with respect to systems and corpora.…”
Section: Discussion
mentioning
confidence: 99%
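Because the excerpt names Jensen-Shannon divergence alongside ROUGE, here is a minimal sketch of a JS-divergence score over unigram distributions. Whitespace tokenization and additive smoothing are assumptions for illustration, not the exact formulation used by Peyrard (2019b).

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(summary_text, reference_text, eps=1e-12):
    """Jensen-Shannon divergence between the unigram distributions of a
    summary and a reference text (lower = more similar). Tokenization and
    smoothing here are illustrative assumptions."""
    s, r = Counter(summary_text.split()), Counter(reference_text.split())
    vocab = sorted(set(s) | set(r))
    p = np.array([s[w] for w in vocab], float) + eps
    q = np.array([r[w] for w in vocab], float) + eps
    p, q = p / p.sum(), q / q.sum()
    # scipy returns the JS distance, i.e. the square root of the divergence
    return jensenshannon(p, q) ** 2
```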
“…However, the classic TAC meta-evaluation datasets are now 6-12 years old and it is not clear whether conclusions found there will hold with modern systems and summarization tasks. Two earlier works exemplify this disconnect: (1) Peyrard (2019) observed that the human-annotated summaries in the TAC dataset are mostly of lower quality than those produced by modern systems, and that various automated evaluation metrics strongly disagree in the higher-scoring range in which current systems now operate. (2) Rankel et al. (2013) observed that the correlation between ROUGE and human judgments in the TAC dataset decreases when looking at the best systems only, even for systems from eight years ago, which are far from today's state-of-the-art.…”
Section: Introduction
mentioning
confidence: 99%
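The Rankel et al. (2013) observation quoted above, that metric-human correlation drops when only the best systems are compared, can be checked with a short sketch along these lines; the one-score-per-system layout and the choice of Kendall's tau are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def topk_correlation(human, metric, ks=(None, 10, 5)):
    """Kendall's tau between system-level human and metric scores,
    restricted to the k systems ranked best by humans (None = all).
    One score per system is an illustrative assumption."""
    human, metric = np.asarray(human, float), np.asarray(metric, float)
    order = np.argsort(human)[::-1]          # best systems first
    taus = {}
    for k in ks:
        idx = order[:k] if k is not None else order
        tau, _ = kendalltau(human[idx], metric[idx])
        taus[k if k is not None else len(order)] = tau
    return taus
```

Under the quoted finding, the tau for the top-5 slice would come out noticeably lower than the tau over all systems.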