“…Existing work often limits model comparisons to only a few baselines and offers human evaluations which are largely inconsistent with prior work. Additionally, despite problems associated with ROUGE when used outside of its original setting (Liu and Liu, 2008; Cohan and Goharian, 2016), as well as the introduction of many variations on ROUGE (Zhou et al., 2006; Ng and Abrecht, 2015; Ganesan, 2015; ShafieiBavani et al., 2018) and other text generation metrics (Peyrard, 2019; Zhao et al., 2019; Zhang et al., 2020; Scialom et al., 2019; Clark et al., 2019), ROUGE has remained the default automatic evaluation metric. We believe that the shortcomings of the current evaluation protocol are partially caused by the lack of easy-to-use resources for evaluation, both in the form of simplified evaluation toolkits and large collections of model outputs.…”