Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1085

A Graph-theoretic Summary Evaluation for ROUGE

Abstract: ROUGE is one of the first and most widely used evaluation metrics for text summarization. However, its assessment merely relies on surface similarities between peer and model summaries. Consequently, ROUGE is unable to fairly evaluate summaries including lexical variations and paraphrasing. We propose a graph-based approach adopted into ROUGE to evaluate summaries based on both lexical and semantic similarities. Experimental results over TAC AESOP datasets show that exploiting the lexico-semantic similarity of t…
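The abstract proposes combining lexical and semantic evidence when matching peer and model summaries. As a rough illustration of that general idea only (not the paper's graph-theoretic algorithm), the sketch below relaxes ROUGE's exact unigram matching so that two tokens also count as a match when they share a WordNet synset; it assumes NLTK with the WordNet corpus installed, and the function names are made up for this example.

```python
# Illustrative sketch only (not the paper's graph-theoretic method):
# a ROUGE-like unigram recall where a reference token is credited if the
# peer summary contains either the same token or a WordNet-related one.
# Requires: pip install nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn


def tokens_related(a, b):
    """True if the tokens are identical or share at least one WordNet synset."""
    if a == b:
        return True
    return bool(set(wn.synsets(a)) & set(wn.synsets(b)))


def soft_unigram_recall(peer_tokens, ref_tokens):
    """Fraction of reference tokens matched lexically or semantically in the peer."""
    if not ref_tokens:
        return 0.0
    hits = sum(any(tokens_related(r, p) for p in peer_tokens) for r in ref_tokens)
    return hits / len(ref_tokens)


# "automobile" and "car" share a synset, so the paraphrase is still credited.
print(soft_unigram_recall(["the", "automobile", "stopped"],
                          ["the", "car", "halted"]))
```

Plain ROUGE-1 would penalize substituting "automobile" for "car"; a lexico-semantic matcher of this kind does not, which is the behaviour the paper targets.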

Cited by 24 publications (16 citation statements)
References 10 publications
“…Existing work often limits model comparisons to only a few baselines and offers human evaluations which are largely inconsistent with prior work. Additionally, despite problems associated with ROUGE when used outside of its original setting (Liu and Liu, 2008; Cohan and Goharian, 2016), as well as the introduction of many variations on ROUGE (Zhou et al., 2006; Ng and Abrecht, 2015; Ganesan, 2015; ShafieiBavani et al., 2018) and other text generation metrics (Peyrard, 2019; Zhao et al., 2019; Zhang et al., 2020; Scialom et al., 2019; Clark et al., 2019), ROUGE has remained the default automatic evaluation metric. We believe that the shortcomings of the current evaluation protocol are partially caused by the lack of easy-to-use resources for evaluation, both in the form of simplified evaluation toolkits and large collections of model outputs.…”
Section: Introduction
confidence: 99%
“…In recent years, more ROUGE-based evaluation models have been proposed that compare gold summaries and machine-generated summaries not just according to literal similarity but also considering semantic similarity [115, 149, 154].…”
Section: ROUGE-S and ROUGE-SU (ROUGE-S [72] stands for ROUGE with skip...)
confidence: 99%
“…It is thus not ideal for model training. Recently, some works have extended ROUGE with WordNet [115] or pretrained language models [150], aiming to alleviate these drawbacks. However, it remains challenging to propose evaluation indicators that reflect the true quality of generated summaries as comprehensively and semantically as human raters.…”
Section: Improving Evaluation Metrics for Multi-Document Summarization
confidence: 99%
“…The widely accepted metric is ROUGE (Lin, 2004), which focuses primarily on n-gram co-occurrence statistics. Some strategies have been proposed to replace the “hard matching” of ROUGE, such as the adoption of WordNet (ShafieiBavani et al., 2018) and the fusion of ROUGE and word2vec (Ng and Abrecht, 2015). Another promising method of designing metrics is to directly compute the semantic similarity of the peer and reference summary, including metrics utilizing various word embeddings such as ELMo (Sun and Nenkova, 2019) and BERT (Zhang et al., 2019; Zhao et al., 2019).…”
Section: Related Work
confidence: 99%
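For contrast with the semantic extensions discussed in these citation statements, the following minimal sketch shows the n-gram co-occurrence recall that ROUGE-N (Lin, 2004) is built on, simplified to a single reference summary with no stemming or stopword removal; the function names are illustrative.

```python
# Minimal sketch of the n-gram co-occurrence recall behind ROUGE-N (Lin, 2004).
# Simplifications: one reference summary, no stemming or stopword removal.
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(peer_tokens, ref_tokens, n=2):
    """Clipped count of reference n-grams found in the peer, over all reference n-grams."""
    ref = ngram_counts(ref_tokens, n)
    peer = ngram_counts(peer_tokens, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, peer[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())


# 3 of the 5 reference bigrams also occur in the peer, so recall is 0.6.
print(rouge_n_recall("the cat sat on the mat".split(),
                     "the cat lay on the mat".split(), n=2))
```

Because this score counts only exact surface overlaps, paraphrases such as "lay"/"sat" receive no credit, which is precisely the “hard matching” limitation that the cited works, including the paper summarized here, try to relax.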