2013
DOI: 10.1162/coli_a_00123

Automatically Assessing Machine Summary Content Without a Gold Standard

Abstract: The most widely adopted approaches for evaluation of summary content follow some protocol for comparing a summary with gold-standard human summaries, which are traditionally called model summaries. This evaluation paradigm falls short when human summaries are not available and becomes less accurate when only a single model is available. We propose three novel evaluation techniques. Two of them are model-free and do not rely on a gold standard for the assessment. The third technique improves standard automatic …

Cited by 133 publications (115 citation statements)
References 28 publications

“…Others have used distribution-similarity measures such as Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence [21,43], textual entailment [44], and crowdsourcing-based LSA [18] for evaluating summaries. However, relatively few studies have used machine-learning techniques for summary evaluation beyond the aforementioned regression-based approaches [45][46][47].…”
Section: Automated Summary Evaluation (mentioning)
Confidence: 99%
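The KL and JS divergences mentioned above compare the word distribution of a summary against that of its input, so no human reference summary is needed. Below is a minimal sketch, assuming smoothed unigram distributions; the helper names and toy texts are illustrative, and real toolkits such as SIMetrix add stemming, stopword handling, and further smoothing on top of this idea.

```python
# Minimal sketch: Jensen-Shannon divergence between the unigram
# distribution of a summary and that of its source text.
import math
from collections import Counter

def unigram_dist(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Add-one smoothing over a shared vocabulary keeps every log term finite.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w], 2) for w in p)

def js_divergence(text_a, text_b):
    vocab = set(text_a.lower().split()) | set(text_b.lower().split())
    p, q = unigram_dist(text_a, vocab), unigram_dist(text_b, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source = "the storm caused heavy flooding across several coastal towns on monday"
summary = "a storm flooded several coastal towns"
print(f"JS(summary, source) = {js_divergence(summary, source):.3f}")  # lower = more similar
```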
“…In recent years, the research community has been successful in developing various measures for evaluating summaries. Some of the automated summary evaluation tools include Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [20], ParaEval, Summary Input similarity Metrics (SIMetrix) [21], QARLA [22], and the SEMantic similarity toolkit (SEMILAR) [23].…”
Section: Introduction (mentioning)
Confidence: 99%
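Of the tools listed, ROUGE remains the most widely used; at its core it measures n-gram recall against a reference. The sketch below shows ROUGE-1 recall with clipped counts on toy strings only; the official toolkit adds stemming, stopword options, other n-gram orders, and multi-reference handling.

```python
# Illustrative core of ROUGE-1 recall: the fraction of reference unigrams
# that also appear in the candidate, with clipped counts.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

print(rouge1_recall("a storm flooded several coastal towns",
                    "the storm caused flooding in several coastal towns"))
```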
“…This is due to the corpus size, the diversity of event-related topics, and the limited availability of domain experts. To alleviate this issue, we followed the distribution-similarity approach, which has been widely applied in the automatic generation of gold standards (GSs) for summary evaluations (Donaway et al., 2000; Lin et al., 2006; Louis and Nenkova, 2009; Louis and Nenkova, 2013). This approach compares a corpus for which no GS labels exist against a reference corpus for which a GS exists.…”
Section: Generating the Gold Standard (mentioning)
Confidence: 99%
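A rough sketch of that corpus-level comparison, under the assumption that each corpus is reduced to an aggregate unigram distribution: if the Jensen-Shannon distance between the unlabeled corpus and the reference corpus is small, the reference gold standard is taken to be transferable. The document lists below are hypothetical.

```python
# Sketch: compare the aggregate unigram distribution of an unlabeled corpus
# against that of a reference corpus that already has a gold standard.
# scipy's jensenshannon returns the JS distance (sqrt of the divergence).
from collections import Counter
from scipy.spatial.distance import jensenshannon

def corpus_dist(docs, vocab):
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values()) + len(vocab)        # add-one smoothing
    return [(counts[w] + 1) / total for w in vocab]

unlabeled = ["storm floods several coastal towns", "residents evacuate before landfall"]
reference = ["hurricane flooding forces evacuations", "coastal towns brace for the storm"]

vocab = sorted({w for d in unlabeled + reference for w in d.lower().split()})
distance = jensenshannon(corpus_dist(unlabeled, vocab), corpus_dist(reference, vocab), base=2)
print(f"JS distance between corpora: {distance:.3f}")  # near 0 => very similar corpora
```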
“…This approach is similar to using system outputs as pseudo-models for the evaluation of machine-translation or automatic-summarization systems (cf. Louis and Nenkova, 2013). It has also been successfully applied to the content assessment of written answers by Madnani et al. (2013), who used one randomly selected highly scored summary as a reference summary.…”
Section: Computation of the Metrics (mentioning)
Confidence: 99%
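A small sketch of that pseudo-model idea, with a toy unigram-overlap score standing in for a full evaluation metric: one randomly chosen highly scored response serves as the reference, and the remaining responses are scored against it. The response texts and scores are hypothetical.

```python
# Sketch of the pseudo-reference setup: pick one highly scored response at
# random, treat it as the reference, and score the other responses against it.
import random
from collections import Counter

def unigram_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum(min(c, cand[w]) for w, c in ref.items()) / sum(ref.values())

responses = {                     # hypothetical (text, human score) pairs
    "r1": ("the storm flooded several coastal towns", 4),
    "r2": ("a storm hit the coast and several towns flooded", 4),
    "r3": ("it rained a lot somewhere", 2),
}

pseudo_reference = random.choice([t for t, score in responses.values() if score == 4])
for rid, (text, _) in responses.items():
    if text != pseudo_reference:
        print(rid, round(unigram_recall(text, pseudo_reference), 3))
```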
“…Since previous work on summarization evaluation showed that multiple summaries increase the reliability of evaluations (Louis and Nenkova, 2013; Nenkova and McKeown, 2011), we tested how many summaries were necessary to achieve consistent results. We therefore computed ROUGE for each response using up to 10 randomly selected responses with a final score of 4.…”
Section: Computation of the Metrics (mentioning)
Confidence: 99%
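A sketch of that consistency check, under the assumption that multi-reference scoring takes the maximum single-reference score (one common convention; the official ROUGE toolkit also supports jackknifing): score one candidate against the first k of 10 shuffled high-scoring responses and watch how the score settles as references are added. All data below are toy.

```python
# Sketch: how stable is the score as more pseudo-references are added?
import random
from collections import Counter

def unigram_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum(min(c, cand[w]) for w, c in ref.items()) / sum(ref.values())

def multi_ref_score(candidate, references):
    # Max over references: one common multi-reference convention.
    return max(unigram_recall(candidate, r) for r in references)

candidate = "a storm flooded several coastal towns overnight"
pool = [f"the storm flooded {n} coastal towns during the night" for n in range(1, 11)]

random.shuffle(pool)
for k in range(1, len(pool) + 1):
    print(f"k={k:2d}  score={multi_ref_score(candidate, pool[:k]):.3f}")
```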