2013
DOI: 10.1162/coli_a_00123

Automatically Assessing Machine Summary Content Without a Gold Standard

Abstract: The most widely adopted approaches for evaluation of summary content follow some protocol for comparing a summary with gold-standard human summaries, which are traditionally called model summaries. This evaluation paradigm falls short when human summaries are not available and becomes less accurate when only a single model is available. We propose three novel evaluation techniques. Two of them are model-free and do not rely on a gold standard for the assessment. The third technique improves standard automatic …

Cited by 133 publications (115 citation statements)
References 28 publications

“…Others have used distribution-similarity measures such as Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence [21,43], textual entailment [44], and crowdsourcing-based LSA [18] for evaluating summaries. However, relatively few studies have used machine-learning techniques for summary evaluation beyond the aforementioned regression-based approaches [45][46][47].…”
Section: Automated Summary Evaluation (mentioning)
Confidence: 99%
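The KL and JS divergences mentioned above compare the word distribution of a summary against that of its input, so no human reference summary is needed. Below is a minimal sketch, assuming smoothed unigram distributions; the helper names and toy texts are illustrative, and real toolkits such as SIMetrix add stemming, stopword handling, and further smoothing on top of this idea.

```python
# Minimal sketch: Jensen-Shannon divergence between the unigram
# distribution of a summary and that of its source text.
import math
from collections import Counter

def unigram_dist(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Add-one smoothing over a shared vocabulary keeps every log term finite.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w], 2) for w in p)

def js_divergence(text_a, text_b):
    vocab = set(text_a.lower().split()) | set(text_b.lower().split())
    p, q = unigram_dist(text_a, vocab), unigram_dist(text_b, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

source = "the storm caused heavy flooding across several coastal towns on monday"
summary = "a storm flooded several coastal towns"
print(f"JS(summary, source) = {js_divergence(summary, source):.3f}")  # lower = more similar
```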
“…In recent years, the research community has been successful in developing various measures for evaluating summaries. Some of the automated summary evaluation tools include Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [20], ParaEval, Summary Input similarity Metrics (SIMetrix) [21], QARLA [22], and the SEMantic similarity toolkit (SEMILAR) [23].…”
Section: Introduction (mentioning)
Confidence: 99%
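Of the tools listed, ROUGE remains the most widely used; at its core it measures n-gram recall against a reference. The sketch below shows ROUGE-1 recall with clipped counts on toy strings only; the official toolkit adds stemming, stopword options, other n-gram orders, and multi-reference handling.

```python
# Illustrative core of ROUGE-1 recall: the fraction of reference unigrams
# that also appear in the candidate, with clipped counts.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

print(rouge1_recall("a storm flooded several coastal towns",
                    "the storm caused flooding in several coastal towns"))
```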
“…This is due to the corpus size, the diversity of event-related topics, and the limited availability of domain experts. To alleviate this issue, we followed the distribution-similarity approach, which has been widely applied in the automatic generation of gold standards (GSs) for summary evaluations (Donaway et al., 2000; Lin et al., 2006; Louis and Nenkova, 2009; Louis and Nenkova, 2013). This approach compares a corpus for which no GS labels exist against a reference corpus for which a GS exists.…”
Section: Generating the Gold Standard (mentioning)
Confidence: 99%
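A rough sketch of that corpus-level comparison, under the assumption that each corpus is reduced to an aggregate unigram distribution: if the Jensen-Shannon distance between the unlabeled corpus and the reference corpus is small, the reference gold standard is taken to be transferable. The document lists below are hypothetical.

```python
# Sketch: compare the aggregate unigram distribution of an unlabeled corpus
# against that of a reference corpus that already has a gold standard.
# scipy's jensenshannon returns the JS distance (sqrt of the divergence).
from collections import Counter
from scipy.spatial.distance import jensenshannon

def corpus_dist(docs, vocab):
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values()) + len(vocab)        # add-one smoothing
    return [(counts[w] + 1) / total for w in vocab]

unlabeled = ["storm floods several coastal towns", "residents evacuate before landfall"]
reference = ["hurricane flooding forces evacuations", "coastal towns brace for the storm"]

vocab = sorted({w for d in unlabeled + reference for w in d.lower().split()})
distance = jensenshannon(corpus_dist(unlabeled, vocab), corpus_dist(reference, vocab), base=2)
print(f"JS distance between corpora: {distance:.3f}")  # near 0 => very similar corpora
```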
“…This approach is similar to using system outputs as pseudo-models for the evaluation of machine-translation or automatic-summarization systems (cf. Louis and Nenkova, 2013). It has also been successfully applied to the content assessment of written answers by Madnani et al. (2013), who used one randomly selected highly scored summary as a reference summary.…”
Section: Computation of the Metrics (mentioning)
Confidence: 99%
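A small sketch of that pseudo-model idea, with a toy unigram-overlap score standing in for a full evaluation metric: one randomly chosen highly scored response serves as the reference, and the remaining responses are scored against it. The response texts and scores are hypothetical.

```python
# Sketch of the pseudo-reference setup: pick one highly scored response at
# random, treat it as the reference, and score the other responses against it.
import random
from collections import Counter

def unigram_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum(min(c, cand[w]) for w, c in ref.items()) / sum(ref.values())

responses = {                     # hypothetical (text, human score) pairs
    "r1": ("the storm flooded several coastal towns", 4),
    "r2": ("a storm hit the coast and several towns flooded", 4),
    "r3": ("it rained a lot somewhere", 2),
}

pseudo_reference = random.choice([t for t, score in responses.values() if score == 4])
for rid, (text, _) in responses.items():
    if text != pseudo_reference:
        print(rid, round(unigram_recall(text, pseudo_reference), 3))
```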
“…Since previous work on summarization evaluation showed that multiple summaries increase the reliability of evaluations (Louis and Nenkova, 2013; Nenkova and McKeown, 2011), we tested how many summaries were necessary to achieve consistent results. We therefore computed ROUGE for each response using up to 10 randomly selected responses with a final score of 4.…”
Section: Computation of the Metrics (mentioning)
Confidence: 99%
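A sketch of that consistency check, under the assumption that multi-reference scoring takes the maximum single-reference score (one common convention; the official ROUGE toolkit also supports jackknifing): score one candidate against the first k of 10 shuffled high-scoring responses and watch how the score settles as references are added. All data below are toy.

```python
# Sketch: how stable is the score as more pseudo-references are added?
import random
from collections import Counter

def unigram_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum(min(c, cand[w]) for w, c in ref.items()) / sum(ref.values())

def multi_ref_score(candidate, references):
    # Max over references: one common multi-reference convention.
    return max(unigram_recall(candidate, r) for r in references)

candidate = "a storm flooded several coastal towns overnight"
pool = [f"the storm flooded {n} coastal towns during the night" for n in range(1, 11)]

random.shuffle(pool)
for k in range(1, len(pool) + 1):
    print(f"k={k:2d}  score={multi_ref_score(candidate, pool[:k]):.3f}")
```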