Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

Clark, Elizabeth; Çelikyılmaz, Aslı; Smith, Noah A.

doi:10.18653/v1/p19-1264

Cited by 127 publications

(114 citation statements)

References 39 publications

Supporting

Mentioning

105

Contrasting


“…The ASAP-SAS data is also broken down by topic. The current state-of-the-art model only evaluates on topic #3: specifically, on essays from topic #3 that contain 5 to 15 sentences (Clark et al, 2019). Therefore, to allow for a fair comparison, we also draw test sentences from this subset.…”

Section: Grading Essays (mentioning)

confidence: 99%

“…Since BLEU Neighbors does not use references, it is at a disadvantage compared to approaches that do, such as ROUGE-L. Excluding ROUGE-L, all the models we list in Table 3 are optimal transport methods that leverage text embeddings (Clark et al, 2019 Despite not being given the gold-standard reference, when BLEU Neighbors is trained with sample essays from Topic #8, it achieves a new state-of-the-art: a Spearman's ρ of 0.500 between its predicted scores and the ground-truth quality judgments. However, due to the small amount of test data, this improvement over the state-of-the-art is not statistically significant at p < 0.01 when using a Williams test.…”

Section: Automated Essay Grading (mentioning)

confidence: 99%


BLEU Neighbors: A Reference-less Approach to Automatic Evaluation

Ethayarajh, Sadigh

2020

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Evaluation is a bottleneck in the development of natural language generation (NLG) models. Automatic metrics such as BLEU rely on references, but for tasks such as open-ended generation, there are no references to draw upon. Although language diversity can be estimated using statistical measures such as perplexity, measuring language quality requires human evaluation. However, because human evaluation at scale is slow and expensive, it is used sparingly; it cannot be used to rapidly iterate on NLG models, in the way BLEU is used for machine translation. To this end, we propose BLEU Neighbors, a nearest neighbors model for estimating language quality by using the BLEU score as a kernel function. On existing datasets for chitchat dialogue and open-ended sentence generation, we find that, on average, the quality estimation from a BLEU Neighbors model has a lower mean squared error and higher Spearman correlation with the ground truth than individual human annotators. Despite its simplicity, BLEU Neighbors even outperforms state-of-the-art models on automatically grading essays, including models that have access to a gold-standard reference essay.
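The idea of a nearest neighbors model with BLEU as the kernel can be illustrated with a minimal sketch. This is not the authors' implementation: `simple_bleu` is a simplified, smoothed stand-in for BLEU (unigrams and bigrams only), and `predict_quality`, the choice of `k`, and the plain averaging of neighbor labels are all assumptions made here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision)


def predict_quality(text, labeled_examples, k=2):
    """Hypothetical helper: average the quality labels of the k labeled
    texts most similar to `text` under the simplified BLEU kernel."""
    ranked = sorted(labeled_examples,
                    key=lambda ex: simple_bleu(text, ex[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy labeled data (invented for this sketch): (text, human quality score).
examples = [
    ("the cat sat on the mat", 5.0),
    ("cat the on sat mat mat", 1.0),
    ("completely unrelated words here", 3.0),
]
estimate = predict_quality("the cat sat on the mat", examples, k=1)
```

No reference text is needed at prediction time in this sketch; the estimate comes only from previously labeled examples, which matches the reference-less framing of the abstract.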


“…Earlier metrics like BLEU and ROUGE (Papineni et al, 2002; Lin, 2004) considered n-gram agreement. Later metrics matched words in the two texts using their word embeddings (Lo, 2017; Clark et al, 2019). More recently, contextual similarity measures were devised for this purpose (Lo, 2019; Wieting et al, 2019; Zhao et al, 2019; Zhang et al, 2020; Sellam et al, 2020).…”

Section: Generation Evaluation (mentioning)

confidence: 99%

“…In this analysis, we focus on potential alternative evaluation measures. As mentioned in §2, a possible direction for solving issues in evaluation of sentence fusion, stemming from having a single reference, could be to use similarity-based evaluation metrics (Sellam et al, 2020; Kusner et al, 2015; Clark et al, 2019; Zhang et al, 2020). We notice two limitations in applying such metrics for sentence fusion.…”

Section: Ablation Analysis (mentioning)

confidence: 99%

Semantically Driven Sentence Fusion: Modeling and Evaluation

Ben-David, Keller, Malmi, et al. 2020

Findings of the Association for Computational Linguistics: EMNLP 2020

Sentence fusion is the task of joining related sentences into coherent text. Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants. We show that this hinders models from robustly capturing the semantic relationship between input sentences. To alleviate this, we present an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases. We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation. To improve the learning of semantic representation using multiple references, we enrich the model with auxiliary discourse classification tasks under a multi-tasking framework. Our experiments highlight the improvements of our approach over state-of-the-art models.


“…These approaches bear limitations in dealing with the text's diverse nature, similarly found in other text generation tasks (e.g., abstractive summarization and dialog) (Kryscinski et al, 2019; Liu et al, 2016). To alleviate the issues in the n-gram based approaches, researchers proposed word embedding-based techniques (Kusner et al, 2015; Zhao et al, 2019; Lo, 2019; Clark et al, 2019). These techniques show robust performance and achieve higher correlation with human judgment than that of other previous metrics in many text generation tasks, including image captioning.…”

Section: Introduction (mentioning)

confidence: 99%

ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT

Lee, Yoon, Dernoncourt, et al. 2020

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

In this paper, we propose an evaluation metric for image captioning systems using both image and text information. Unlike the previous methods that rely on textual representations in evaluating the caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token using ViLBERT from both generated and reference texts. Then, these contextual embeddings from each of the two sentence-pair are compared to compute the similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.
