Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1264

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

Abstract: For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover's similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human…

Cited by 127 publications (114 citation statements)
References 39 publications
“…The ASAP-SAS data is also broken down by topic. The current state-of-the-art model only evaluates on topic #3 - specifically, on essays from topic #3 that contain 5 to 15 sentences (Clark et al, 2019). Therefore, to allow for a fair comparison, we also draw test sentences from this subset.…”
Section: Grading Essays (mentioning)
confidence: 99%
“…Since BLEU Neighbors does not use references, it is at a disadvantage compared to approaches that do, such as ROUGE-L. Excluding ROUGE-L, all the models we list in Table 3 are optimal transport methods that leverage text embeddings (Clark et al, 2019 Despite not being given the gold-standard reference, when BLEU Neighbors is trained with sample essays from Topic #8, it achieves a new state-of-the-art: a Spearman's ρ of 0.500 between its predicted scores and the ground-truth quality judgments. However, due to the small amount of test data, this improvement over the state-of-the-art is not statistically significant at p < 0.01 when using a Williams test.…”
Section: Automated Essay Grading (mentioning)
confidence: 99%
“…Earlier metrics like BLEU and ROUGE (Papineni et al, 2002; Lin, 2004) considered n-gram agreement. Later metrics matched words in the two texts using their word embeddings (Lo, 2017; Clark et al, 2019). More recently, contextual similarity measures were devised for this purpose (Lo, 2019; Wieting et al, 2019; Zhao et al, 2019; Zhang et al, 2020; Sellam et al, 2020).…”
Section: Generation Evaluation (mentioning)
confidence: 99%
“…In this analysis, we focus on potential alternative evaluation measures. As mentioned in §2, a possible direction for solving issues in evaluation of sentence fusion, stemming from having a single reference, could be to use similarity-based evaluation metrics (Sellam et al, 2020; Kusner et al, 2015; Clark et al, 2019; Zhang et al, 2020). We notice two limitations in applying such metrics for sentence fusion.…”
Section: Ablation Analysis (mentioning)
confidence: 99%
“…These approaches bear limitations in dealing with the text's diverse nature, similarly found in other text generation tasks (e.g., abstractive summarization and dialog) (Kryscinski et al, 2019; Liu et al, 2016). To alleviate the issues in the n-gram based approaches, researchers proposed word embedding-based techniques (Kusner et al, 2015; Zhao et al, 2019; Lo, 2019; Clark et al, 2019). These techniques show robust performance and achieve higher correlation with human judgment than that of other previous metrics in many text generation tasks, including image captioning.…”
Section: Introduction (mentioning)
confidence: 99%