2021
DOI: 10.1609/aaai.v35i16.17672

Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric

Abstract: The rapid development of natural language processing tasks such as style transfer, paraphrasing, and machine translation often calls for semantic similarity metrics. In recent years, many methods for measuring the semantic similarity of two short texts have been developed. This paper provides a comprehensive analysis of more than a dozen such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics wid…

Cited by 14 publications (4 citation statements)
References 20 publications (17 reference statements)
“…It is worth mentioning that Yamshchikov et al. (2021) showed in a recent study that fastText and Word2Vec pre-trained embedding vectors should not be used to evaluate text style transfer approaches in terms of content preservation. They demonstrated that such evaluation pipelines predict content preservation inaccurately when measured against human judgements.…”
Section: Methods (mentioning)
confidence: 99%
“…To evaluate fluency and content preservation, Hu et al. (2022) introduced metrics such as perplexity (PPL), style transfer accuracy (ACC), Word Overlap (WO), and self-BLEU, further summarizing these with a geometric mean (G-Score) and a harmonic mean (H-Score). Conversely, Yamshchikov et al. (2021) argued against the use of fastText and Word2Vec embeddings for evaluating content preservation in text style transfer.…”
Section: Related Work (mentioning)
confidence: 99%
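The aggregation described in the statement above can be illustrated with a minimal sketch. The word-overlap definition and the example scores here are hypothetical simplifications for illustration, not the exact formulations of Hu et al. (2022):

```python
def word_overlap(src: str, out: str) -> float:
    """Toy WO: fraction of unique source words preserved in the output."""
    s, o = set(src.lower().split()), set(out.lower().split())
    return len(s & o) / len(s) if s else 0.0

def g_score(scores):
    """Geometric mean of individual metric scores."""
    prod = 1.0
    for x in scores:
        prod *= x
    return prod ** (1.0 / len(scores))

def h_score(scores):
    """Harmonic mean of individual metric scores."""
    return len(scores) / sum(1.0 / x for x in scores)

wo = word_overlap("the movie was terrible", "the movie was wonderful")
acc = 0.9  # hypothetical style-transfer accuracy from a classifier
print(round(wo, 2), round(g_score([wo, acc]), 2), round(h_score([wo, acc]), 2))
# → 0.75 0.82 0.82
```

The harmonic mean penalizes an imbalance between the two scores more strongly than the geometric mean, which is why both are often reported side by side.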
“…Mir et al. [41] constructed style-specific dictionaries to remove or mask style-related words in order to address this shortcoming of BLEU, while Pang et al. [19] argued that for complex tasks the masking process is not conducive to retaining content or semantic information. Yamshchikov et al. [44] argue that none of the current measures used for semantic similarity assessment is consistent with human understanding of semantic similarity, and that computing content preservation with Word Mover's Distance (WMD) correlates best with human evaluation.…”
Section: Automatic Evaluation (mentioning)
confidence: 99%
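Word Mover's Distance measures the minimum cumulative distance the embedded words of one text must travel to match those of another. A toy sketch under strong simplifying assumptions (equal-length texts, uniform word weights, and tiny hand-made 2-d embeddings, all hypothetical), in which optimal transport reduces to a minimum-cost one-to-one matching:

```python
from itertools import permutations
from math import dist

# Hypothetical 2-d "embeddings" for illustration only.
emb = {
    "obama": (1.0, 0.0), "speaks": (0.0, 1.0),
    "president": (0.9, 0.1), "talks": (0.1, 0.9),
}

def toy_wmd(doc_a, doc_b):
    """Brute-force min-cost matching between words; coincides with WMD
    under uniform word weights and equal document lengths."""
    assert len(doc_a) == len(doc_b)
    best = min(
        sum(dist(emb[a], emb[doc_b[j]]) for a, j in zip(doc_a, perm))
        for perm in permutations(range(len(doc_b)))
    )
    return best / len(doc_a)

# Paraphrases land close together despite sharing no surface words.
print(round(toy_wmd(["obama", "speaks"], ["president", "talks"]), 3))  # → 0.141
```

This word-level, position-free view is what lets WMD reward paraphrases that BLEU-style n-gram overlap would score at zero; practical implementations use pre-trained embeddings and an optimal-transport solver rather than brute-force matching.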
“…Our framework continues this line of research to produce interpretable metrics for multiple aspects. While recent evaluation frameworks each discussed the key evaluation aspects of one NLG task (Venkatesh et al., 2018; Mir et al., 2019; Yamshchikov et al., 2020; Fabbri et al., 2021), our framework provides a unified methodology that facilitates metric design for all three main categories of tasks. We also highlight that all of the metrics (except the relevance metric for summarization) are reference-free once trained.…”
Section: Related Work (mentioning)
confidence: 99%