2022
DOI: 10.48550/arxiv.2203.11131
Preprint

Towards Explainable Evaluation Metrics for Natural Language Generation

Abstract: Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are transparent. To foster more widespread acceptance of the novel high-quality metrics, explainability thus becomes crucial. […]
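The contrast the abstract draws can be made concrete with off-the-shelf implementations. Below is a minimal sketch, assuming the sacrebleu and bert-score Python packages and two made-up example sentences (none of which are prescribed by the paper): it scores the same hypothesis against a reference with a transparent lexical-overlap metric (BLEU) and with a model-based, black-box metric (BERTScore).

```python
# Minimal sketch contrasting a lexical-overlap metric (BLEU) with a
# model-based metric (BERTScore). The package choice (sacrebleu, bert-score)
# and the example sentences are assumptions, not specified by the paper.
import sacrebleu
from bert_score import score

hypotheses = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BLEU: n-gram overlap against the reference; its decision process is
# transparent and can be inspected by counting matching n-grams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore: similarity of contextual BERT token embeddings; it typically
# correlates better with human judgments, but the underlying model is opaque.
P, R, F1 = score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```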

Cited by 2 publications (3 citation statements)
References 115 publications

“…Recent demand for explainability in evaluation metrics has grown significantly. Freitag et al. (2021a) introduce a multi-dimensional human evaluation (MQM) framework for machine translation, while Leiter et al. (2022) investigate key characteristics of explainable metrics. Several metrics derived from those frameworks enhance explainability by differentiating error severity (Xu et al., 2022b,a; Perrella et al., 2022).…”
Section: Related Work
mentioning confidence: 99%
“…To evaluate the closeness (or similarity) between machine translation texts and human reference texts, the BLEU metric was proposed [10]. The machine translation community has been using this metric frequently to compare different translation systems.…”
Section: Related Work
mentioning confidence: 99%
“…There are other non-trained automatic evaluation metrics, such as BLEU, that evaluate the similarity between machine translation texts and human reference texts [10]. Those metrics can serve to evaluate the novelty of machine translation texts relative to human reference texts.…”
Section: Introduction
mentioning confidence: 99%