2022
DOI: 10.48550/arxiv.2203.11131
Preprint

Towards Explainable Evaluation Metrics for Natural Language Generation

Abstract: Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are transparent. To foster more widespread acceptance of the novel high-quality metrics, explainability thus becomes crucial. […]
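The contrast the abstract draws can be made concrete with off-the-shelf implementations. Below is a minimal sketch, assuming the sacrebleu and bert-score Python packages and two made-up example sentences (none of which are prescribed by the paper): it scores the same hypothesis against a reference with a transparent lexical-overlap metric (BLEU) and with a model-based, black-box metric (BERTScore).

```python
# Minimal sketch contrasting a lexical-overlap metric (BLEU) with a
# model-based metric (BERTScore). The package choice (sacrebleu, bert-score)
# and the example sentences are assumptions, not specified by the paper.
import sacrebleu
from bert_score import score

hypotheses = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# BLEU: n-gram overlap against the reference; its decision process is
# transparent and can be inspected by counting matching n-grams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore: similarity of contextual BERT token embeddings; it typically
# correlates better with human judgments, but the underlying model is opaque.
P, R, F1 = score(hypotheses, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```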

Cited by 2 publications (3 citation statements)
References 115 publications

“…Recent demand for explainability in evaluation metrics has grown significantly. Freitag et al. (2021a) introduce a multi-dimensional human evaluation (MQM) framework for machine translation, while Leiter et al. (2022) investigate key characteristics of explainable metrics. Several metrics derived from those frameworks enhance explainability by differentiating error severity (Xu et al., 2022b,a; Perrella et al., 2022).…”
Section: Related Work
mentioning confidence: 99%
“…To evaluate the closeness (or similarity) between machine translation texts and human reference texts, the BLEU metric was proposed [10]. The machine translation community has been using this metric frequently to compare different translation systems.…”
Section: Related Work
mentioning confidence: 99%
“…There are other non-trained automatic evaluation metrics, such as BLEU, that evaluate the similarity between machine translation texts and human reference texts [10]. Those metrics can serve to evaluate the novelty of machine translation texts relative to human reference texts.…”
Section: Introduction
mentioning confidence: 99%