Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.210
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Abstract: Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatic…

Cited by 18 publications (15 citation statements)
References 20 publications
“…To better assess the performance of a captioning system, it is common practice to consider a set of the above-mentioned standard metrics. Nevertheless, these are somehow gameable because they favor word similarity rather than meaning correctness [138]. Another drawback of the standard metrics is that they do not capture (but rather disfavor) the desirable capability of the system to produce novel and diverse captions, which is more in line with the variability with which humans describe complex images.…”
Section: Diversity Metrics (mentioning)
confidence: 99%
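The gameability noted in the statement above can be illustrated with an n-gram overlap metric such as BLEU: a hypothesis that swaps a single content word (and so changes the meaning) can outscore a correct paraphrase that shares little vocabulary with the reference. A minimal sketch using NLTK's sentence_bleu, with hypothetical captions:

```python
# Minimal sketch (hypothetical captions): n-gram overlap metrics such as BLEU
# reward word similarity, not meaning correctness.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man is riding a brown horse on the beach".split()]
wrong_meaning = "a man is riding a brown dog on the beach".split()            # one word changed, meaning wrong
correct_paraphrase = "someone rides a chestnut mare along the shore".split()  # meaning right, few shared words

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, wrong_meaning, smoothing_function=smooth))       # high score
print(sentence_bleu(reference, correct_paraphrase, smoothing_function=smooth))  # near zero
```

The single-word substitution keeps most n-gram matches and scores far higher than the paraphrase, even though only the paraphrase preserves the meaning.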
“…Recent work has shown that other metrics, such as diversity of outputs, are important for evaluating the quality of LMs as models for language generation (Hashimoto et al., 2019; Caccia et al., 2020). Generation also depends on a number of other factors, such as choice of decoding procedure (Caglayan et al., 2020). Here, we focus on LMs as predictive models, measuring their ability to place an accurate distribution over future words and sentences, rather than their ability to generate useful or coherent text (see Appendix C).…”
Section: |N (mentioning)
confidence: 99%
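The dependence on the decoding procedure mentioned in this statement can be sketched with a toy next-token distribution: greedy decoding collapses to a single output, while sampling (optionally temperature-scaled) yields varied ones. The vocabulary and probabilities below are hypothetical:

```python
# Minimal sketch (toy next-token distribution, hypothetical vocabulary):
# the same model probabilities give very different outputs depending on decoding.
import random

probs = {"horse": 0.40, "dog": 0.35, "mare": 0.20, "zebra": 0.05}

def greedy(p):
    # always returns the argmax token -> no diversity across runs
    return max(p, key=p.get)

def sample(p, temperature=1.0):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = {w: v ** (1.0 / temperature) for w, v in p.items()}
    total = sum(scaled.values())
    return random.choices(list(scaled), weights=[v / total for v in scaled.values()])[0]

print([greedy(probs) for _ in range(5)])                    # ['horse', 'horse', ...]
print([sample(probs, temperature=1.0) for _ in range(5)])   # varied tokens
```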
“…For open-ended text generation tasks like answering why-questions, the absence of an automatic evaluation that correlates well with human judgments is a major challenge (Chen et al., 2019; Ma et al., 2019; Caglayan et al., 2020; Howcroft et al., 2020).…”
Section: Human Evaluation (mentioning)
confidence: 99%
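"Correlates well with human judgments" is typically quantified as a rank correlation between metric scores and human ratings over the same outputs. A minimal sketch using scipy's spearmanr; the scores and ratings below are made-up placeholders:

```python
# Minimal sketch (made-up scores): measuring how well an automatic metric
# tracks human judgments via Spearman rank correlation.
from scipy.stats import spearmanr

metric_scores = [0.31, 0.45, 0.12, 0.58, 0.40]   # e.g. per-output BLEU (hypothetical)
human_ratings = [3.0, 4.5, 2.0, 3.5, 4.0]        # e.g. 1-5 adequacy ratings (hypothetical)

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```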