Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014.
DOI: 10.3115/v1/p14-2074

Comparing Automatic Evaluation Measures for Image Description

Abstract: Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Mete…
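
A minimal sketch of the kind of correlation analysis the abstract describes, assuming NLTK and SciPy are available. The data, variable names, and choice of smoothing below are illustrative only and are not taken from the paper; the idea is simply to score each candidate description against its references with a sentence-level metric and then correlate those scores with human ratings (here with Spearman's rho, one common choice for validating a metric).

# Illustrative sketch, not the authors' code: correlate an automatic
# metric with human judgements of image description quality.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Toy data: one candidate description per image, several references per
# image, and a human quality rating for each candidate (all invented).
candidates = [
    "a dog runs across the grass",
    "two people sitting on a bench",
    "a man rides a red bicycle",
]
references = [
    ["a brown dog is running on the grass", "a dog runs through a field"],
    ["two people sit on a park bench", "a couple resting on a wooden bench"],
    ["a man is riding a bicycle down the street", "a cyclist on a red bike"],
]
human_ratings = [4.5, 3.0, 4.0]

smooth = SmoothingFunction().method3  # one possible smoothing choice
metric_scores = [
    sentence_bleu(
        [ref.split() for ref in refs],
        cand.split(),
        weights=(1.0, 0.0, 0.0, 0.0),  # unigram BLEU
        smoothing_function=smooth,
    )
    for cand, refs in zip(candidates, references)
]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")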

Cited by 107 publications (86 citation statements: 1 supporting, 85 mentioning, 0 contrasting). References 12 publications.
“…Elliott and Keller (2014) find that both metrics correlate well with human judgments. For a fair comparison, we force our model to output one description, i.e., the most relevant one.…”
Section: Results (supporting)
confidence: 53%
“…Evaluation. Automatic evaluation remains a challenge (Elliott and Keller, 2014). We report both BLEU (Papineni et al., 2002) at 1 without brevity penalty, and METEOR (Banerjee and Lavie, 2005) with balanced precision and recall.…”
Section: Experiments: Association Structure (mentioning)
confidence: 99%
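
The configuration mentioned in the snippet above, "BLEU at 1 without brevity penalty", reduces to clipped (modified) unigram precision against the references. The helper below is a hypothetical illustration of that computation under this reading, not the evaluation code used in any of the cited papers.

# Hypothetical helper: clipped unigram precision, i.e. BLEU-1 with the
# brevity penalty dropped (illustrative only).
from collections import Counter

def unigram_precision(candidate, references):
    cand_counts = Counter(candidate.split())
    # For each token, the candidate count is clipped by its maximum count
    # in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for tok, cnt in Counter(ref.split()).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    clipped = sum(min(cnt, max_ref_counts[tok]) for tok, cnt in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

print(unigram_precision("a dog runs on the grass",
                        ["a brown dog runs across the grass"]))  # 5/6 ≈ 0.833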
“…In turn, a lot of research on NLG evaluation focussed on defining and validating automatic evaluation measures. Such a metric is typically considered valid if it correlates well with human judgements of text quality (Stent et al., 2005; Foster, 2008; Reiter and Belz, 2009; Cahill, 2009; Elliott and Keller, 2014). However, automatic evaluation measures in NLG still have a range of known conceptual deficits, i.e.…”
Section: Background on NLG Evaluation (mentioning)
confidence: 99%
“…One of the most widely applied and least controversial NLG evaluation methods is to collect human ratings. Human ratings have been used for system comparison in a number of NLG shared tasks (Gatt and Belz, 2010; …), for validating other automatic evaluation methods in NLG (Reiter and Belz, 2009; Cahill, 2009; Elliott and Keller, 2014), and for training statistical components of NLG systems (Stent et al., 2004; Mairesse and Walker, 2011; Howcroft et al., 2013).…”
Section: Introduction (mentioning)
confidence: 99%