Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014.
DOI: 10.3115/v1/p14-2074

Comparing Automatic Evaluation Measures for Image Description

Abstract: Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Mete…
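
A minimal sketch of the kind of correlation analysis the abstract describes, assuming NLTK and SciPy are available. The data, variable names, and choice of smoothing below are illustrative only and are not taken from the paper; the idea is simply to score each candidate description against its references with a sentence-level metric and then correlate those scores with human ratings (here with Spearman's rho, one common choice for validating a metric).

# Illustrative sketch, not the authors' code: correlate an automatic
# metric with human judgements of image description quality.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Toy data: one candidate description per image, several references per
# image, and a human quality rating for each candidate (all invented).
candidates = [
    "a dog runs across the grass",
    "two people sitting on a bench",
    "a man rides a red bicycle",
]
references = [
    ["a brown dog is running on the grass", "a dog runs through a field"],
    ["two people sit on a park bench", "a couple resting on a wooden bench"],
    ["a man is riding a bicycle down the street", "a cyclist on a red bike"],
]
human_ratings = [4.5, 3.0, 4.0]

smooth = SmoothingFunction().method3  # one possible smoothing choice
metric_scores = [
    sentence_bleu(
        [ref.split() for ref in refs],
        cand.split(),
        weights=(1.0, 0.0, 0.0, 0.0),  # unigram BLEU
        smoothing_function=smooth,
    )
    for cand, refs in zip(candidates, references)
]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")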

Cited by 107 publications (86 citation statements: 1 supporting, 85 mentioning, 0 contrasting). References 12 publications.
“…Elliott and Keller (2014) find that both metrics correlate well with human judgments. For a fair comparison, we force our model to output one description, i.e., the most relevant one.…”
Section: Results (supporting)
confidence: 53%
“…Evaluation. Automatic evaluation remains a challenge (Elliott and Keller, 2014). We report both BLEU (Papineni et al., 2002) at 1 without brevity penalty, and METEOR (Banerjee and Lavie, 2005) with balanced precision and recall.…”
Section: Experiments: Association Structure (mentioning)
confidence: 99%
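
The configuration mentioned in the snippet above, "BLEU at 1 without brevity penalty", reduces to clipped (modified) unigram precision against the references. The helper below is a hypothetical illustration of that computation under this reading, not the evaluation code used in any of the cited papers.

# Hypothetical helper: clipped unigram precision, i.e. BLEU-1 with the
# brevity penalty dropped (illustrative only).
from collections import Counter

def unigram_precision(candidate, references):
    cand_counts = Counter(candidate.split())
    # For each token, the candidate count is clipped by its maximum count
    # in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for tok, cnt in Counter(ref.split()).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], cnt)
    clipped = sum(min(cnt, max_ref_counts[tok]) for tok, cnt in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

print(unigram_precision("a dog runs on the grass",
                        ["a brown dog runs across the grass"]))  # 5/6 ≈ 0.833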
“…In turn, a lot of research on NLG evaluation focussed on defining and validating automatic evaluation measures. Such a metric is typically considered valid if it correlates well with human judgements of text quality (Stent et al., 2005; Foster, 2008; Reiter and Belz, 2009; Cahill, 2009; Elliott and Keller, 2014). However, automatic evaluation measures in NLG still have a range of known conceptual deficits, i.e.…”
Section: Background on NLG Evaluation (mentioning)
confidence: 99%
“…One of the most widely applied and least controversial NLG evaluation methods is to collect human ratings. Human ratings have been used for system comparison in a number of NLG shared tasks (Gatt and Belz, 2010; …), for validating other automatic evaluation methods in NLG (Reiter and Belz, 2009; Cahill, 2009; Elliott and Keller, 2014), and for training statistical components of NLG systems (Stent et al., 2004; Mairesse and Walker, 2011; Howcroft et al., 2013).…”
Section: Introduction (mentioning)
confidence: 99%