Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.210
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Abstract: Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatic…

Cited by 18 publications (15 citation statements)
References 20 publications
“…To better assess the performance of a captioning system, it is common practice to consider a set of the above-mentioned standard metrics. Nevertheless, these are somehow gameable because they favor word similarity rather than meaning correctness [138]. Another drawback of the standard metrics is that they do not capture (but rather disfavor) the desirable capability of the system to produce novel and diverse captions, which is more in line with the variability with which humans describe complex images.…”
Section: Diversity Metrics (mentioning)
confidence: 99%
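The gameability noted in the statement above can be illustrated with an n-gram overlap metric such as BLEU: a hypothesis that swaps a single content word (and so changes the meaning) can outscore a correct paraphrase that shares little vocabulary with the reference. A minimal sketch using NLTK's sentence_bleu, with hypothetical captions:

```python
# Minimal sketch (hypothetical captions): n-gram overlap metrics such as BLEU
# reward word similarity, not meaning correctness.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man is riding a brown horse on the beach".split()]
wrong_meaning = "a man is riding a brown dog on the beach".split()            # one word changed, meaning wrong
correct_paraphrase = "someone rides a chestnut mare along the shore".split()  # meaning right, few shared words

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, wrong_meaning, smoothing_function=smooth))       # high score
print(sentence_bleu(reference, correct_paraphrase, smoothing_function=smooth))  # near zero
```

The single-word substitution keeps most n-gram matches and scores far higher than the paraphrase, even though only the paraphrase preserves the meaning.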
“…Recent work has shown that other metrics, such as diversity of outputs, are important for evaluating the quality of LMs as models for language generation (Hashimoto et al., 2019; Caccia et al., 2020). Generation also depends on a number of other factors, such as choice of decoding procedure (Caglayan et al., 2020). Here, we focus on LMs as predictive models, measuring their ability to place an accurate distribution over future words and sentences, rather than their ability to generate useful or coherent text (see Appendix C).…”
Section: |N (mentioning)
confidence: 99%
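The dependence on the decoding procedure mentioned in this statement can be sketched with a toy next-token distribution: greedy decoding collapses to a single output, while sampling (optionally temperature-scaled) yields varied ones. The vocabulary and probabilities below are hypothetical:

```python
# Minimal sketch (toy next-token distribution, hypothetical vocabulary):
# the same model probabilities give very different outputs depending on decoding.
import random

probs = {"horse": 0.40, "dog": 0.35, "mare": 0.20, "zebra": 0.05}

def greedy(p):
    # always returns the argmax token -> no diversity across runs
    return max(p, key=p.get)

def sample(p, temperature=1.0):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = {w: v ** (1.0 / temperature) for w, v in p.items()}
    total = sum(scaled.values())
    return random.choices(list(scaled), weights=[v / total for v in scaled.values()])[0]

print([greedy(probs) for _ in range(5)])                    # ['horse', 'horse', ...]
print([sample(probs, temperature=1.0) for _ in range(5)])   # varied tokens
```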
“…For open-ended text generation tasks like answering why-questions, the absence of an automatic evaluation that correlates well with human judgments is a major challenge (Chen et al., 2019; Ma et al., 2019; Caglayan et al., 2020; Howcroft et al., 2020).…”
Section: Human Evaluation (mentioning)
confidence: 99%
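"Correlates well with human judgments" is typically quantified as a rank correlation between metric scores and human ratings over the same outputs. A minimal sketch using scipy's spearmanr; the scores and ratings below are made-up placeholders:

```python
# Minimal sketch (made-up scores): measuring how well an automatic metric
# tracks human judgments via Spearman rank correlation.
from scipy.stats import spearmanr

metric_scores = [0.31, 0.45, 0.12, 0.58, 0.40]   # e.g. per-output BLEU (hypothetical)
human_ratings = [3.0, 4.5, 2.0, 3.5, 4.0]        # e.g. 1-5 adequacy ratings (hypothetical)

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```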