Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.253

Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Abstract: Automatic image captioning has improved significantly over the last few years, but the problem is far from being solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model the caption quality from a human perspective and without access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on previously un…
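As the abstract describes, QE scores a caption without reference captions so that low-quality outputs can be suppressed at prediction time. Below is a minimal sketch of that gating step; the `qe_score` callable and the length-based stand-in scorer are hypothetical placeholders for a trained QE model, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaptionCandidate:
    image_id: str
    caption: str


def filter_captions(
    candidates: list[CaptionCandidate],
    qe_score: Callable[[CaptionCandidate], float],
    threshold: float = 0.5,
) -> list[CaptionCandidate]:
    """Keep only captions whose estimated (reference-free) quality clears the threshold."""
    return [c for c in candidates if qe_score(c) >= threshold]


if __name__ == "__main__":
    # Stand-in scorer: a real QE model would consume image features and caption
    # text; here the score is faked from caption length purely for illustration.
    def dummy_qe_score(c: CaptionCandidate) -> float:
        return min(len(c.caption.split()) / 10.0, 1.0)

    candidates = [
        CaptionCandidate("img_001", "a dog"),
        CaptionCandidate("img_001", "a brown dog running across a grassy field"),
    ]
    for kept in filter_captions(candidates, dummy_qe_score, threshold=0.6):
        print(kept.caption)
```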

Cited by 12 publications (8 citation statements) | References 23 publications (22 reference statements)
“…Significant prior work has explored the collection of human judgments of the quality of visual descriptions. Human judgment is considered the gold standard for visual description evaluation, and previous studies typically rely on human annotators to rate caption quality on one or multiple axes (Levinboim et al., 2021; Kasai et al., 2022). While automated methods exist for the evaluation of caption quality (Agarwal and Lavie, 2008; Vedantam et al., 2015; Papineni et al., 2002), recent work including THUMB (Kasai et al., 2022), which has run human evaluations on captions produced by models based on "Precision", "Recall", "Fluency", "Conciseness" and "Inclusive Language", has shown that humans produce captions which score significantly higher when judged by human raters than when judged by existing measures (and further, that human judgments of quality correlate poorly with existing measures), necessitating human evaluation as opposed to evaluation of captioning methods using automated measures for caption quality.…”
Section: F Human Studies (mentioning)
confidence: 99%
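The excerpt above hinges on how well automatic caption metrics track human ratings. The snippet below illustrates that correlation check with made-up numbers (not data from the cited studies); in practice the metric scores would come from per-caption values of a measure such as the ones cited above.

```python
# Illustrative correlation check between human ratings and an automatic metric.
# All numbers below are fabricated for demonstration only.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0, 2.5]        # e.g. mean rater score per caption
metric_scores = [0.62, 0.55, 0.40, 0.48, 0.70, 0.30]  # e.g. an automatic metric per caption

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low |rho| would support the claim that automatic measures correlate
# poorly with human judgments of caption quality.
```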
“…Many large-scale datasets have been created [6, 10-20], but in contrast to the previous ones, they employ automated pipelines. One example of a dataset that follows this approach is the Conceptual Captions dataset [5] which has more than 3.3M pairs of images and English captions.…”
Section: Related Work (mentioning)
confidence: 99%
“…[1-10], [11-100], [101-1000], [1001-10,000], [10,001-20,000], and 20,001 or more.…”
unclassified
“…QE is widely established in machine translation (MT) tasks (Specia et al., 2013; Martins et al., 2017; Specia et al., 2018). Recently, Levinboim et al. (2021) introduced large-scale human ratings on image-caption pairs for training QE models in image captioning tasks. Our work also trains a caption QE model, (i.e.…
Section: Related Work (mentioning)
confidence: 99%
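The excerpt above describes training QE models on large-scale human ratings of image-caption pairs. The following is a minimal sketch of that setup under stated assumptions: the joint image-caption features and ratings are synthetic placeholders, and ridge regression stands in for whatever model the cited works actually use.

```python
# Minimal sketch: fit a regression-style QE model on human ratings of
# image-caption pairs. Features and ratings are synthetic placeholders;
# a real system would use joint image-caption representations.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical dataset: one feature vector per image-caption pair,
# paired with a human quality rating scaled to [0, 1].
X = rng.normal(size=(500, 32))
y = np.clip(X[:, :4].mean(axis=1) + 0.1 * rng.normal(size=500), -1.0, 1.0) * 0.5 + 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
qe_model = Ridge(alpha=1.0).fit(X_train, y_train)
print("held-out R^2:", round(qe_model.score(X_test, y_test), 3))
```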