2022
DOI: 10.1109/tpami.2020.3013834
On Diversity in Image Captioning: Metrics and Methods

Cited by 30 publications (12 citation statements) · References 52 publications
“…However, using multiple models and sampling multiple descriptions may lead to redundancy. Devising image captioning models that produce diverse and distinct fine-grained image descriptions may improve performance on the CLEVER task; an active area of research [59,61] is looking into this problem.…”
Section: Discussion (mentioning)
Confidence: 99%
“…The long tail in the empirical distribution of the image-caption dataset MSCOCO is introduced in [3]. As in linguistic studies, lower-frequency words often carry higher information entropy [20], and [52] argues that a long tail in the word-frequency distribution indicates higher diversity in the generated captions. There are two recent works related to the long-tail phenomenon in seq2seq models [53], [54].…”
Section: Related Work (mentioning)
Confidence: 99%
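
To make the entropy argument in the statement above concrete: the diversity of a caption set is often proxied by the Shannon entropy of its word-frequency distribution (a longer tail of rare words raises the entropy) or by Distinct-n, the ratio of unique to total n-grams. A minimal Python sketch; the function names and toy captions are ours, not from the cited works:

```python
from collections import Counter
import math

def word_entropy(captions):
    # Shannon entropy (bits) of the word-frequency distribution over
    # a set of captions; a longer tail of rare words raises it.
    counts = Counter(w for c in captions for w in c.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def distinct_n(captions, n=2):
    # Distinct-n: unique n-grams divided by total n-grams.
    grams = [tuple(ws[i:i + n])
             for c in captions
             for ws in [c.lower().split()]
             for i in range(len(ws) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

captions = ["a man rides a horse on a beach",
            "a person riding a horse along the shore",
            "a man on a horse near the ocean"]
print(word_entropy(captions), distinct_n(captions, 2))
```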
“…Following [52], we perform a user study to compare the correlations of the CIDErBtw metric (CB) and VSE++ Recall (VR) with human judgment. To show that our metric generalizes across varying degrees of model similarity, we run experiments on captions from three model pairs: 1) two captions from different models (DiscCap and StackCap); 2) two captions from similar models trained under two different conditions (Transformer+SCST models trained with and without DCR); 3) two captions from the same model (the 1st and 2nd results from beam search on Transformer+SCST+DCR).…”
Section: CIDErBtw as a Metric (mentioning)
Confidence: 99%
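
For readers unfamiliar with CIDErBtw: as we understand it from the quoted work, it scores a caption by its average CIDEr similarity to the reference captions of retrieved similar images, so a lower score suggests a more distinctive caption. A minimal sketch assuming a generic cider(candidate, references) scorer; the signature and names are illustrative stand-ins, not the cited paper's API:

```python
from typing import Callable, List

def cider_btw(caption: str,
              similar_image_refs: List[List[str]],
              cider: Callable[[str, List[str]], float]) -> float:
    # Average CIDEr between `caption` and the reference captions of
    # each similar image (one reference list per retrieved image).
    # `cider` is a hypothetical stand-in for any off-the-shelf CIDEr
    # implementation; lower output = more distinctive caption.
    scores = [cider(caption, refs) for refs in similar_image_refs]
    return sum(scores) / len(scores)
```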
“…Another drawback of the standard metrics is that they do not capture (but rather disfavor) the desirable capability of a system to produce novel and diverse captions, which is more in line with the variability with which humans describe complex images. This consideration led to the development of diversity metrics [132], [139], [140], [141]. Most of these metrics can potentially be calculated even when no ground-truth captions are available at test time.…”
Section: Diversity Metrics (mentioning)
Confidence: 99%
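
As a generic illustration of a reference-free diversity metric (not one of the specific metrics of [132], [139], [140], [141]): one simple formulation scores a set of captions sampled for a single image by one minus their mean pairwise n-gram overlap, and so needs no ground-truth captions at test time:

```python
def pairwise_diversity(captions, n=1):
    # 1 minus the mean pairwise Jaccard overlap of n-gram sets among
    # captions sampled for the same image; higher = more varied.
    # Assumes at least two non-empty captions.
    def ngrams(c):
        ws = c.lower().split()
        return set(tuple(ws[i:i + n]) for i in range(len(ws) - n + 1))
    sets = [ngrams(c) for c in captions]
    sims = [len(a & b) / len(a | b)
            for i, a in enumerate(sets) for b in sets[i + 1:]]
    return 1.0 - sum(sims) / len(sims)

print(pairwise_diversity(["a dog runs in the park",
                          "a puppy plays on the grass"]))
```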