Learning to Evaluate Image Captioning

Cui, Yin; Yang, Guandao; Veit, Andreas; Huang, Xun; Belongie, Serge

doi:10.1109/cvpr.2018.00608

Cited by 116 publications

(108 citation statements)

References 25 publications

Supporting

Mentioning

102

Contrasting

Unclassified

Order By: Relevance

“…We compare our metrics with mBLEU mix = 1 − 1 4 4 n=1 mBLEU n , which accounts for mBLEU- 4 In our instructions, diversity refers to different words, phrases, sentence structures, semantics or other factors that impact diversity. {1,2,3,4}, and we invert the score so that it is consistent with our diversity metrics (higher values indicate more diversity).…”

Section: Considering Diversity and Accuracymentioning

confidence: 99%

Describing Like Humans: On Diversity in Image Captioning

Wang

Chan

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

Recently, the state-of-the-art models for image captioning have overtaken human performance based on the most popular metrics, such as BLEU, METEOR, ROUGE and CIDEr. Does this mean we have solved the task of image captioning? The above metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, and thus there is a variety of captions that express different concepts and details that might be interesting for different humans. Therefore only evaluating accuracy is not sufficient for measuring the performance of captioning models -the diversity of the generated captions should also be considered. In this paper, we proposed a new metric for measuring diversity of image captions, which is derived from latent semantic analysis and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in the context of both diversity and accuracy. We find that there is still a large gap between the model and human performance in terms of both accuracy and diversity, and that models that have optimized accuracy (CIDEr) have low diversity. We also show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy of the generated captions. arXiv:1903.12020v3 [cs.CV] 15 May 2019 Measuring Diversity of Image CaptionsCurrently, the widely used metrics, such as BLEU, CIDEr, and SPICE are for a single caption prediction. To evaluate a set of captions C = {c 1 , c 2 , · · · , c m }, two dimensions are required: accuracy and diversity. For accuracy,

show abstract

Section: Considering Diversity and Accuracymentioning

confidence: 99%

Describing Like Humans: On Diversity in Image Captioning

Wang

Chan

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

show abstract

“…In contrast, human annotations tend to have varying captions, since the background knowledge of each person varies, leading to lower BMRC metrics. SPICE [1] has been shown to be better correlated with human judgement [6]. Unfortunately, the SPICE metric is currently not available from the online test server.…”

Section: Results On Online Test Setmentioning

confidence: 99%

Gated Hierarchical Attention for Image Captioning

Wang

Chan

2019

Computer Vision – ACCV 2018

View full text Add to dashboard Cite

Attention modules connecting encoder and decoders have been widely applied in the field of object recognition, image captioning, visual question answering and neural machine translation, and significantly improves the performance. In this paper, we propose a bottomup gated hierarchical attention (GHA) mechanism for image captioning. Our proposed model employs a CNN as the decoder which is able to learn different concepts at different layers, and apparently, different concepts correspond to different areas of an image. Therefore, we develop the GHA in which low-level concepts are merged into high-level concepts and simultaneously low-level attended features pass to the top to make predictions. Our GHA significantly improves the performance of the model that only applies one level attention, e.g., the CIDEr score increases from 0.923 to 0.999, which is comparable to the state-of-theart models that employ attributes boosting and reinforcement learning (RL). We also conduct extensive experiments to analyze the CNN decoder and our proposed GHA, and we find that deeper decoders cannot obtain better performance, and when the convolutional decoder becomes deeper the model is likely to collapse during training. Code is available: https://github.com/qingzwang/GHA-ImageCaptioning. Keywords: Hierarchical Attention · Image Captioning · Convolutional Decoder. Recently, CNNs are the most popular vision module, such as VGG nets [33], Google nets [35] and residual nets [14] (in this paper, we call them Image-CNNs). It is believed that introducing more information benefits the performance, and hence some models employ object detection or transfer image features into attributes to obtain more details or semantic information of an image [2,9,46,42,45,11]. However, applying object detection or attributes boosting

show abstract

“…[39,31,57]. [8] train a general critic network to learn to score captions, providing various types of corrupted captions as negatives. [51] use a composite metric, a classifier trained on the automatic scores as input.…”

Section: Related Workmentioning

confidence: 99%

Adversarial Inference for Multi-Sentence Video Description

Park

Rohrbach

Darrell

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

While significant progress has been made in the image captioning task, video description is still in its infancy due to the complex nature of video data. Generating multisentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning based methods have been explored to improve the image captioning models; however, both types of methods suffer from a number of issues, e.g. poor readability and high redundancy for RL and stability issues for GANs. In this work, we instead propose to apply adversarial techniques during inference, designing a discriminator which encourages better multi-sentence video description. In addition, we find that a multi-discriminator "hybrid" design, where each discriminator targets one aspect of a description, leads to the best results. Specifically, we decouple the discriminator to evaluate on three criteria: 1) visual relevance to the video, 2) language diversity and fluency, and 3) coherence across sentences. Our approach results in more accurate, diverse, and coherent multi-sentence video descriptions, as shown by automatic as well as human evaluation on the popular ActivityNet Captions dataset.

show abstract

Learning to Evaluate Image Captioning

Cited by 116 publications

References 25 publications

Describing Like Humans: On Diversity in Image Captioning

Describing Like Humans: On Diversity in Image Captioning

Gated Hierarchical Attention for Image Captioning

Adversarial Inference for Multi-Sentence Video Description

Contact Info

Product

Resources

About