2017 IEEE International Conference on Computer Vision (ICCV) 2017
DOI: 10.1109/iccv.2017.100
|View full text |Cite
|
Sign up to set email alerts
|

Improved Image Captioning via Policy Gradient optimization of SPIDEr

Abstract: Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) method to directly optimize a linear combina… Show more

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
226
0

Year Published

2018
2018
2019
2019

Publication Types

Select...
4
2
1

Relationship

0
7

Authors

Journals

citations
Cited by 336 publications
(238 citation statements)
references
References 22 publications
0
226
0
Order By: Relevance
“…Meanwhile, convolutional neural network (CNN) is best‐suited for extracting both global and fine features of an object. Frameworks that combined CNN (encoding spatial information) and RNN (encoding temporal information) have achieved significant success in video prediction . Inspired by these studies, we developed a customized deep learning algorithm that integrated both CNN and RNN units to predict the spatial tumor distribution in a longitudinal imaging study, and evaluated the impact of the structural design on the predictive accuracy.…”
Section: Introductionmentioning
confidence: 99%
See 1 more Smart Citation
“…Meanwhile, convolutional neural network (CNN) is best‐suited for extracting both global and fine features of an object. Frameworks that combined CNN (encoding spatial information) and RNN (encoding temporal information) have achieved significant success in video prediction . Inspired by these studies, we developed a customized deep learning algorithm that integrated both CNN and RNN units to predict the spatial tumor distribution in a longitudinal imaging study, and evaluated the impact of the structural design on the predictive accuracy.…”
Section: Introductionmentioning
confidence: 99%
“…Frameworks that combined CNN (encoding spatial information) and RNN (encoding temporal information) have achieved significant success in video prediction. [21][22][23] Inspired by these studies, we developed a customized deep learning algorithm that integrated both CNN and RNN units to predict the spatial tumor distribution in a longitudinal imaging study, and evaluated the impact of the structural design on the predictive accuracy. Furthermore, we assessed the characteristics of the prediction including its timing, frequency, and spatial accuracy to prepare for its integration into the clinical workflow of ART.…”
Section: Introductionmentioning
confidence: 99%
“…Extensions involve object detectors [42], attention-based deep networks [1], and convolutional approaches [2]. Beyond maximum likelihood, reinforcement learning based techniques have also been discussed to produce a single caption, directly optimizing perceptual metrics [28,33]. All these methods have demonstrated compelling results and have consequently been adopted widely.…”
Section: Related Workmentioning
confidence: 99%
“…Inspired by the recent advances in reinforcement learning, several attempts have been made to apply policy gradient algorithms to image captioning task [4,52,56], which could generally be categorized into two groups: policy based and actor-critic based. Policy based methods (e.g., DISC [10], SCST [39], PG-SPIDEr [28], CAVP [25], TD [8]) utilize the unbiased REINFORCE [48] algorithm which optimizes the gradient of the expected reward by sampling a complete sequence from the model during training. To suppress high variance of Monte-Carlo sampling, Self-critical Sequential Training (SCST) [39] utilizes a baseline subtracted from the return which is added to reduce the variance of gradient estimation.…”
Section: Related Work 21 Sentence-level Captioning With Reinforcemenmentioning
confidence: 99%
“…Recently, another line of work tackles the exposure bias and takes advantage of non-differential evaluation feedback by applying reinforcement learning, especially the REINFORCE [48] algorithm for the sentence-level captioning task [8,25,28,39,55]. This strategy reformulates the image captioning as the sequential decision-making process, where the language policy based on its previous decisions is directly optimized.…”
Section: Introductionmentioning
confidence: 99%