Improved Image Captioning via Policy Gradient optimization of SPIDEr

Liu, Siqi; Zhu, Zhenya; Ye, Ning; Guadarrama, Sergio; Murphy, Kevin M.

doi:10.1109/iccv.2017.100

Cited by 336 publications

(238 citation statements)

References 22 publications

Supporting

Mentioning

226

Contrasting

“…Meanwhile, convolutional neural network (CNN) is best‐suited for extracting both global and fine features of an object. Frameworks that combined CNN (encoding spatial information) and RNN (encoding temporal information) have achieved significant success in video prediction. Inspired by these studies, we developed a customized deep learning algorithm that integrated both CNN and RNN units to predict the spatial tumor distribution in a longitudinal imaging study, and evaluated the impact of the structural design on the predictive accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

“…Frameworks that combined CNN (encoding spatial information) and RNN (encoding temporal information) have achieved significant success in video prediction. [21][22][23] Inspired by these studies, we developed a customized deep learning algorithm that integrated both CNN and RNN units to predict the spatial tumor distribution in a longitudinal imaging study, and evaluated the impact of the structural design on the predictive accuracy. Furthermore, we assessed the characteristics of the prediction including its timing, frequency, and spatial accuracy to prepare for its integration into the clinical workflow of ART.…”

Section: Introduction (mentioning)

confidence: 99%

Toward predicting the evolution of lung tumors during radiotherapy observed on a longitudinal MR imaging study via a deep learning algorithm

Wang, Rimner, et al. 2019

Medical Physics

Purpose: To predict the spatial and temporal trajectories of lung tumor during radiotherapy monitored under a longitudinal magnetic resonance imaging (MRI) study via a deep learning algorithm for facilitating adaptive radiotherapy (ART). Methods: We monitored 10 lung cancer patients by acquiring weekly MRI‐T2w scans over a course of radiotherapy. Under an ART workflow, we developed a predictive neural network (P‐net) to predict the spatial distributions of tumors in the coming weeks utilizing images acquired earlier in the course. The three‐step P‐net consisted of a convolutional neural network to extract relevant features of the tumor and its environment, followed by a recurrence neural network constructed with gated recurrent units to analyze trajectories of tumor evolution in response to radiotherapy, and finally an attention model to weight the importance of weekly observations and produce the predictions. The performance of P‐net was measured with Dice and root mean square surface distance (RMSSD) between the algorithm‐predicted and experts‐contoured tumors under a leave‐one‐out scheme. Results: Tumor shrinkage was 60% ± 27% (mean ± standard deviation) by the end of radiotherapy across nine patients. Using images from the first three weeks, P‐net predicted tumors on future weeks (4, 5, 6) with a Dice and RMSSD of (0.78 ± 0.22, 0.69 ± 0.24, 0.69 ± 0.26), and (2.1 ± 1.1 mm, 2.3 ± 0.8 mm, 2.6 ± 1.4 mm), respectively. Conclusion: The proposed deep learning algorithm can capture and predict spatial and temporal patterns of tumor regression in a longitudinal imaging study. It closely follows the clinical workflow, and could facilitate the decision‐making of ART. A prospective study including more patients is warranted.

“…Extensions involve object detectors [42], attention-based deep networks [1], and convolutional approaches [2]. Beyond maximum likelihood, reinforcement learning based techniques have also been discussed to produce a single caption, directly optimizing perceptual metrics [28,33]. All these methods have demonstrated compelling results and have consequently been adopted widely.…”

Section: Related Work (mentioning)

confidence: 99%

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

Aneja, Agrawal, Batra, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags [10,40]. Common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the steps of generation. Both methods offer no fine-grained control. To address this concern, we propose Seq-CVAE which learns a latent space for every word position. We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future. We illustrate the efficacy of the proposed approach to anticipate the sentence continuation on the challenging MSCOCO dataset, significantly improving diversity metrics compared to baselines while performing on par w.r.t. sentence quality.

“…Inspired by the recent advances in reinforcement learning, several attempts have been made to apply policy gradient algorithms to image captioning task [4,52,56], which could generally be categorized into two groups: policy based and actor-critic based. Policy based methods (e.g., DISC [10], SCST [39], PG-SPIDEr [28], CAVP [25], TD [8]) utilize the unbiased REINFORCE [48] algorithm which optimizes the gradient of the expected reward by sampling a complete sequence from the model during training. To suppress high variance of Monte-Carlo sampling, Self-critical Sequential Training (SCST) [39] utilizes a baseline subtracted from the return which is added to reduce the variance of gradient estimation.…”

Section: Related Work, 2.1 Sentence-level Captioning With Reinforcemen… (mentioning)

confidence: 99%
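
The statement above describes REINFORCE with a baseline subtracted from the return. The following is a minimal sketch of that idea, not the method of any cited paper: it assumes a toy one-step softmax policy over three actions with made-up rewards, and it uses the expected reward as the baseline (SCST, by contrast, uses the reward of the greedy-decoded caption). The function names `reinforce_grad` and `estimator_stats` are hypothetical. Because the toy action space is tiny, the mean and variance of the estimator are computed exactly by enumeration rather than by sampling.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(action, probs, reward, baseline=0.0):
    """Single-sample REINFORCE estimate of d E[reward] / d logits.

    For a softmax policy, d log p(action) / d logits = onehot(action) - probs.
    """
    advantage = reward - baseline
    return [advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

def estimator_stats(probs, rewards, baseline=0.0):
    """Exact mean and total variance of the estimator, enumerating all actions."""
    n = len(probs)
    grads = [reinforce_grad(a, probs, rewards[a], baseline) for a in range(n)]
    mean = [sum(p * g[i] for p, g in zip(probs, grads)) for i in range(n)]
    var = sum(p * sum((gi - mi) ** 2 for gi, mi in zip(g, mean))
              for p, g in zip(probs, grads))
    return mean, var

# Toy setup (assumed for illustration only): uniform 3-action policy.
probs = softmax([0.0, 0.0, 0.0])
rewards = [1.0, 2.0, 3.0]  # hypothetical per-action rewards
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = estimator_stats(probs, rewards)
mean_base, var_base = estimator_stats(probs, rewards, baseline=expected_reward)

print("mean gradient (no baseline):", mean_plain)
print("mean gradient (baseline):   ", mean_base)
print("variance without / with baseline:", var_plain, var_base)
```

In this toy setting the two mean gradients agree, since a constant baseline does not bias the estimator, while the variance with the baseline is smaller.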

“…Recently, another line of work tackles the exposure bias and takes advantage of non-differential evaluation feedback by applying reinforcement learning, especially the REINFORCE [48] algorithm for the sentence-level captioning task [8,25,28,39,55]. This strategy reformulates the image captioning as the sequential decision-making process, where the language policy based on its previous decisions is directly optimized.…”

Section: Introduction (mentioning)

confidence: 99%

Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation

Luo, Huang, Zhang, et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

Visual paragraph generation aims to automatically describe a given image from different perspectives and organize sentences in a coherent way. In this paper, we address three critical challenges for this task in a reinforcement learning setting: the mode collapse, the delayed feedback, and the time-consuming warm-up for policy networks. Generally, we propose a novel Curiosity-driven Reinforcement Learning (CRL) framework to jointly enhance the diversity and accuracy of the generated paragraphs. First, by modeling the paragraph captioning as a long-term decision-making process and measuring the prediction uncertainty of state transitions as intrinsic rewards, the model is incentivized to memorize precise but rarely spotted descriptions to context, rather than being biased towards frequent fragments and generic patterns. Second, since the extrinsic reward from evaluation is only available until the complete paragraph is generated, we estimate its expected value at each time step with temporal-difference learning, by considering the correlations between successive actions. Then the estimated extrinsic rewards are complemented by dense intrinsic rewards produced from the derived curiosity module, in order to encourage the policy to fully explore action space and find a global optimum. Third, discounted imitation learning is integrated for learning from human demonstrations, without separately performing the time-consuming warm-up in advance. Extensive experiments conducted on the Stanford image-paragraph dataset demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 38.4% compared with state-of-the-art.
