2017
DOI: 10.48550/arxiv.1706.09601
Preprint

Actor-Critic Sequence Training for Image Captioning

Abstract: Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence-driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about, such as CIDEr. In this paper we investigate training imag…
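The abstract contrasts maximum-likelihood (cross-entropy) training with directly optimising a caption metric. As a point of reference only (a generic formulation, not text taken from the paper), the two objectives for a caption w = (w_1, ..., w_T) of image I can be written as:

```latex
% Cross-entropy (MLE) objective vs. expected-reward (RL) objective;
% w* is the ground-truth caption and r(.) a sentence-level metric such as CIDEr.
\begin{align}
  L_{\mathrm{XE}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\!\left(w_t^{*} \mid w_{1:t-1}^{*},\, I\right) \\
  L_{\mathrm{RL}}(\theta) &= -\,\mathbb{E}_{w \sim p_\theta(\cdot \mid I)}\!\left[\, r(w) \,\right]
\end{align}
```

Minimising L_XE only matches the annotated word sequence, whereas minimising L_RL targets the metric itself, which is the gap the paper's actor-critic training is aimed at.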

Citations: Cited by 31 publications (29 citation statements)
References: 20 publications
“…It assumes that each token makes the same contribution towards the sentence. An Actor-Critic based method [20] was also applied to image captioning, utilizing two networks as the Actor and the Critic respectively. Ren et al. [3] recast image captioning as a decision-making framework and utilized a policy network to choose the next word and a value network to evaluate the policy.…”
Section: Sequential Decision-making (mentioning, confidence: 99%)
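As a concrete illustration of the two-network setup described in the statement above, here is a minimal sketch, assuming a PyTorch-style implementation with a GRU decoder and a pre-extracted CNN image feature. All module names, dimensions, and the <BOS> token id are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of an actor-critic pair for
# caption decoding: the "actor" emits a distribution over the next word and the
# "critic" scores the current decoding state.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, FEAT_DIM = 10000, 256, 512, 2048

class Actor(nn.Module):
    """Policy network: predicts the next-word distribution given the image and prefix."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.init_h = nn.Linear(FEAT_DIM, HIDDEN_DIM)   # image feature -> initial state
        self.rnn = nn.GRUCell(EMBED_DIM, HIDDEN_DIM)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, prev_word, h):
        h = self.rnn(self.embed(prev_word), h)
        return self.out(h), h

class Critic(nn.Module):
    """Value network: estimates the expected future reward of the current state."""
    def __init__(self):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
                                   nn.Linear(HIDDEN_DIM, 1))

    def forward(self, h):
        return self.value(h).squeeze(-1)

# One sampled decoding step: the actor proposes a word, the critic scores the state.
actor, critic = Actor(), Critic()
image_feat = torch.randn(4, FEAT_DIM)            # batch of 4 pre-extracted CNN features
h = torch.tanh(actor.init_h(image_feat))
prev_word = torch.zeros(4, dtype=torch.long)     # <BOS> token id assumed to be 0
logits, h = actor(prev_word, h)
word = torch.distributions.Categorical(logits=logits).sample()
state_value = critic(h)                          # per-example value estimate / baseline
```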
“…An emerging line of work on endowing machines with reasoning is to apply deep reinforcement learning (RL) to the sequence prediction task of image captioning [3], [18], [19], [20]. As illustrated in Figure 1a, we first frame the traditional encoder-decoder image captioning model as a decision-making process, where the visual encoder can be viewed as a Visual Policy (VP) that decides where to hold a gaze in the image, and the language decoder can be viewed as a Language Policy (LP) that decides what the next word is.…”
Section: Introduction (mentioning, confidence: 99%)
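A rough sketch of the decision-making framing quoted above, under the assumption that the visual policy is an attention distribution over pre-extracted region features and the language policy is a softmax over the vocabulary. The weights here are random placeholders rather than a trained model, and the small vocabulary is for illustration only.

```python
# Illustrative one-step view of "where to look" (visual policy) followed by
# "what to say" (language policy); assumed structure, not the cited paper's code.
import torch
import torch.nn.functional as F

NUM_REGIONS, FEAT_DIM, HIDDEN_DIM, VOCAB_SIZE = 36, 2048, 512, 1000

regions = torch.randn(NUM_REGIONS, FEAT_DIM)   # region features from a detector/CNN
h = torch.randn(HIDDEN_DIM)                    # current language-decoder state

# Visual policy: a distribution over image regions, conditioned on the decoder state.
W_att = torch.randn(HIDDEN_DIM, FEAT_DIM)
attn_logits = regions @ (W_att.t() @ h)        # (NUM_REGIONS,)
gaze = F.softmax(attn_logits, dim=0)           # "where to hold a gaze"
context = gaze @ regions                       # attended visual context, (FEAT_DIM,)

# Language policy: a distribution over the vocabulary, conditioned on state + context.
W_out = torch.randn(VOCAB_SIZE, HIDDEN_DIM + FEAT_DIM)
word_logits = W_out @ torch.cat([h, context])
next_word = torch.distributions.Categorical(logits=word_logits).sample()
```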
“…A discriminative reward for each word is not considered under this training strategy. To remedy this, Zhang et al. (2017) used another RNN to predict the state value function for different words. However, the value they computed is not directly related to the evaluation scores, which introduces estimation bias.…”
Section: Related Work (mentioning, confidence: 99%)
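For context, a common per-token advantage formulation of this kind of actor-critic training (a generic form, not necessarily the exact estimator of the cited work) pairs a sentence-level metric reward with a learned state-value baseline:

```latex
% Policy gradient with a learned per-token baseline V_phi(s_t);
% r(w_{1:T}) is the sentence-level metric score of the sampled caption.
\begin{align}
  \nabla_\theta L_{\mathrm{RL}}(\theta)
    &\approx -\sum_{t=1}^{T} \nabla_\theta \log p_\theta\!\left(w_t \mid w_{1:t-1},\, I\right)
       \left( r(w_{1:T}) - V_\phi(s_t) \right) \\
  L_{\mathrm{critic}}(\phi)
    &= \sum_{t=1}^{T} \left( V_\phi(s_t) - r(w_{1:T}) \right)^{2}
\end{align}
```

If V_phi is trained toward a target other than the evaluation metric itself, the baseline no longer tracks the scores being optimised, which is the estimation bias the statement above refers to.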
“…For example, Ranzato et al. [36] defined the reward based on a sequence-level metric (e.g., bilingual evaluation understudy (BLEU) [38] or recall-oriented understudy for gisting evaluation (ROUGE) [39]) that was used as an evaluation metric during the test stage to train the captioning model, thus leading to a notable performance improvement. Similarly, Zhang et al. [40] designed an actor-critic algorithm that formulated a per-token advantage function and value estimation strategy into the reinforcement-learning-based captioning model to directly optimize non-differentiable quality metrics of interest. Rennie et al. [5] proposed a self-critical sequence training approach that normalized the rewards using the output of its own test-time inference algorithm for steadier training.…”
Section: Related Work (mentioning, confidence: 99%)
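To make the self-critical normalization mentioned for Rennie et al. [5] concrete, here is a hedged sketch: the reward of a sampled caption is baselined by the reward of the model's own greedy (test-time) decode, so no separate value network is needed. The function and variable names, and the toy numbers, are placeholders assumed for illustration.

```python
# Self-critical REINFORCE loss: the greedy-decoding reward serves as the baseline.
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    """log_probs:      (T,) log-probabilities of the sampled caption's words
    sampled_reward: scalar metric score (e.g. CIDEr) of the sampled caption
    greedy_reward:  scalar metric score of the greedy caption (the baseline)"""
    advantage = sampled_reward - greedy_reward       # self-critical advantage
    return -(advantage * log_probs.sum())

# Toy usage: a sampled caption scoring above the greedy one gets its
# log-probability pushed up, and vice versa.
log_probs = torch.tensor([-1.6, -0.7, -0.9], requires_grad=True)
loss = scst_loss(log_probs, sampled_reward=1.1, greedy_reward=0.9)
loss.backward()                                      # gradients flow into log_probs
```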