2017
DOI: 10.48550/arxiv.1706.09601
Preprint

Actor-Critic Sequence Training for Image Captioning

Abstract: Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence-driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are typically trained by maximising the likelihood of the ground-truth annotated caption given the image. While simple and easy to implement, this approach does not directly maximise the language quality metrics we care about, such as CIDEr. In this paper we investigate training imag…
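The abstract contrasts maximum-likelihood (cross-entropy) training with directly optimising a caption metric. As a point of reference only (a generic formulation, not text taken from the paper), the two objectives for a caption w = (w_1, ..., w_T) of image I can be written as:

```latex
% Cross-entropy (MLE) objective vs. expected-reward (RL) objective;
% w* is the ground-truth caption and r(.) a sentence-level metric such as CIDEr.
\begin{align}
  L_{\mathrm{XE}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\!\left(w_t^{*} \mid w_{1:t-1}^{*},\, I\right) \\
  L_{\mathrm{RL}}(\theta) &= -\,\mathbb{E}_{w \sim p_\theta(\cdot \mid I)}\!\left[\, r(w) \,\right]
\end{align}
```

Minimising L_XE only matches the annotated word sequence, whereas minimising L_RL targets the metric itself, which is the gap the paper's actor-critic training is aimed at.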

Citations: Cited by 31 publications (29 citation statements)
References: 20 publications
“…It assumes that each token makes the same contribution towards the sentence. An Actor-Critic based method [20] was also applied to image captioning, utilizing two networks as the Actor and the Critic respectively. Ren et al. [3] recast image captioning as a decision-making framework and utilized a policy network to choose the next word and a value network to evaluate the policy.…”
Section: Sequential Decision-making (mentioning, confidence: 99%)
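As a concrete illustration of the two-network setup described in the statement above, here is a minimal sketch, assuming a PyTorch-style implementation with a GRU decoder and a pre-extracted CNN image feature. All module names, dimensions, and the <BOS> token id are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of an actor-critic pair for
# caption decoding: the "actor" emits a distribution over the next word and the
# "critic" scores the current decoding state.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, FEAT_DIM = 10000, 256, 512, 2048

class Actor(nn.Module):
    """Policy network: predicts the next-word distribution given the image and prefix."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.init_h = nn.Linear(FEAT_DIM, HIDDEN_DIM)   # image feature -> initial state
        self.rnn = nn.GRUCell(EMBED_DIM, HIDDEN_DIM)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, prev_word, h):
        h = self.rnn(self.embed(prev_word), h)
        return self.out(h), h

class Critic(nn.Module):
    """Value network: estimates the expected future reward of the current state."""
    def __init__(self):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
                                   nn.Linear(HIDDEN_DIM, 1))

    def forward(self, h):
        return self.value(h).squeeze(-1)

# One sampled decoding step: the actor proposes a word, the critic scores the state.
actor, critic = Actor(), Critic()
image_feat = torch.randn(4, FEAT_DIM)            # batch of 4 pre-extracted CNN features
h = torch.tanh(actor.init_h(image_feat))
prev_word = torch.zeros(4, dtype=torch.long)     # <BOS> token id assumed to be 0
logits, h = actor(prev_word, h)
word = torch.distributions.Categorical(logits=logits).sample()
state_value = critic(h)                          # per-example value estimate / baseline
```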
“…An emerging line of work on endowing machines with reasoning is to apply deep reinforcement learning (RL) to the sequence prediction task of image captioning [3], [18], [19], [20]. As illustrated in Figure 1a, we first frame the traditional encoder-decoder image captioning model as a decision-making process, where the visual encoder can be viewed as a Visual Policy (VP) that decides where to hold a gaze in the image, and the language decoder can be viewed as a Language Policy (LP) that decides what the next word is.…”
Section: Introduction (mentioning, confidence: 99%)
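A rough sketch of the decision-making framing quoted above, under the assumption that the visual policy is an attention distribution over pre-extracted region features and the language policy is a softmax over the vocabulary. The weights here are random placeholders rather than a trained model, and the small vocabulary is for illustration only.

```python
# Illustrative one-step view of "where to look" (visual policy) followed by
# "what to say" (language policy); assumed structure, not the cited paper's code.
import torch
import torch.nn.functional as F

NUM_REGIONS, FEAT_DIM, HIDDEN_DIM, VOCAB_SIZE = 36, 2048, 512, 1000

regions = torch.randn(NUM_REGIONS, FEAT_DIM)   # region features from a detector/CNN
h = torch.randn(HIDDEN_DIM)                    # current language-decoder state

# Visual policy: a distribution over image regions, conditioned on the decoder state.
W_att = torch.randn(HIDDEN_DIM, FEAT_DIM)
attn_logits = regions @ (W_att.t() @ h)        # (NUM_REGIONS,)
gaze = F.softmax(attn_logits, dim=0)           # "where to hold a gaze"
context = gaze @ regions                       # attended visual context, (FEAT_DIM,)

# Language policy: a distribution over the vocabulary, conditioned on state + context.
W_out = torch.randn(VOCAB_SIZE, HIDDEN_DIM + FEAT_DIM)
word_logits = W_out @ torch.cat([h, context])
next_word = torch.distributions.Categorical(logits=word_logits).sample()
```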
“…A discriminative reward for each word is not considered under this training strategy. To remedy this, Zhang et al. (2017) used another RNN to predict the state value function for different words. However, the value they computed is not directly related to the evaluation scores, which introduces estimation bias.…”
Section: Related Work (mentioning, confidence: 99%)
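For context, a common per-token advantage formulation of this kind of actor-critic training (a generic form, not necessarily the exact estimator of the cited work) pairs a sentence-level metric reward with a learned state-value baseline:

```latex
% Policy gradient with a learned per-token baseline V_phi(s_t);
% r(w_{1:T}) is the sentence-level metric score of the sampled caption.
\begin{align}
  \nabla_\theta L_{\mathrm{RL}}(\theta)
    &\approx -\sum_{t=1}^{T} \nabla_\theta \log p_\theta\!\left(w_t \mid w_{1:t-1},\, I\right)
       \left( r(w_{1:T}) - V_\phi(s_t) \right) \\
  L_{\mathrm{critic}}(\phi)
    &= \sum_{t=1}^{T} \left( V_\phi(s_t) - r(w_{1:T}) \right)^{2}
\end{align}
```

If V_phi is trained toward a target other than the evaluation metric itself, the baseline no longer tracks the scores being optimised, which is the estimation bias the statement above refers to.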
“…For example, Ranzato et al. [36] defined the reward based on a sequence-level metric (e.g., bilingual evaluation understudy (BLEU) [38] or recall-oriented understudy for gisting evaluation (ROUGE) [39]) that was used as an evaluation metric during the test stage to train the captioning model, thus leading to a notable performance improvement. Similarly, Zhang et al. [40] designed an actor-critic algorithm that formulated a per-token advantage function and value estimation strategy into the reinforcement-learning-based captioning model to directly optimize non-differentiable quality metrics of interest. Rennie et al. [5] proposed a self-critical sequence training approach that normalized the rewards using the output of its own test-time inference algorithm for steadier training.…”
Section: Related Work (mentioning, confidence: 99%)
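To make the self-critical normalization mentioned for Rennie et al. [5] concrete, here is a hedged sketch: the reward of a sampled caption is baselined by the reward of the model's own greedy (test-time) decode, so no separate value network is needed. The function and variable names, and the toy numbers, are placeholders assumed for illustration.

```python
# Self-critical REINFORCE loss: the greedy-decoding reward serves as the baseline.
import torch

def scst_loss(log_probs, sampled_reward, greedy_reward):
    """log_probs:      (T,) log-probabilities of the sampled caption's words
    sampled_reward: scalar metric score (e.g. CIDEr) of the sampled caption
    greedy_reward:  scalar metric score of the greedy caption (the baseline)"""
    advantage = sampled_reward - greedy_reward       # self-critical advantage
    return -(advantage * log_probs.sum())

# Toy usage: a sampled caption scoring above the greedy one gets its
# log-probability pushed up, and vice versa.
log_probs = torch.tensor([-1.6, -0.7, -0.9], requires_grad=True)
loss = scst_loss(log_probs, sampled_reward=1.1, greedy_reward=0.9)
loss.backward()                                      # gradients flow into log_probs
```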