2020
DOI: 10.1609/aaai.v34i03.5655

Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Abstract: Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize…
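The abstract describes the core idea: treat pre-collected, instance-level human ratings as rewards and update the captioner with a policy gradient, entirely off-line. Below is a minimal, hypothetical PyTorch sketch of such an update; the `ToyCaptioner` model, the feature sizes, and the mean-rating baseline are illustrative assumptions, not the authors' implementation.

```python
# Sketch of off-line policy-gradient training from human caption ratings.
# All names (ToyCaptioner, rated data below) are hypothetical.
import torch
import torch.nn as nn

VOCAB, DIM = 1000, 64

class ToyCaptioner(nn.Module):
    """Toy encoder-decoder: an image vector conditions a GRU language model."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(2048, DIM)   # image feature -> initial state
        self.embed = nn.Embedding(VOCAB, DIM)
        self.gru = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def log_prob(self, img_feat, caption):
        """Sum of token log-probabilities of `caption` given the image."""
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)   # (1, B, DIM)
        x = self.embed(caption[:, :-1])                         # teacher inputs
        out, _ = self.gru(x, h0)
        logp = self.out(out).log_softmax(-1)                    # (B, T-1, V)
        tgt = caption[:, 1:].unsqueeze(-1)
        return logp.gather(-1, tgt).squeeze(-1).sum(-1)         # (B,)

model = ToyCaptioner()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Off-line data: (image feature, previously rated caption, rating in [0, 1]).
img = torch.randn(4, 2048)
caps = torch.randint(0, VOCAB, (4, 12))
ratings = torch.tensor([0.9, 0.2, 0.6, 0.4])

# REINFORCE-style step: ratings act as fixed rewards; subtracting the
# mean rating is a simple variance-reducing baseline.
advantage = ratings - ratings.mean()
loss = -(advantage * model.log_prob(img, caps)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Because the captions and their ratings are fixed, nothing is sampled from the current policy; practical off-line policy gradient methods usually add an off-policy correction or a regularizer toward the maximum-likelihood model, which this sketch omits.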

Cited by 17 publications (4 citation statements)
References 15 publications
“…For example, Neural Image Captioning (NIC) [4] treated the image captioning task as a translation problem from vision to text, and it was the first image captioning model to exploit the encoder-decoder paradigm. After that, attention mechanisms [7,21,22], training strategies based on reinforcement learning [23–25], and large-scale pre-trained vision-language models [26,27] enriched deep-learning-based captioning methods.…”
Section: Image Captioning (mentioning)
confidence: 99%
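The attention mechanisms this quote refers to score each image region against the decoder's current state and pool the regions into a context vector for the next word. A self-contained sketch of Bahdanau-style additive attention follows; the dimensions and class name are illustrative assumptions, not taken from any cited model.

```python
# Sketch of additive (Bahdanau-style) visual attention over image regions.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Score each region against the decoder state; pool by the weights."""
    def __init__(self, region_dim, state_dim, attn_dim):
        super().__init__()
        self.w_r = nn.Linear(region_dim, attn_dim)
        self.w_s = nn.Linear(state_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions, state):
        # regions: (B, R, region_dim); state: (B, state_dim)
        scores = self.v(torch.tanh(self.w_r(regions) +
                                   self.w_s(state).unsqueeze(1)))  # (B, R, 1)
        weights = scores.softmax(dim=1)            # distribution over regions
        context = (weights * regions).sum(dim=1)   # (B, region_dim)
        return context, weights.squeeze(-1)

attn = AdditiveAttention(region_dim=2048, state_dim=512, attn_dim=256)
regions = torch.randn(2, 49, 2048)   # e.g. a 7x7 CNN grid, flattened
state = torch.randn(2, 512)          # current decoder hidden state
context, weights = attn(regions, state)  # context conditions the next word
```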
“…The easy-to-hard spirit of CL (Bengio et al., 2009) is inspired by the way humans learn. This human-like training paradigm has been exploited extensively in different vision-language tasks (Seo et al., 2020; Zheng et al., 2022; Yao et al., 2021). Recent works (Lao et al., 2021; Pan et al., 2022) introduced CL into VQA to reduce language biases, either by helping VQA models gradually focus on more biased samples (Lao et al., 2021) or by gradually increasing the importance of visual features during training (Pan et al., 2022).…”
Section: Related Work (mentioning)
confidence: 99%
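The "easy-to-hard" scheme in this quote can be sketched as a scheduler that sorts samples by a difficulty score and gradually widens the training pool. The difficulty measure below (caption length) and the linear schedule are stand-in assumptions; published curriculum learning methods use model- or task-specific scores.

```python
# Sketch of an easy-to-hard curriculum schedule over a dataset.
def curriculum_pools(samples, difficulty, epochs, start_frac=0.3):
    """Yield, per epoch, a growing easiest-first prefix of the data."""
    ranked = sorted(samples, key=difficulty)  # easy samples first
    for epoch in range(epochs):
        # Linearly widen the pool from start_frac to the full dataset.
        frac = start_frac + (1.0 - start_frac) * epoch / max(epochs - 1, 1)
        yield epoch, ranked[: max(1, int(len(ranked) * frac))]

data = ["a dog", "a dog on grass", "two dogs playing with a red ball"]
for epoch, pool in curriculum_pools(data, difficulty=len, epochs=3):
    print(epoch, pool)  # each epoch trains on a slightly harder pool
```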
“…Existing image captioning approaches typically follow the encoder-decoder architecture (Xu et al. 2015; Huang et al. 2019; Guo et al. 2020; Cornia et al. 2020; Zhao, Wu, and Zhang 2020; Seo et al. 2020), which takes an image as input and generates a description in the form of natural language. Earlier works (Xu et al. 2015; Lu et al. 2017; Jiang et al. 2020) apply grid-based features as input to generate captions, which are fixed-size patch features extracted from a CNN (He et al. 2016; ?)…”
Section: Related Work (mentioning)
confidence: 99%
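The grid-based features this quote mentions are commonly taken from the last convolutional map of a CNN such as ResNet (He et al. 2016) and flattened into a sequence of region vectors. A minimal torchvision sketch, under the assumption of a ResNet-50 backbone and 224x224 inputs:

```python
# Sketch of extracting a flattened grid of CNN features for a captioner.
import torch
from torchvision.models import resnet50

backbone = resnet50(weights=None)      # pass pretrained weights for real use
# Keep everything up to (but excluding) global pooling and the FC head.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(2, 3, 224, 224)   # batch of RGB images
with torch.no_grad():
    fmap = feature_extractor(images)   # (2, 2048, 7, 7) final conv map
grid = fmap.flatten(2).transpose(1, 2) # (2, 49, 2048): 49 region vectors
```

Each of the 49 vectors then plays the role of one "region" fed to the decoder, e.g. as the `regions` tensor in the attention sketch above.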