2020
DOI: 10.48550/arxiv.2009.14405
Preprint

Teacher-Critical Training Strategies for Image Captioning

Yiqing Huang,
Jiansheng Chen

Abstract: Existing image captioning models are usually trained by cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, the widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth caption and the caption model by generating some easier-to-learn…
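The abstract's idea of supplementing hard ground-truth targets with teacher-generated soft targets during XE training can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch-style example (the function name, the `alpha` interpolation weight, and the tensor shapes are assumptions for illustration, not the formulation used in the paper):

```python
import torch
import torch.nn.functional as F

def soft_target_xe_loss(student_logits, teacher_logits, gt_tokens, alpha=0.5):
    """Blend hard-target cross-entropy with alignment to teacher soft labels.

    student_logits, teacher_logits: (batch, seq_len, vocab) unnormalized scores
    gt_tokens: (batch, seq_len) ground-truth word indices
    alpha: interpolation weight between hard and soft targets (assumed here)
    """
    # Standard cross-entropy against the ground-truth (hard) targets.
    hard_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        gt_tokens.reshape(-1),
    )
    # KL divergence pulling the student toward the teacher's softer,
    # easier-to-learn word distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```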


Cited by 2 publications (3 citation statements) | References 20 publications
“…This was later combined with a finetuning phase based on the application of the REINFORCE algorithm, to allow using as optimization objectives captioning metrics directly [9], [23], overcoming the issue of their nondifferentiability and boosting the final performance. As a strategy to improve both training phases, in [51] it is proposed to exploit a teacher model trained on image attributes to generate additional supervision signals for the captioning model. These are in the form of soft-labels, which the captioning model has to align with in the cross-entropy phase, and reweighting of the caption words to guide the fine-tuning phase.…”
Section: Related Work
confidence: 99%
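The REINFORCE-based fine-tuning described in the statement above is commonly implemented in a self-critical fashion, where a sampled caption's metric score is compared against a baseline caption's score. The sketch below assumes the sampled captions, their log-probabilities, and the CIDEr-style rewards are computed elsewhere; it shows only the policy-gradient loss and is not the exact reward assignment used in the cited work:

```python
import torch

def reinforce_loss(log_probs, sampled_reward, baseline_reward):
    """Policy-gradient loss for a sampled caption with a self-critical baseline.

    log_probs:       (batch, seq_len) log-probabilities of the sampled words
    sampled_reward:  (batch,) metric score (e.g. CIDEr) of the sampled caption
    baseline_reward: (batch,) metric score of a baseline (e.g. greedy) caption
    """
    # Advantage: how much better the sampled caption scored than the baseline.
    advantage = (sampled_reward - baseline_reward).unsqueeze(1)  # (batch, 1)
    # REINFORCE: weight each word's log-probability by the (detached) advantage,
    # so gradients increase the likelihood of above-baseline captions.
    return -(advantage.detach() * log_probs).sum(dim=1).mean()
```

Because the same sentence-level advantage is spread uniformly over every word, all words in a caption receive identical credit, which is the reward-assignment issue the teacher-based re-weighting targets.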
“…GCN-LSTM [34], SGAE [35], and MT [36]) or self-attention (i.e. AoANet [7], X-LAN [12], DPA [8], and TCTS [51]), and captioning architectures entirely based on the Transformer network such as ORT [11], M 2 Transformer [13], X-Transformer [12], CPTR [37], DLCT [14], and RSTNet [44].…”
Section: Comparison With the State Of The Art
confidence: 99%
“…This was later combined with a fine-tuning phase based on the application of the REINFORCE method, to allow use as optimization objectives captioning metrics directly [36,46], boosting the final performance. As a strategy to improve both training phases, in [25] it is proposed to exploit a teacher model trained on image attributes to generate additional supervision signals for the captioning model. These are in the form of soft labels, which the captioning model has to align within the cross-entropy phase, and re-weighting of the caption words to guide the fine-tuning phase.…”
Section: Related Work
confidence: 99%
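The per-word re-weighting of the fine-tuning phase mentioned in these statements can be sketched by scaling each word's policy-gradient term with a teacher-supplied weight instead of spreading the sentence-level reward uniformly. The `word_weights` tensor below is a placeholder for whatever word-importance signal the teacher provides; the exact weighting scheme in the cited work may differ:

```python
import torch

def reweighted_reinforce_loss(log_probs, sentence_reward, baseline_reward, word_weights):
    """REINFORCE loss in which each word's contribution is rescaled by a teacher weight.

    log_probs:       (batch, seq_len) log-probabilities of the sampled words
    sentence_reward: (batch,) metric score of the sampled caption
    baseline_reward: (batch,) metric score of the baseline caption
    word_weights:    (batch, seq_len) per-word weights from the teacher (assumed input)
    """
    advantage = (sentence_reward - baseline_reward).unsqueeze(1)   # (batch, 1)
    # Scale each word's policy-gradient term by its teacher-assigned weight,
    # so words the teacher deems important receive more of the reward signal.
    per_word_term = word_weights * advantage.detach() * log_probs  # (batch, seq_len)
    return -per_word_term.sum(dim=1).mean()
```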