2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00998

Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs

Cited by 204 publications (114 citation statements). References 32 publications.
“…Different from image captioning (Vinyals et al., 2015; Lu et al., 2017; Anderson et al., 2018; Liu et al., 2020), which processes a static image with details of almost every object that appears, video captioning considers a sequence of frames and is biased towards the focused objects. It is worth noting that controllable image captioning has been explored recently (Cornia et al., 2019; Chen et al., 2020; Zheng et al., 2019). However, all of these methods are based on autoregressive decoding, i.e., conditioning each word on the previously generated outputs.…”
Section: Controllable Image Captioning
confidence: 99%
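The autoregressive decoding this excerpt refers to can be made concrete with a short sketch. Below is a minimal greedy decoder in Python/PyTorch; the `decoder(image_features, prefix)` interface, the token ids, and the feature tensor are assumptions for illustration, not the interface of any cited model.

```python
import torch

def greedy_decode(decoder, image_features, bos_id, eos_id, max_len=20):
    """Greedy autoregressive decoding: each next word is conditioned on
    the image and on all previously generated words, as the excerpt notes.

    Assumed interface: decoder(image_features, prefix) -> next-word logits
    of shape (1, vocab_size).
    """
    tokens = [bos_id]
    for _ in range(max_len):
        prefix = torch.tensor(tokens).unsqueeze(0)   # (1, t) word ids so far
        logits = decoder(image_features, prefix)     # (1, vocab_size)
        next_id = int(logits.argmax(dim=-1))         # greedy next-word choice
        tokens.append(next_id)
        if next_id == eos_id:                        # stop at the end token
            break
    return tokens[1:]                                # caption word ids
```

Because each step feeds the growing prefix back into the decoder, generation is inherently sequential, which is exactly the property the excerpt contrasts with non-autoregressive approaches.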
“…There have been many efforts in understanding vision and language, which focus on making a connection between visual and linguistic information. Various applications need such a connection to realize tagging [7][8][9][10], retrieval [11], captioning [12][13][14], and visual question answering [15][16][17][18].…”
Section: Vision and Language Understanding
confidence: 99%
“…Then, an RNN iteratively transforms each feature vector into words, resulting in a caption. Chen et al. use a graph structure to describe objects, attributes, and relationships in images [14]. The graph structure facilitates making connections to text.…”
Section: Vision and Language Understanding
confidence: 99%
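To make the graph structure in this excerpt concrete, here is a minimal sketch of object, attribute, and relationship nodes in Python. All names and example values are illustrative; note that in the cited paper the abstract scene graph is specified by node roles rather than fixed semantic labels.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectNode:
    """An object in the image, with attribute nodes attached to it."""
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class RelationNode:
    """A relationship node connecting two object nodes."""
    subject: ObjectNode
    predicate: str
    obj: ObjectNode

# Illustrative graph for "a black dog on a red skateboard":
dog = ObjectNode("dog", attributes=["black"])
board = ObjectNode("skateboard", attributes=["red"])
graph = [RelationNode(dog, "on", board)]
```

Each node type maps naturally onto a caption fragment (object noun, modifying adjective, relational phrase), which is what makes the connection to text straightforward.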
“…Huang et al. [20,21] added an attention mechanism to image captioning. Chen et al. [22,23] processed natural language in video captioning, focusing on image objects. Most strategies in the last two years, such as the dual-stream recurrent neural network, the object relational graph (ORG) with teacher-recommended learning (TRL), and the spatio-temporal graph with knowledge distillation (STG-KD) [24][25][26], are optimized with features of video frames.…”
Section: Literature Review
confidence: 99%
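The attention mechanism mentioned in this excerpt is typically some variant of additive attention over image region features. A minimal sketch, assuming Bahdanau-style additive attention and illustrative tensor shapes (a generic form, not the exact mechanism of [20,21]):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over image region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, n_regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(
            self.w_feat(regions) + self.w_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, n_regions)
        weights = torch.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * regions).sum(dim=1)  # weighted sum
        return context, weights
```

At each decoding step the decoder's hidden state re-weights the region features, so each generated word can attend to a different part of the image.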