2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00998

Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs

Cited by 204 publications (114 citation statements). References 32 publications.
“…Different from image captioning (Vinyals et al., 2015; Lu et al., 2017; Anderson et al., 2018; Liu et al., 2020), which processes a static image with details of almost every object that appears, video captioning considers a sequence of frames and is biased towards the focused objects. It is worth noting that controllable image captioning has been explored recently (Cornia et al., 2019; Chen et al., 2020; Zheng et al., 2019). However, all of these methods are based on autoregressive decoding, i.e., conditioning each word on the previously generated outputs.…”
Section: Controllable Image Captioning
confidence: 99%
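The autoregressive decoding this excerpt refers to can be made concrete with a short sketch. Below is a minimal greedy decoder in Python/PyTorch; the `decoder(image_features, prefix)` interface, the token ids, and the feature tensor are assumptions for illustration, not the interface of any cited model.

```python
import torch

def greedy_decode(decoder, image_features, bos_id, eos_id, max_len=20):
    """Greedy autoregressive decoding: each next word is conditioned on
    the image and on all previously generated words, as the excerpt notes.

    Assumed interface: decoder(image_features, prefix) -> next-word logits
    of shape (1, vocab_size).
    """
    tokens = [bos_id]
    for _ in range(max_len):
        prefix = torch.tensor(tokens).unsqueeze(0)   # (1, t) word ids so far
        logits = decoder(image_features, prefix)     # (1, vocab_size)
        next_id = int(logits.argmax(dim=-1))         # greedy next-word choice
        tokens.append(next_id)
        if next_id == eos_id:                        # stop at the end token
            break
    return tokens[1:]                                # caption word ids
```

Because each step feeds the growing prefix back into the decoder, generation is inherently sequential, which is exactly the property the excerpt contrasts with non-autoregressive approaches.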
“…There have been many efforts in understanding vision and language, which focus on making a connection between visual and linguistic information. Various applications need such a connection to realize tagging [7][8][9][10], retrieval [11], captioning [12][13][14], and visual question answering [15][16][17][18].…”
Section: Vision and Language Understanding
confidence: 99%
“…Then, an RNN iteratively transforms each feature vector into words, resulting in a caption. Chen et al. use a graph structure to describe objects, attributes, and relationships in images [14]. The graph structure facilitates making connections to text.…”
Section: Vision and Language Understanding
confidence: 99%
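To make the graph structure in this excerpt concrete, here is a minimal sketch of object, attribute, and relationship nodes in Python. All names and example values are illustrative; note that in the cited paper the abstract scene graph is specified by node roles rather than fixed semantic labels.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectNode:
    """An object in the image, with attribute nodes attached to it."""
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class RelationNode:
    """A relationship node connecting two object nodes."""
    subject: ObjectNode
    predicate: str
    obj: ObjectNode

# Illustrative graph for "a black dog on a red skateboard":
dog = ObjectNode("dog", attributes=["black"])
board = ObjectNode("skateboard", attributes=["red"])
graph = [RelationNode(dog, "on", board)]
```

Each node type maps naturally onto a caption fragment (object noun, modifying adjective, relational phrase), which is what makes the connection to text straightforward.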
“…Huang et al. [20,21] added an attention mechanism to image captioning. Chen et al. [22,23] processed natural language in video captioning, focusing on image objects. Most strategies in the last two years, such as the dual-stream recurrent neural network, the object relational graph (ORG) with teacher-recommended learning (TRL), and the spatio-temporal graph with knowledge distillation (STG-KD) [24][25][26], are optimized with features of video frames.…”
Section: Literature Review
confidence: 99%
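The attention mechanism mentioned in this excerpt is typically some variant of additive attention over image region features. A minimal sketch, assuming Bahdanau-style additive attention and illustrative tensor shapes (a generic form, not the exact mechanism of [20,21]):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over image region features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, n_regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(
            self.w_feat(regions) + self.w_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, n_regions)
        weights = torch.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * regions).sum(dim=1)  # weighted sum
        return context, weights
```

At each decoding step the decoder's hidden state re-weights the region features, so each generated word can attend to a different part of the image.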