2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.127

Semantic Compositional Networks for Visual Captioning

Abstract: A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition …
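To make the weight-composition idea in the abstract concrete, here is a minimal NumPy sketch: each LSTM weight matrix is replaced by a probability-weighted mixture of tag-dependent matrices. This is an illustration under assumed names and sizes, not the paper's exact formulation (which factorizes the tag-dependent tensor).

```python
import numpy as np

def compose_weight(tag_probs, weight_ensemble):
    """Mix an ensemble of tag-dependent weight matrices into one LSTM
    weight matrix, weighting each member by its image-dependent tag
    probability (the composition idea described in the abstract)."""
    # tag_probs:        (K,)             probabilities of K semantic tags
    # weight_ensemble:  (K, d_out, d_in) one weight matrix per tag
    return np.tensordot(tag_probs, weight_ensemble, axes=1)  # (d_out, d_in)

# Illustrative usage: 3 detected tags, a gate mapping 5 inputs to 4 units.
tag_probs = np.array([0.90, 0.20, 0.05])         # e.g. p(dog), p(grass), p(car)
ensemble  = np.random.randn(3, 4, 5)             # tag-dependent ensemble
W_image   = compose_weight(tag_probs, ensemble)  # composed weight for this image
```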

Cited by 408 publications (274 citation statements)
References 41 publications

“…AoA generates an information vector and an attention gate using the attention result and the attention query, and adds another attention by applying the gate to the information and obtains the attended information. [42,27,47,2,7,16,15] and achieved impressive results. In such a framework for image captioning, an image is first encoded to a set of feature vectors via a CNN based network and then decoded to words via an RNN based network, where the attention mechanism guides the decoding process by generating a weighted average over the extracted feature vectors for each time step.…”
Section: Introduction (mentioning)
confidence: 98%
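The framework this excerpt describes (CNN encoder, RNN decoder, per-step attention, plus AoA's information vector and gate) can be sketched in a few lines. The sketch below is a simplified single-query illustration with made-up parameter names, not the cited papers' exact multi-head formulation.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, feats, W_q, W_f, w_a):
    """Soft attention: a weighted average over CNN feature vectors,
    recomputed at every decoding time step.
    query: (d,) decoder state; feats: (N, d) image region features."""
    scores = torch.tanh(feats @ W_f + query @ W_q) @ w_a  # (N,)
    alpha = F.softmax(scores, dim=0)                      # attention weights
    return alpha @ feats                                  # (d,) attended vector

def attention_on_attention(query, attended, W_i, W_g):
    """AoA-style step: form an information vector and a sigmoid gate from
    the attention result and the query, then gate the information."""
    x = torch.cat([attended, query])       # (2d,)
    info = x @ W_i                         # information vector, (d,)
    gate = torch.sigmoid(x @ W_g)          # attention gate, (d,)
    return gate * info                     # attended information, (d,)

# Illustrative shapes: d = 8 hidden units, N = 4 image regions.
d, N = 8, 4
q, regions = torch.randn(d), torch.randn(N, d)
W_q, W_f, w_a = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
W_i, W_g = torch.randn(2 * d, d), torch.randn(2 * d, d)
ctx = attention_on_attention(q, soft_attention(q, regions, W_q, W_f, w_a), W_i, W_g)
```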
“…We acknowledge that more recent papers (Xu et al, 2017;Rennie et al, 2017;Yao et al, 2017;Lu et al, 2017;Gan et al, 2017) reported better performance on the task of image captioning. Performance improvements in these more recent models are mainly due to using better image features such as those obtained by Region-based Convolutional Neural Networks (R-CNN), or using reinforcement learning (RL) to directly optimize metrics such as CIDEr, or using more complex attention mechanisms (Gan et al, 2017) to provide a better context vector for caption generation, or using an ensemble of multiple LSTMs, among others. However, the LSTM is still playing a core role in these works and we believe improvement over the core LSTM, in both performance and interpretability, is still very valuable; that is why we compare the proposed TPGN with a state-of-the-art native LSTM (the second line of Table 5.2).…”
Section: Discussion (mentioning)
confidence: 74%
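As a pointer to what "using reinforcement learning (RL) to directly optimize metrics such as CIDEr" refers to, the following is a minimal sketch of the self-critical policy-gradient loss (Rennie et al., 2017). The CIDEr scorer that would supply the rewards is not shown, and all names are illustrative.

```python
import torch

def self_critical_loss(sample_logprobs, sample_cider, greedy_cider):
    """REINFORCE-with-baseline objective for optimizing a caption metric
    (e.g. CIDEr) directly: the greedy caption's score acts as the baseline.
    sample_logprobs: (T,) log-probs of the sampled caption's T words."""
    advantage = sample_cider - greedy_cider
    return -advantage * sample_logprobs.sum()

# Illustrative call with dummy values.
loss = self_critical_loss(torch.log(torch.rand(12)),
                          sample_cider=0.95, greedy_cider=0.88)
```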
“…where v ∈ ℝ^2048 is the vector of visual features extracted from the current image by ResNet (Gan et al., 2017) and v̄ is the mean of all such vectors; C_s ∈ ℝ^((d×d)×2048). On the output side, x_t ∈ ℝ^V is a 1-hot vector with dimension equal to the size of the caption vocabulary, V, and W_e ∈ ℝ^(d×V) is a word embedding matrix, the i-th column of which is the embedding vector of the i-th word in the vocabulary; it is obtained by the Stanford GloVe algorithm with zero mean (Pennington et al., 2017).…”
Section: System Description (mentioning)
confidence: 99%
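The tensor shapes quoted above can be checked with a short NumPy sketch. How the cited system actually combines these quantities (and where the mean vector v̄ enters) is not specified in the excerpt, so the lines below only illustrate the dimensions; d and V are illustrative.

```python
import numpy as np

d, V = 512, 10000                     # assumed hidden and vocabulary sizes

v     = np.random.randn(2048)         # ResNet visual features of one image
v_bar = np.random.randn(2048)         # stand-in for the mean feature vector
C_s   = np.random.randn(d * d, 2048)  # C_s in R^((d x d) x 2048), flattened

S = (C_s @ v).reshape(d, d)           # image-conditioned d x d matrix

W_e = np.random.randn(d, V)           # word-embedding matrix (columns = words)
x_t = np.zeros(V); x_t[7] = 1.0       # 1-hot vector for the t-th caption word
e_t = W_e @ x_t                       # its d-dimensional embedding
```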
“…In this subsection, we compare our method with the state-of-the-art methods with multiple features on benchmark datasets, including SA [53], M3 [47], v2t navigator [18], Aalto [36], VideoLab [31], MA-LSTM [51], M&M-TGM [4], PickNet [8], LSTM-TSA IV [28], SibNet [23], MGSA [5], and SCN-LSTM [14], most of which fuse different features by simply concatenating.…”
Section: Performance Comparisons (mentioning)
confidence: 99%