2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.127

Semantic Compositional Networks for Visual Captioning

Abstract: A Semantic Compositional Network (SCN) is developed for image captioning, in which semantic concepts (i.e., tags) are detected from the image, and the probability of each tag is used to compose the parameters in a long short-term memory (LSTM) network. The SCN extends each weight matrix of the LSTM to an ensemble of tag-dependent weight matrices. The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag. In addition …
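To make the weight-composition idea in the abstract concrete, here is a minimal NumPy sketch: each LSTM weight matrix is replaced by a probability-weighted mixture of tag-dependent matrices. This is an illustration under assumed names and sizes, not the paper's exact formulation (which factorizes the tag-dependent tensor).

```python
import numpy as np

def compose_weight(tag_probs, weight_ensemble):
    """Mix an ensemble of tag-dependent weight matrices into one LSTM
    weight matrix, weighting each member by its image-dependent tag
    probability (the composition idea described in the abstract)."""
    # tag_probs:        (K,)             probabilities of K semantic tags
    # weight_ensemble:  (K, d_out, d_in) one weight matrix per tag
    return np.tensordot(tag_probs, weight_ensemble, axes=1)  # (d_out, d_in)

# Illustrative usage: 3 detected tags, a gate mapping 5 inputs to 4 units.
tag_probs = np.array([0.90, 0.20, 0.05])         # e.g. p(dog), p(grass), p(car)
ensemble  = np.random.randn(3, 4, 5)             # tag-dependent ensemble
W_image   = compose_weight(tag_probs, ensemble)  # composed weight for this image
```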

Cited by 408 publications (274 citation statements)
References 41 publications

“…AoA generates an information vector and an attention gate using the attention result and the attention query, and adds another attention by applying the gate to the information and obtains the attended information. [42,27,47,2,7,16,15] and achieved impressive results. In such a framework for image captioning, an image is first encoded to a set of feature vectors via a CNN based network and then decoded to words via an RNN based network, where the attention mechanism guides the decoding process by generating a weighted average over the extracted feature vectors for each time step.…”
Section: Introduction (mentioning)
confidence: 98%
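The framework this excerpt describes (CNN encoder, RNN decoder, per-step attention, plus AoA's information vector and gate) can be sketched in a few lines. The sketch below is a simplified single-query illustration with made-up parameter names, not the cited papers' exact multi-head formulation.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, feats, W_q, W_f, w_a):
    """Soft attention: a weighted average over CNN feature vectors,
    recomputed at every decoding time step.
    query: (d,) decoder state; feats: (N, d) image region features."""
    scores = torch.tanh(feats @ W_f + query @ W_q) @ w_a  # (N,)
    alpha = F.softmax(scores, dim=0)                      # attention weights
    return alpha @ feats                                  # (d,) attended vector

def attention_on_attention(query, attended, W_i, W_g):
    """AoA-style step: form an information vector and a sigmoid gate from
    the attention result and the query, then gate the information."""
    x = torch.cat([attended, query])       # (2d,)
    info = x @ W_i                         # information vector, (d,)
    gate = torch.sigmoid(x @ W_g)          # attention gate, (d,)
    return gate * info                     # attended information, (d,)

# Illustrative shapes: d = 8 hidden units, N = 4 image regions.
d, N = 8, 4
q, regions = torch.randn(d), torch.randn(N, d)
W_q, W_f, w_a = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
W_i, W_g = torch.randn(2 * d, d), torch.randn(2 * d, d)
ctx = attention_on_attention(q, soft_attention(q, regions, W_q, W_f, w_a), W_i, W_g)
```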
“…We acknowledge that more recent papers (Xu et al, 2017;Rennie et al, 2017;Yao et al, 2017;Lu et al, 2017;Gan et al, 2017) reported better performance on the task of image captioning. Performance improvements in these more recent models are mainly due to using better image features such as those obtained by Region-based Convolutional Neural Networks (R-CNN), or using reinforcement learning (RL) to directly optimize metrics such as CIDEr, or using more complex attention mechanisms (Gan et al, 2017) to provide a better context vector for caption generation, or using an ensemble of multiple LSTMs, among others. However, the LSTM is still playing a core role in these works and we believe improvement over the core LSTM, in both performance and interpretability, is still very valuable; that is why we compare the proposed TPGN with a state-of-the-art native LSTM (the second line of Table 5.2).…”
Section: Discussion (mentioning)
confidence: 74%
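As a pointer to what "using reinforcement learning (RL) to directly optimize metrics such as CIDEr" refers to, the following is a minimal sketch of the self-critical policy-gradient loss (Rennie et al., 2017). The CIDEr scorer that would supply the rewards is not shown, and all names are illustrative.

```python
import torch

def self_critical_loss(sample_logprobs, sample_cider, greedy_cider):
    """REINFORCE-with-baseline objective for optimizing a caption metric
    (e.g. CIDEr) directly: the greedy caption's score acts as the baseline.
    sample_logprobs: (T,) log-probs of the sampled caption's T words."""
    advantage = sample_cider - greedy_cider
    return -advantage * sample_logprobs.sum()

# Illustrative call with dummy values.
loss = self_critical_loss(torch.log(torch.rand(12)),
                          sample_cider=0.95, greedy_cider=0.88)
```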
“…where v ∈ ℝ^2048 is the vector of visual features extracted from the current image by ResNet (Gan et al., 2017) and v̄ is the mean of all such vectors; C_s ∈ ℝ^((d×d)×2048). On the output side, x_t ∈ ℝ^V is a 1-hot vector with dimension equal to the size of the caption vocabulary, V, and W_e ∈ ℝ^(d×V) is a word embedding matrix, the i-th column of which is the embedding vector of the i-th word in the vocabulary; it is obtained by the Stanford GloVe algorithm with zero mean (Pennington et al., 2017).…”
Section: System Description (mentioning)
confidence: 99%
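The tensor shapes quoted above can be checked with a short NumPy sketch. How the cited system actually combines these quantities (and where the mean vector v̄ enters) is not specified in the excerpt, so the lines below only illustrate the dimensions; d and V are illustrative.

```python
import numpy as np

d, V = 512, 10000                     # assumed hidden and vocabulary sizes

v     = np.random.randn(2048)         # ResNet visual features of one image
v_bar = np.random.randn(2048)         # stand-in for the mean feature vector
C_s   = np.random.randn(d * d, 2048)  # C_s in R^((d x d) x 2048), flattened

S = (C_s @ v).reshape(d, d)           # image-conditioned d x d matrix

W_e = np.random.randn(d, V)           # word-embedding matrix (columns = words)
x_t = np.zeros(V); x_t[7] = 1.0       # 1-hot vector for the t-th caption word
e_t = W_e @ x_t                       # its d-dimensional embedding
```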
“…In this subsection, we compare our method with the state-of-the-art methods with multiple features on benchmark datasets, including SA [53], M3 [47], v2t navigator [18], Aalto [36], VideoLab [31], MA-LSTM [51], M&M-TGM [4], PickNet [8], LSTM-TSA IV [28], SibNet [23], MGSA [5], and SCN-LSTM [14], most of which fuse different features by simply concatenating.…”
Section: Performance Comparisons (mentioning)
confidence: 99%