2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00473

Attention on Attention for Image Captioning

Abstract: Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average over the encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which can lead the decoder to produce misleading results. In this paper, we propose an "Attention on Attention" (AoA) module, which extends the conventional attention mechanisms to determine…
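In equation form, the AoA operation described here can be sketched as follows (a hedged reconstruction from the module description quoted in the citation statements below; the weight names W_i, W_g, b_i, b_g are illustrative, not taken from the paper):

    v̂ = f_att(Q, K, V)                                      (conventional attention result)
    AoA(Q, K, V) = σ(W_g [v̂ ; Q] + b_g) ⊙ (W_i [v̂ ; Q] + b_i)

where [ · ; · ] denotes concatenation, the σ branch produces the sigmoid attention gate, and the W_i branch produces the information vector.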

Cited by 737 publications (589 citation statements)
References 39 publications
“…The Up-Down [7] method proposed a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of salient image regions. The AoANet [9] method introduced an extension of the attention operator in which the final attended information is weighted by a gate. Our work is developed on top of these methods.…”
Section: Quantitative Results
Confidence: 99%
“…AoANet model: This model is proposed in AoANet [9], where the results of self-attention and the initial query are concatenated and fed into two linear layers; the resulting information vector is multiplied by a sigmoid gate. The final result is used in place of the original self-attention output.…”
Section: Baseline Methods
Confidence: 99%
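For concreteness, here is a minimal PyTorch sketch of such a gated "attention on attention" block as the excerpt describes it. This is an illustrative reconstruction, not the authors' released implementation; the class and attribute names (AoA, info, gate) are invented for this sketch.

    import torch
    import torch.nn as nn

    class AoA(nn.Module):
        """Attention on Attention (illustrative sketch): concatenate the
        attended vector with the original query, then multiply an
        information vector by a sigmoid gate, each produced by a linear
        layer over the concatenation."""
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.info = nn.Linear(2 * dim, dim)  # "information vector" branch
            self.gate = nn.Linear(2 * dim, dim)  # sigmoid "attention gate" branch

        def forward(self, query, key, value):
            attended, _ = self.attn(query, key, value)  # conventional attention
            cat = torch.cat([attended, query], dim=-1)  # [B, T, 2*dim]
            return torch.sigmoid(self.gate(cat)) * self.info(cat)

    # Self-attention case (query == key == value), e.g. refining region features:
    x = torch.randn(2, 10, 512)   # 2 images, 10 region features, dim 512
    out = AoA(512)(x, x, x)
    print(out.shape)              # torch.Size([2, 10, 512])

As the excerpt notes, the gated output then stands in wherever the plain self-attention result would otherwise be used.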