2019
DOI: 10.1007/978-3-030-20870-7_2

Gated Hierarchical Attention for Image Captioning

Abstract: Attention modules connecting encoders and decoders have been widely applied in object recognition, image captioning, visual question answering, and neural machine translation, and they significantly improve performance. In this paper, we propose a bottom-up gated hierarchical attention (GHA) mechanism for image captioning. Our proposed model employs a CNN as the decoder, which is able to learn different concepts at different layers, and these different concepts correspond to different areas of a…
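
To make the mechanism concrete, below is a minimal sketch of a bottom-up gated attention block in PyTorch. It assumes two flattened convolutional feature maps (one low-level, one high-level) already projected to a shared dimension; the module names, tensor shapes, and the exact fusion rule (a sigmoid gate scaling the low-level context before adding it to the high-level one) are illustrative assumptions, not the paper's reported implementation.

```python
# Hedged sketch of a gated hierarchical attention (GHA) block.
# Assumes PyTorch; shapes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHierarchicalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Separate additive-attention scorers for each feature level.
        self.low_score = nn.Linear(feat_dim + hidden_dim, 1)
        self.high_score = nn.Linear(feat_dim + hidden_dim, 1)
        # Gate deciding how much low-level context flows upward.
        self.gate = nn.Linear(feat_dim + hidden_dim, feat_dim)

    def _attend(self, scorer, feats, h):
        # feats: (B, N, D) flattened spatial features; h: (B, H) decoder state.
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = scorer(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                        # (B, N)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # (B, D)

    def forward(self, low_feats, high_feats, h):
        ctx_low = self._attend(self.low_score, low_feats, h)
        ctx_high = self._attend(self.high_score, high_feats, h)
        # Bottom-up gating: a sigmoid gate modulates the low-level
        # context before it is fused with the high-level context.
        g = torch.sigmoid(self.gate(torch.cat([ctx_low, h], dim=-1)))
        return ctx_high + g * ctx_low
```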

Cited by 13 publications (6 citation statements)
References 39 publications

“…In [19], a sentinel gate decides whether the visual feature or the semantic feature should be used for prediction. Instead of employing LSTM decoders for sentences, [2,31,32] apply convolutional decoders, which achieve a faster training process and comparable results.…”
Section: Related Work
confidence: 99%
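
As a concrete illustration of the sentinel-gate idea attributed to [19], here is a hedged PyTorch sketch: a learned scalar gate blends an attended visual context with a sentinel (semantic) vector. The class and variable names are assumptions for illustration, not the cited paper's code.

```python
# Hedged sketch of a sentinel gate: beta near 1 relies on the sentinel
# (semantic cue), beta near 0 relies on the attended visual context.
import torch
import torch.nn as nn

class SentinelGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, visual_ctx, sentinel, h):
        # visual_ctx, sentinel, h: (B, dim) each (an assumed layout).
        beta = torch.sigmoid(self.score(torch.cat([sentinel, h], dim=-1)))
        return beta * sentinel + (1 - beta) * visual_ctx
```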
“…For instance, Anderson et al. [1] propose bottom-up features, which are extracted by a pre-trained Faster-RCNN [23], together with a top-down attention LSTM that attends to one object at each step when predicting captions. Apart from using RNNs as the language decoder, Aneja et al. [2] and Wang and Chan [33,34] utilize CNNs, since LSTMs cannot be trained in a parallel manner. Cornia et al. [6] and Li et al. [15] adopt transformer-based networks with multi-head attention to generate captions, which mitigates the long-term dependency problem of LSTMs and significantly improves image captioning performance.…”
Section: Related Work
confidence: 99%
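
The transformer-style captioners mentioned above replace recurrence with multi-head attention over region features, so every decoding position can be computed in parallel during training. A generic PyTorch sketch follows; the dimensions and the use of nn.MultiheadAttention are illustrative, not any cited model's exact architecture.

```python
# Each partial-caption position queries all image regions in parallel;
# no recurrence is needed, unlike an LSTM decoder.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

words = torch.randn(2, 12, d_model)    # partial caption embeddings (B, T, D)
regions = torch.randn(2, 36, d_model)  # e.g. 36 bottom-up region features

ctx, weights = attn(query=words, key=regions, value=regions)
print(ctx.shape, weights.shape)        # (2, 12, 512), (2, 12, 36)
```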
“…Earlier, Xu et al. [20] introduced a spatial attention model that uses image feature maps to generate captions, which was extended with a channel-wise attention module in [21]. Later, [22] introduces a gated hierarchical attention module that merges low-level features with high-level features. Xu et al. [26] propose an attention-gated LSTM model in which the output gate incorporates visual attention and is forwarded to the LSTM cell state.…”
Section: Attention-based Image Captioning
confidence: 99%
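
One plausible reading of the attention-gated LSTM in [26] is a standard LSTM cell whose exposed hidden state is re-gated by an attended visual context. The sketch below encodes that interpretation only; it should not be taken as the cited paper's exact equations.

```python
# Hedged sketch: re-gate the LSTM hidden output with a sigmoid over the
# attended visual context, letting attention decide how much of the cell
# state is exposed to the word predictor.
import torch
import torch.nn as nn

class AttentionGatedLSTMCell(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, ctx_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hid_dim)
        self.ctx_to_gate = nn.Linear(ctx_dim, hid_dim)

    def forward(self, x, state, visual_ctx):
        # x: (B, in_dim); state: (h, c); visual_ctx: (B, ctx_dim).
        h, c = self.cell(x, state)
        g = torch.sigmoid(self.ctx_to_gate(visual_ctx))
        return g * h, c
```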