2017
DOI: 10.1609/aaai.v31i1.11237

Text-Guided Attention Model for Image Captioning

Abstract: Visual attention plays an important role in understanding images and has proven effective for generating natural language descriptions of images. Conversely, recent studies show that language associated with an image can steer visual attention over the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach tha…
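The mechanism the abstract describes can be read as attention over image region features conditioned on an embedding of a retrieved guidance caption. Below is a minimal sketch in PyTorch of that conditioning step; the class name, dimensions, and the pooled-caption input are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedAttention(nn.Module):
    """Attend over image regions, guided by a caption embedding (illustrative)."""
    def __init__(self, region_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)  # project region features
        self.proj_t = nn.Linear(text_dim, hidden_dim)    # project guidance caption
        self.score = nn.Linear(hidden_dim, 1)            # scalar attention score

    def forward(self, regions: torch.Tensor, caption: torch.Tensor):
        # regions: (batch, num_regions, region_dim)
        # caption: (batch, text_dim) -- pooled embedding of a guidance caption
        h = torch.tanh(self.proj_v(regions) + self.proj_t(caption).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=-1)  # (batch, num_regions)
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)  # attended image feature
        return context, alpha

# Usage: attend over 49 region features (e.g., a 7x7 CNN grid) with a
# 300-d guidance-caption embedding; shapes are assumptions for the sketch.
att = TextGuidedAttention(region_dim=512, text_dim=300, hidden_dim=256)
context, alpha = att(torch.randn(2, 49, 512), torch.randn(2, 300))
print(context.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 49])
```

The attended feature would then be fed to a caption decoder in place of (or alongside) a globally pooled image feature.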

Cited by 45 publications (6 citation statements) · References 14 publications (21 reference statements)
Citation types: 0 supporting, 6 mentioning, 0 contrasting · Citing publications: 2018–2023
“…A text-guided attention approach was presented by Mun et al. [105]. Related sample captions, termed guidance captions, were employed to guide visual attention and produce appropriate captions.…”
Section: Guided Attention (mentioning; confidence: 99%)
“…In the VQA task, self-attention is extensively used to model word-to-word relationships for questions and region-to-region relationships for images. Question-guided attention on image regions or video frames is generally explored for visual question answering [27,36,45,46], video question answering [47][48][49], image captioning [6], etc. To capture denser correlations between cross-modal nodes, co-attention-based approaches [25,29,36,37] use bi-directional attention to learn the relationships between word-region pairs.…”
Section: Attention Mechanisms (mentioning; confidence: 99%)
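As a companion to the co-attention approaches the snippet above mentions, here is a minimal sketch of bi-directional attention over word-region pairs, assuming an affinity matrix computed with a learned bilinear form; names and shapes are illustrative assumptions, not any cited paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Bi-directional word-region co-attention via a bilinear affinity (illustrative)."""
    def __init__(self, word_dim: int, region_dim: int):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(word_dim, region_dim) * 0.01)

    def forward(self, words: torch.Tensor, regions: torch.Tensor):
        # words:   (batch, num_words, word_dim)
        # regions: (batch, num_regions, region_dim)
        affinity = words @ self.bilinear @ regions.transpose(1, 2)  # (b, W, R)
        # word -> region: for each word, where to look in the image
        att_w2r = F.softmax(affinity, dim=2)
        # region -> word: for each region, which words describe it
        att_r2w = F.softmax(affinity, dim=1)
        attended_regions = att_w2r @ regions                  # (b, num_words, region_dim)
        attended_words = att_r2w.transpose(1, 2) @ words      # (b, num_regions, word_dim)
        return attended_regions, attended_words
```

The two softmaxes over the same affinity matrix are what make the attention bi-directional: each modality attends to the other without a fixed query/key role.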
“…At present, multi-modal learning has bridged the gap between vision and language and has attracted wide attention [1][2][3][4][5]. Remarkable progress has been made in many multi-modal learning tasks, e.g., image captioning [6][7][8][9], video captioning [10][11][12], cross-modal retrieval [13][14][15][16][17][18][19][20][21][22], and visual question answering (VQA) [7,[23][24][25][26][27][28][29][30][31].…”
Section: Introduction (mentioning; confidence: 99%)
“…State of the art and limitations: The pursuit of multimodal-input-based abstractive text summarization is related to various other fields, such as image and video captioning [22,34,39,48,49], video story generation [16], video title generation [57], and multimodal sentence summarization [28]. However, these works generally produce summaries from either images or short videos, and the target summaries are easier to predict due to their limited vocabulary diversity.…”
Section: Introduction (mentioning; confidence: 99%)