2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00636
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Fa…
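To make the abstract's mechanism concrete, here is a minimal sketch of top-down attention computed over bottom-up region features: a set of precomputed region descriptors (e.g., 2048-d pooled features per proposal) is weighted by softmax-normalized scores conditioned on a top-down query vector. The class name, layer sizes, and the PyTorch framing are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch, assuming precomputed bottom-up region features and a
# top-down query vector; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, feat_dim=2048, query_dim=1024, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)   # project region features
        self.proj_q = nn.Linear(query_dim, hidden_dim)  # project top-down query
        self.score = nn.Linear(hidden_dim, 1)           # scalar score per region

    def forward(self, V, h):
        # V: (batch, k, feat_dim) bottom-up region features
        # h: (batch, query_dim) top-down query (e.g., LSTM state or question encoding)
        joint = torch.tanh(self.proj_v(V) + self.proj_q(h).unsqueeze(1))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, k) weights
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=1)             # attended feature
        return v_hat, alpha

# Example: batch of 2 images with 36 region proposals each
att = TopDownAttention()
v_hat, alpha = att(torch.randn(2, 36, 2048), torch.randn(2, 1024))
print(v_hat.shape, alpha.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```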

Cited by 3,807 publications (4,149 citation statements)
References 52 publications
“…In this work, we propose a novel task of weakly-supervised relation prediction, with the objective of detecting relations between entities in an image purely from captions and object-level bounding box annotations without class information. Our proposed method builds upon top-down attention (Anderson et al., 2018), which generates captions and grounds words in these captions to entities in images.…”
Section: Results
confidence: 99%
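The grounding step mentioned in this citation can be illustrated with a small, hypothetical sketch: each generated word is assigned to the region proposal that receives the highest attention weight at the step the word is emitted. The function name, the min_weight threshold, and the tensor shapes below are assumptions for illustration, not the cited method's actual procedure.

```python
# Hypothetical word-to-region grounding from decoder attention weights.
import torch

def ground_words(attention, words, boxes, min_weight=0.2):
    # attention: (T, k) attention weights over k regions at each of T decoding steps
    # words:     list of T generated tokens
    # boxes:     (k, 4) bounding boxes of the k region proposals
    groundings = []
    for t, word in enumerate(words):
        weight, region = attention[t].max(dim=0)
        if weight.item() >= min_weight:              # keep only confident alignments
            groundings.append((word, boxes[region].tolist(), weight.item()))
    return groundings

att = torch.softmax(torch.randn(3, 5), dim=1)        # 3 words, 5 regions
print(ground_words(att, ["a", "red", "bus"], torch.rand(5, 4)))
```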
“…Captioning using visual attention has proven to be very successful in aligning the words in a caption to their corresponding visual features, such as in Anderson et al. (2018). As shown in Figure 1, we adopt the two-layer LSTM architecture of Anderson et al. (2018); our end goal, however, is to associate each word with the closest object feature rather than producing a caption. The lower Attention LSTM cell takes in the words and the global image context vector (f, the mean of all features F), and its hidden state h^a_t acts as a query vector.…”
Section: Grounding Caption Words to Object Features
confidence: 99%
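To make the quoted description concrete, here is a minimal single-step sketch of the two-layer decoder from Anderson et al. (2018) as summarized above: an Attention LSTM consumes the previous word embedding, the mean-pooled image feature f, and the Language LSTM's previous hidden state, and its hidden state h^a_t serves as the query for soft attention over region features. The dimensions, the attend() helper, and the usage at the end are illustrative assumptions rather than either paper's exact configuration.

```python
# Single decoding step of a two-layer (Attention LSTM + Language LSTM) decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=1024, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        # Attention LSTM input: [previous language-LSTM state, mean image feature f, word embedding]
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Language LSTM input: [attended feature v_hat, attention-LSTM state h_a]
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.att_v = nn.Linear(feat_dim, 512)
        self.att_h = nn.Linear(hidden_dim, 512)
        self.att_score = nn.Linear(512, 1)
        self.out = nn.Linear(hidden_dim, vocab)

    def attend(self, V, h_a):
        # Soft attention over region features V, queried by h_a.
        e = self.att_score(torch.tanh(self.att_v(V) + self.att_h(h_a).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * V).sum(1), alpha

    def forward(self, word, V, state):
        (h_a, c_a), (h_l, c_l) = state
        f = V.mean(dim=1)                                 # global image context vector
        x_a = torch.cat([h_l, f, self.embed(word)], dim=1)
        h_a, c_a = self.att_lstm(x_a, (h_a, c_a))         # h_a acts as the query vector
        v_hat, alpha = self.attend(V, h_a)
        h_l, c_l = self.lang_lstm(torch.cat([v_hat, h_a], dim=1), (h_l, c_l))
        return self.out(h_l), alpha, ((h_a, c_a), (h_l, c_l))

# One step on a batch of 2 images with 36 regions each
dec = TwoLayerDecoderStep()
state = tuple((torch.zeros(2, 1024), torch.zeros(2, 1024)) for _ in range(2))
logits, alpha, state = dec(torch.zeros(2, dtype=torch.long), torch.randn(2, 36, 2048), state)
print(logits.shape, alpha.shape)  # torch.Size([2, 10000]) torch.Size([2, 36])
```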
“…For each image I, we extract 100 region proposals and associated region features. However, different from bottom-up & top-down attention [25], we select the image region feature R ∈ ℝ^{u×2×2×2048} as input. We map the dynamically changing question vector to the scaling factor and bias term of the channel feature through the fully connected layers fc and hc.…”
Section: Visual and Language Feature Preprocess
confidence: 99%
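The question-conditioned scaling and bias described in this citation resembles feature-wise (FiLM-style) channel modulation. The sketch below maps a question vector through two fully connected layers to a per-channel scale and bias applied to region features of shape (u, 2, 2, 2048); the layer names fc_scale and fc_bias and all dimensions are assumptions standing in for the cited work's fc and hc layers.

```python
# Hedged sketch of question-conditioned channel-wise modulation of region features.
import torch
import torch.nn as nn

class QuestionFiLM(nn.Module):
    def __init__(self, q_dim=1024, channels=2048):
        super().__init__()
        self.fc_scale = nn.Linear(q_dim, channels)   # gamma: per-channel scaling factor
        self.fc_bias = nn.Linear(q_dim, channels)    # beta:  per-channel bias term

    def forward(self, R, q):
        # R: (batch, u, 2, 2, channels) region features; q: (batch, q_dim) question vector
        gamma = self.fc_scale(q).view(q.size(0), 1, 1, 1, -1)
        beta = self.fc_bias(q).view(q.size(0), 1, 1, 1, -1)
        return gamma * R + beta                      # broadcast over regions and spatial cells

film = QuestionFiLM()
R = torch.randn(4, 100, 2, 2, 2048)                  # 100 region proposals per image
out = film(R, torch.randn(4, 1024))
print(out.shape)                                     # torch.Size([4, 100, 2, 2, 2048])
```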
“…The encoder-decoder model first extracts high-level visual features from a CNN trained on the image classification task, and then feeds the visual features into an RNN model to predict subsequent words of a caption for a given image. In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. Semantic concept analysis, or attribute prediction [17,21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions.…”
Section: Deep Image Captioning
confidence: 99%
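As a quick reference for the encoder-decoder paradigm summarized in this citation, the sketch below encodes an image with a CNN classification backbone and decodes a caption greedily with an LSTM. The backbone choice, dimensions, vocabulary size, and decoding loop are illustrative assumptions, not any specific cited model.

```python
# Compact encoder-decoder captioning sketch: CNN feature -> LSTM word-by-word decoding.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderDecoderCaptioner(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=512, vocab=10000):
        super().__init__()
        backbone = models.resnet50()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(2048, hidden_dim)
        self.embed = nn.Embedding(vocab, embed_dim)
        self.rnn = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, images, max_len=16, bos=1):
        feats = self.cnn(images).flatten(1)                  # (batch, 2048) high-level visual feature
        h = torch.tanh(self.img_proj(feats))                 # initialize hidden state from the image
        c = torch.zeros_like(h)
        word = torch.full((images.size(0),), bos, dtype=torch.long)
        tokens = []
        for _ in range(max_len):                             # greedy decoding
            h, c = self.rnn(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=1)
            tokens.append(word)
        return torch.stack(tokens, dim=1)                    # (batch, max_len) token ids

cap = EncoderDecoderCaptioner()
print(cap(torch.randn(2, 3, 224, 224)).shape)                # torch.Size([2, 16])
```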