Abstract: Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the…
“…(3) We execute extensive ablation studies for each component of QBN and achieve state-of-the-art performance on VQA v2.0 [6]. Surprisingly, our proposed QBN can even surpass BERT pre-trained models like ViLBERT.…”
Multi-modality fusion technologies have greatly improved the performance of neural network-based Video Description/Captioning, Visual Question Answering (VQA) and Audio Visual Scene-aware Dialog (AVSD) in recent years. Most previous approaches only exploit the last layer of multi-layer feature fusion while omitting the importance of the intermediate layers. To address this, we propose an efficient Quaternion Block Network (QBN) that learns interactions not only at the last layer but also at all intermediate layers simultaneously. In our proposed QBN, holistic text features guide the update of visual features, while Hamilton quaternion products efficiently propagate information from higher layers to lower layers for both the visual and text modalities. The evaluation results show that our QBN improves performance on VQA 2.0 and even surpasses approaches that use large-scale BERT or visual BERT pre-trained models. Extensive ablation studies have been carried out to examine the influence of each proposed module.
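For readers unfamiliar with quaternion algebra, the following is a minimal PyTorch sketch of the Hamilton product applied to feature tensors whose last dimension is split into four quaternion components (real, i, j, k). The tensor names, shapes, and the way the two modalities are paired are illustrative assumptions, not the authors' implementation.

import torch

def hamilton_product(q: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    # Element-wise Hamilton product of two quaternion-valued feature tensors.
    # Both inputs have shape (..., 4 * d): the last dimension holds the real,
    # i, j, and k components of d quaternions.
    a, b, c, d = torch.chunk(q, 4, dim=-1)
    e, f, g, h = torch.chunk(r, 4, dim=-1)
    real = a * e - b * f - c * g - d * h
    i    = a * f + b * e + c * h - d * g
    j    = a * g - b * h + c * e + d * f
    k    = a * h + b * g - c * f + d * e
    return torch.cat([real, i, j, k], dim=-1)

# Hypothetical usage: mix a higher-layer text feature into a lower-layer visual feature.
text_feat   = torch.randn(8, 4 * 256)   # assumed batch of text features
visual_feat = torch.randn(8, 4 * 256)   # assumed batch of visual features
fused = hamilton_product(text_feat, visual_feat)

The Hamilton product is non-commutative and couples all four components, which is why it is often used to let information from one modality or layer modulate another with relatively few parameters.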
“…VQA consisting of open-ended questions and both real and abstract scenes [44], [234]. A VQA Challenge based on these datasets has been held annually as a CVPR workshop since 2016.…”
Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus is the combination of vision and natural language, which has become an important area in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embeddings constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures for integrating the representations of unimodal signals for a particular task. On applications, selected areas of broad interest in the current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.
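As an illustration of the joint-embedding idea this review highlights, the sketch below projects unimodal image and text features into one shared vector space and computes a cross-modal similarity matrix. All dimensions, layer names, and the cosine-similarity choice are assumptions made for illustration, not a specific method from the review.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    # Project image and text features into one shared space (illustrative only).
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb @ txt_emb.t()   # cross-modal cosine-similarity matrix

# Hypothetical usage: score every image against every caption in a batch.
sims = JointEmbedding()(torch.randn(4, 2048), torch.randn(4, 768))

Once both modalities live in the same space, retrieval, grounding, and fusion reduce to ordinary vector operations, which is the property the review's discussion of embeddings builds on.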
“…Specifically, VQA takes an image and a corresponding natural language question as input and outputs the answer. It is a classification problem in which candidate answers are restricted to the most common answers appearing in the dataset, and it requires deep analysis and understanding of images and questions, such as image recognition and object localization [16,27,38,42]. Current models can be classified into three main categories: early fusion models, late fusion models, and external knowledge-based models.…”
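A minimal sketch of the classification formulation mentioned in this excerpt, assuming a simple late-fusion design: image and question features are projected, fused by element-wise product, and scored against a fixed vocabulary of the most common answers. Every name and size below is hypothetical.

import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    # Toy late-fusion VQA classifier over a fixed answer vocabulary (illustrative).
    def __init__(self, img_dim=2048, q_dim=1024, hidden=1024, num_answers=3000):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hidden)
        self.q_fc = nn.Linear(q_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat):
        # Fuse the two modalities late, after independent encoding.
        fused = torch.relu(self.img_fc(img_feat)) * torch.relu(self.q_fc(q_feat))
        return self.classifier(fused)   # logits over the candidate answers

Early-fusion models would instead combine the raw or low-level features before (or while) encoding them, and external-knowledge models add a retrieval component on top of either scheme.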
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting sticker responses by matching the text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is learning the semantic meaning of stickers without corresponding text labels. The other is jointly modeling the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolution-based sticker image encoder and a self-attention-based multi-turn dialog encoder to obtain representations of stickers and utterances. Next, a deep interaction network conducts deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results with a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance on all commonly used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs.
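The pipeline described in this abstract (convolutional sticker encoder, self-attention dialog encoder, per-utterance interaction, fusion into a matching score) can be sketched roughly as follows. Every layer size, name, and the mean-pooling fusion are hypothetical illustrations, not the released SRS code.

import torch
import torch.nn as nn

class StickerMatcher(nn.Module):
    # Rough sketch of matching a candidate sticker to a multi-turn dialog context.
    def __init__(self, utt_dim=300, hidden=256, n_heads=4):
        super().__init__()
        # Convolutional sticker image encoder
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hidden),
        )
        # Self-attention over utterance representations of the dialog history
        self.utt_proj = nn.Linear(utt_dim, hidden)
        self.dialog_enc = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        # Score head over the fused interaction results
        self.score = nn.Linear(hidden, 1)

    def forward(self, sticker, utterances):
        # sticker: (B, 3, H, W); utterances: (B, T, utt_dim)
        s = self.img_enc(sticker)                        # (B, hidden)
        u = self.dialog_enc(self.utt_proj(utterances))   # (B, T, hidden)
        interaction = u * s.unsqueeze(1)                 # match the sticker with each turn
        fused = interaction.mean(dim=1)                  # aggregate over the dialog history
        return self.score(fused).squeeze(-1)             # one matching score per pair

At inference, each candidate sticker would be scored against the same dialog context and the highest-scoring sticker recommended.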