2019
DOI: 10.1609/aaai.v33i01.33018102

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Abstract: Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-…
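To make the parameter-count argument concrete, here is a minimal sketch of block-superdiagonal bilinear fusion (a hypothetical PyTorch module; the dimensions, initialization scale, and final linear merge are our assumptions, not the paper's exact architecture). Each modality is projected and split into chunks, and a small dense core tensor handles the bilinear interaction within each chunk, so the parameter count grows with the block sizes rather than quadratically with the full input dimensions.

import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    """Sketch of block-superdiagonal (block-term) bilinear fusion."""
    def __init__(self, d1, d2, d_out, n_blocks=4, block_in=16, block_out=16):
        super().__init__()
        self.n_blocks = n_blocks
        # Factor matrices: project each modality into n_blocks chunks.
        self.proj1 = nn.Linear(d1, n_blocks * block_in)
        self.proj2 = nn.Linear(d2, n_blocks * block_in)
        # One small dense core tensor per block (the "superdiagonal" blocks);
        # each core is a full bilinear map restricted to its own chunk.
        self.cores = nn.Parameter(
            torch.randn(n_blocks, block_in, block_in, block_out) * 0.1)
        self.out = nn.Linear(n_blocks * block_out, d_out)

    def forward(self, x1, x2):
        b = x1.size(0)
        h1 = self.proj1(x1).view(b, self.n_blocks, -1)  # (b, R, block_in)
        h2 = self.proj2(x2).view(b, self.n_blocks, -1)  # (b, R, block_in)
        # Bilinear interaction within each block r: z_r = D_r x1 h1_r x2 h2_r
        z = torch.einsum('bri,brj,rijo->bro', h1, h2, self.cores)
        return self.out(z.reshape(b, -1))

For instance, BlockFusion(2048, 300, 512) would fuse a 2048-d image feature with a 300-d question embedding into a 512-d joint representation.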

Cited by 172 publications (125 citation statements). References 26 publications (37 reference statements).
“…Compared to MUTAN, MCB can be seen as MUTAN with fixed diagonal input factor matrices and a sparse fixed core tensor, while MLB is MUTAN with the core tensor set to identity. Recently, BLOCK, a block-superdiagonal fusion framework, was proposed to use block-term decomposition [160] to compute bilinear pooling [161]. BLOCK generalizes MUTAN as a summation of multiple MUTAN models to provide a richer modeling of interactions between modalities.…”
Section: Bilinear Pooling-based Fusion
Mentioning confidence: 99%
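As a reference point for this comparison, the bilinear fusion and the block-term decomposition it relies on can be written as follows (standard tensor notation; the symbols are our own shorthand, not the survey's):

\[
y = \mathcal{T} \times_1 x_1 \times_2 x_2, \qquad
\mathcal{T} = \sum_{r=1}^{R} \mathcal{D}_r \times_1 A_r \times_2 B_r \times_3 C_r,
\]

where each \(\mathcal{D}_r\) is a small dense core tensor. With \(R = 1\) this reduces to the Tucker decomposition used by MUTAN, which is why BLOCK can be read as a sum of R MUTAN-style models.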
“…1 and Eq. 2, the node representations of each layer of graphs are updated following the message-passing framework [Gilmer et al, 2017]. We gather the neighborhood information and update the representation of v_i as:…”
Section: Intra-modal Knowledge Selection
Mentioning confidence: 99%
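The quoted update rule is cut off in the excerpt; for orientation, the generic message-passing step of Gilmer et al. (2017) that it refers to takes the form below (our notation, not the citing paper's exact equations):

\[
m_i^{(t+1)} = \sum_{j \in \mathcal{N}(i)} M_t\!\left(h_i^{(t)}, h_j^{(t)}, e_{ij}\right), \qquad
h_i^{(t+1)} = U_t\!\left(h_i^{(t)}, m_i^{(t+1)}\right),
\]

where \(\mathcal{N}(i)\) is the neighborhood of node \(v_i\), \(M_t\) is the message function, and \(U_t\) is the node-update function.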
“…Equipped with the capacities of grounding, reasoning and translating, a VQA agent is expected to answer a question in natural language based on an image. Recent works [Cadene et al, 2019; …]…”
[Figure 1 caption: An illustration of our motivation. We represent an image by multi-layer graphs and cross-modal knowledge reasoning is conducted on the graphs to infer the optimal answer.]
Section: Introduction
Mentioning confidence: 99%
“…However, these simple phrases cannot represent such complex relationships in an image. General visual relationship detection has received more attention [18][19][20], where the subject and object can be any objects in the image and their relationships cover a wide range of relationship types. These methods generally adopt a neural network to classify the relationship, using the bounding boxes and semantic features of the subject and object as input.…”
Section: Related Work
Mentioning confidence: 99%
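To make the pattern these methods share concrete, here is a minimal sketch of such a relationship classifier (hypothetical PyTorch code; the feature dimensions, predicate count, and simple concatenation are our assumptions, not any specific cited architecture):

import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Classifies the predicate between a subject and an object from
    their box geometry and appearance/semantic features (a generic
    sketch, not a specific published architecture)."""
    def __init__(self, feat_dim=512, n_predicates=70):
        super().__init__()
        # 4 box coordinates each for subject and object, plus two feature vectors.
        self.mlp = nn.Sequential(
            nn.Linear(2 * 4 + 2 * feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_predicates))

    def forward(self, subj_box, obj_box, subj_feat, obj_feat):
        # Concatenate geometry and semantics of both objects, then classify.
        x = torch.cat([subj_box, obj_box, subj_feat, obj_feat], dim=-1)
        return self.mlp(x)  # predicate logits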