2019
DOI: 10.1609/aaai.v33i01.33018102

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Abstract: Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-…
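To make the parameter-count argument concrete, here is a minimal sketch of block-superdiagonal bilinear fusion (a hypothetical PyTorch module; the dimensions, initialization scale, and final linear merge are our assumptions, not the paper's exact architecture). Each modality is projected and split into chunks, and a small dense core tensor handles the bilinear interaction within each chunk, so the parameter count grows with the block sizes rather than quadratically with the full input dimensions.

import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    """Sketch of block-superdiagonal (block-term) bilinear fusion."""
    def __init__(self, d1, d2, d_out, n_blocks=4, block_in=16, block_out=16):
        super().__init__()
        self.n_blocks = n_blocks
        # Factor matrices: project each modality into n_blocks chunks.
        self.proj1 = nn.Linear(d1, n_blocks * block_in)
        self.proj2 = nn.Linear(d2, n_blocks * block_in)
        # One small dense core tensor per block (the "superdiagonal" blocks);
        # each core is a full bilinear map restricted to its own chunk.
        self.cores = nn.Parameter(
            torch.randn(n_blocks, block_in, block_in, block_out) * 0.1)
        self.out = nn.Linear(n_blocks * block_out, d_out)

    def forward(self, x1, x2):
        b = x1.size(0)
        h1 = self.proj1(x1).view(b, self.n_blocks, -1)  # (b, R, block_in)
        h2 = self.proj2(x2).view(b, self.n_blocks, -1)  # (b, R, block_in)
        # Bilinear interaction within each block r: z_r = D_r x1 h1_r x2 h2_r
        z = torch.einsum('bri,brj,rijo->bro', h1, h2, self.cores)
        return self.out(z.reshape(b, -1))

For instance, BlockFusion(2048, 300, 512) would fuse a 2048-d image feature with a 300-d question embedding into a 512-d joint representation.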

Cited by 172 publications (125 citation statements). References 26 publications (37 reference statements).
“…Compared to MUTAN, MCB can be seen as MUTAN with fixed diagonal input factor matrices and a sparse fixed core tensor, while MLB is MUTAN with the core tensor set to identity. Recently, BLOCK, a block-superdiagonal fusion framework, was proposed to use block-term decomposition [160] to compute bilinear pooling [161]. BLOCK generalizes MUTAN as a summation of multiple MUTAN models to provide a richer modeling of interactions between modalities.…”
Section: Bilinear Pooling-based Fusion
Mentioning confidence: 99%
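As a reference point for this comparison, the bilinear fusion and the block-term decomposition it relies on can be written as follows (standard tensor notation; the symbols are our own shorthand, not the survey's):

\[
y = \mathcal{T} \times_1 x_1 \times_2 x_2, \qquad
\mathcal{T} = \sum_{r=1}^{R} \mathcal{D}_r \times_1 A_r \times_2 B_r \times_3 C_r,
\]

where each \(\mathcal{D}_r\) is a small dense core tensor. With \(R = 1\) this reduces to the Tucker decomposition used by MUTAN, which is why BLOCK can be read as a sum of R MUTAN-style models.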
“…1 and Eq. 2, the node representations of each layer of graphs are updated following the message-passing framework [Gilmer et al, 2017]. We gather the neighborhood information and update the representation of v_i as:…”
Section: Intra-modal Knowledge Selection
Mentioning confidence: 99%
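The quoted update rule is cut off in the excerpt; for orientation, the generic message-passing step of Gilmer et al. (2017) that it refers to takes the form below (our notation, not the citing paper's exact equations):

\[
m_i^{(t+1)} = \sum_{j \in \mathcal{N}(i)} M_t\!\left(h_i^{(t)}, h_j^{(t)}, e_{ij}\right), \qquad
h_i^{(t+1)} = U_t\!\left(h_i^{(t)}, m_i^{(t+1)}\right),
\]

where \(\mathcal{N}(i)\) is the neighborhood of node \(v_i\), \(M_t\) is the message function, and \(U_t\) is the node-update function.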
“…Equipped with the capacities of grounding, reasoning and translating, a VQA agent is expected to answer a question in natural language based on an image. Recent works [Cadene et al, 2019; …]…”
[Figure 1 caption: An illustration of our motivation. We represent an image by multi-layer graphs and cross-modal knowledge reasoning is conducted on the graphs to infer the optimal answer.]
Section: Introduction
Mentioning confidence: 99%
“…However, these simple phrases cannot represent such complex relationships in an image. General visual relationship detection has received more attention [18][19][20], where the subject and object can be any objects in the image and their relationships cover a wide range of relationship types. These methods generally adopt a neural network to classify the relationship, using the bounding boxes and semantic features of the subject and object as input.…”
Section: Related Work
Mentioning confidence: 99%
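To make the pattern these methods share concrete, here is a minimal sketch of such a relationship classifier (hypothetical PyTorch code; the feature dimensions, predicate count, and simple concatenation are our assumptions, not any specific cited architecture):

import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Classifies the predicate between a subject and an object from
    their box geometry and appearance/semantic features (a generic
    sketch, not a specific published architecture)."""
    def __init__(self, feat_dim=512, n_predicates=70):
        super().__init__()
        # 4 box coordinates each for subject and object, plus two feature vectors.
        self.mlp = nn.Sequential(
            nn.Linear(2 * 4 + 2 * feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_predicates))

    def forward(self, subj_box, obj_box, subj_feat, obj_feat):
        # Concatenate geometry and semantics of both objects, then classify.
        x = torch.cat([subj_box, obj_box, subj_feat, obj_feat], dim=-1)
        return self.mlp(x)  # predicate logits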