2019 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip.2019.8803106
Deeply Supervised Multimodal Attentional Translation Embeddings for Visual Relationship Detection

Abstract: Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of ex…

Cited by 12 publications (5 citation statements). References 23 publications (94 reference statements).

Citation statements (ordered by relevance):
“…The Multimodal Attentional Translation Embeddings (MATransE) model built upon VTransE [71] learns a projection of <S, P, O> into a score space where S + P ≈ O, by guiding the features' projection with attention to satisfy:…”
Section: Visual Translation Embedding (mentioning)
confidence: 99%
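The translation constraint quoted above (S + P ≈ O in a learned score space) can be illustrated with a minimal sketch. The projection matrices, feature dimensions, and L2 residual scoring below are illustrative assumptions, not the exact MATransE or VTransE formulation.

```python
import numpy as np

# Minimal sketch of a visual translation-embedding score (shapes and
# projection matrices are illustrative assumptions, not the authors' model).
# Subject, predicate and object features are projected into a common
# low-dimensional space where s + p should lie close to o.

def translation_score(f_s, f_p, f_o, W_s, W_p, W_o):
    """Small when W_s @ f_s + W_p @ f_p is close to W_o @ f_o."""
    s = W_s @ f_s          # projected subject embedding
    p = W_p @ f_p          # projected predicate embedding
    o = W_o @ f_o          # projected object embedding
    return np.linalg.norm(s + p - o)   # L2 residual of the translation constraint

# Toy usage with random features (illustrative only)
rng = np.random.default_rng(0)
d_in, d_emb = 512, 64
f_s, f_p, f_o = (rng.normal(size=d_in) for _ in range(3))
W_s, W_p, W_o = (rng.normal(size=(d_emb, d_in)) * 0.01 for _ in range(3))
print(translation_score(f_s, f_p, f_o, W_s, W_p, W_o))
```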
“…Long-tailed data distribution has been a key challenge in visual recognition [41], and it has been addressed in the recent literature on SGG [42]. In order to tackle this problem, various approaches have been proposed, such as data resampling [43], [44], [45], [46], de-biasing [16], [47], [48], [49], [50], and loss modification [51], [52], [53], [54], [55], [56], [57]. De-biasing methods require pre-trained biased models for initialization and then finetune the model.…”
Section: Long-tailed Distributions (mentioning)
confidence: 99%
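One of the long-tail remedies listed above, data resampling, amounts to drawing rare predicate classes more often during training. The helper below is a generic inverse-frequency weighting recipe with toy labels, not the scheme of any specific cited work.

```python
from collections import Counter

# Sketch of frequency-based resampling for long-tailed predicates:
# each sample gets a weight inversely proportional to its class count,
# so rare relations are sampled more often. (Generic recipe, assumed.)

def resampling_weights(labels):
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = ["on"] * 90 + ["riding"] * 9 + ["feeding"] * 1   # toy long-tailed predicate labels
weights = resampling_weights(labels)
# e.g. pass `weights` to a weighted random sampler in the training loader
```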
“…However, simple positional embedding implicitly captures spatial information with position coordinates as inputs to networks, which is unable to capture the explicit spatial configuration in the feature space. [13] introduces binary masks which explicitly specify the subject and object positions, implicitly specifying the spatial configuration by concatenating with union features. With the recent advances in graph neural networks, the relevant positional information can be captured by message passing between instances [59].…”
Section: Spatial Information (mentioning)
confidence: 99%
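The binary-mask spatial encoding attributed to [13] above can be sketched as two box masks stacked into a spatial configuration tensor; the mask resolution, box format, and the downstream concatenation with union-box features are assumptions for illustration.

```python
import numpy as np

# Rough sketch of the binary-mask spatial encoding described above:
# one H x W mask marks the subject box and another the object box;
# stacked, they explicitly encode the spatial configuration and can be
# fused (e.g. concatenated after a small conv net) with union features.

def box_to_mask(box, H=32, W=32):
    """box = (x1, y1, x2, y2), normalized to [0, 1]; returns an H x W binary mask."""
    x1, y1, x2, y2 = box
    mask = np.zeros((H, W), dtype=np.float32)
    mask[int(y1 * H):max(int(y2 * H), int(y1 * H) + 1),
         int(x1 * W):max(int(x2 * W), int(x1 * W) + 1)] = 1.0
    return mask

subj_mask = box_to_mask((0.10, 0.20, 0.45, 0.90))   # subject position
obj_mask  = box_to_mask((0.40, 0.55, 0.95, 0.85))   # object position
spatial = np.stack([subj_mask, obj_mask])            # 2 x H x W spatial configuration
```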
“…Different from the standard positional embedding [57] which implicitly utilizes the spatial information with the absolute positions as an input, SMD explicitly learns a structured embedding. Compared to [13] which concatenates binary mask as position feature, SMD explicitly imposes the spatial structure in the embedding vector and gives better spatially-aware embeddings. We empirically show SMD outperforms these variants in our experiments.…”
Section: Spatial Mask Decoder (mentioning)
confidence: 99%