2020
DOI: 10.48550/arxiv.2008.05231
Preprint

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Abstract: Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the problem of accurate cross-media retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. In particular, we present an approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images …
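The abstract above describes scoring image-sentence pairs through fine-grained word-region alignments while supervising only the global image-sentence match. As a rough illustration of that general idea, the following Python sketch (hypothetical names and pooling choices, not necessarily TERAN's exact formulation) aligns each word with its most similar image region and pools the alignments into one global score:

import torch.nn.functional as F

def image_sentence_score(region_emb, word_emb):
    # region_emb: (n_regions, dim) visual region features; word_emb: (n_words, dim) word features.
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = word_emb @ region_emb.t()        # cosine similarity of every word with every region
    best_per_word, _ = sim.max(dim=1)      # align each word with its most similar region
    return best_per_word.mean()            # pool the alignments into one global score

A global score produced this way can then be optimized with a ranking or contrastive objective defined only on whole image-sentence pairs, which is the level of supervision the abstract refers to.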

Cited by 5 publications (7 citation statements)
References 39 publications

Citation statements (ordered by relevance):
“…Recently, there has been a shift to transformer-based [52] network architectures for both the image and caption encoder. Messina et al. [38,39] introduce a transformer-based network architecture solely trained for the ICR task. Since then, several transformer-based methods have been introduced [8,22,28,31,35]; some of them combine the image and caption encoder into one unified architecture.…”
Section: Related Work 2.1 Image-Caption Retrieval (mentioning)
confidence: 99%
“…Both the images and captions are mapped into a shared latent space by two encoders, which correspond to the two modalities. These encoders are typically optimized with contrastive loss functions [4,11,12,22,27,29,34,38,39,54,55,59]. Shortcut learning.…”
Section: Introduction (mentioning)
confidence: 99%
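The statement above describes the standard dual-encoder setup in which two separate encoders map images and captions into a shared latent space and are trained with a contrastive objective. Below is a minimal sketch of one such objective, a symmetric InfoNCE-style loss over an in-batch similarity matrix; the function name and the use of in-batch negatives are illustrative assumptions rather than the specific losses used in the cited works:

import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, caption_emb, temperature=0.07):
    # image_emb, caption_emb: (batch, dim) outputs of the two encoders, where
    # image_emb[i] and caption_emb[i] form the matching (positive) pair.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = image_emb @ caption_emb.t() / temperature   # all image-caption similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # retrieve the caption given the image
    loss_t2i = F.cross_entropy(logits.t(), targets)      # retrieve the image given the caption
    return (loss_i2t + loss_t2i) / 2

Each row of the similarity matrix is treated as a classification over captions (and each column over images), so matching pairs on the diagonal are pulled together while all other in-batch pairs act as negatives.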
“…Most existing multi-modal pre-training models, especially those with the single-tower architecture [20,36,26,39,9,11,41,22,7,14], take an assumption that there exists strong semantic correlation between the input image-text pair. With this strong assumption, the interaction between image-text pairs can thus be modeled with cross-modal transformers.…”
Section: Introduction (mentioning)
confidence: 99%
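In contrast to the dual-encoder setup, the single-tower architecture mentioned above feeds both modalities through one cross-modal transformer, so that self-attention directly models the interaction between an image and its paired text. The class below is a generic, hypothetical sketch of that pattern (dimensions, vocabulary size, and layer counts are arbitrary placeholders), not the architecture of any specific cited model:

import torch
import torch.nn as nn

class SingleTowerEncoder(nn.Module):
    # Illustrative single-tower (cross-modal) transformer: image-region features and word
    # embeddings share one token sequence, so every layer attends across modalities.
    def __init__(self, region_dim=2048, vocab_size=30522, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # project visual region features
        self.word_emb = nn.Embedding(vocab_size, d_model)  # embed caption token ids
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, region_feats, token_ids):
        # region_feats: (batch, n_regions, region_dim); token_ids: (batch, n_words)
        tokens = torch.cat([self.region_proj(region_feats), self.word_emb(token_ids)], dim=1)
        return self.encoder(tokens)  # joint contextualized image-text representations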
“…Initially developed for solving natural language processing tasks, it found its way into the computer vision world, capturing the interest of the whole community. These Transformer-based architectures already proved their effectiveness in many image and video processing tasks [23,24,11,7,2]. The Transformer's success is mainly due to the power of the self-attention mechanism, which can relate every visual token with all the others, creating a powerful relational understanding pipeline.…”
Section: Introduction (mentioning)
confidence: 99%
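The self-attention mechanism credited above with relating every visual token to all the others can be summarized by the standard scaled dot-product formulation. The snippet below is a generic single-head sketch with hypothetical projection matrices, included only to make the mechanism concrete:

import torch.nn.functional as F

def self_attention(tokens, w_q, w_k, w_v):
    # tokens: (batch, n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # every token attends to every other token
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # attention-weighted mixture of all token values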