2020
DOI: 10.48550/arxiv.2008.05231
Preprint

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Abstract: Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the problem of accurate cross-media retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. In particular, we present an approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images …
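The abstract above describes scoring image-sentence pairs through fine-grained word-region alignments while supervising only the global image-sentence match. As a rough illustration of that general idea, the following Python sketch (hypothetical names and pooling choices, not necessarily TERAN's exact formulation) aligns each word with its most similar image region and pools the alignments into one global score:

import torch.nn.functional as F

def image_sentence_score(region_emb, word_emb):
    # region_emb: (n_regions, dim) visual region features; word_emb: (n_words, dim) word features.
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = word_emb @ region_emb.t()        # cosine similarity of every word with every region
    best_per_word, _ = sim.max(dim=1)      # align each word with its most similar region
    return best_per_word.mean()            # pool the alignments into one global score

A global score produced this way can then be optimized with a ranking or contrastive objective defined only on whole image-sentence pairs, which is the level of supervision the abstract refers to.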

Cited by 5 publications (7 citation statements)
References 39 publications

Citation statements (ordered by relevance):
“…Recently, there has been a shift to transformer-based [52] network architectures for both the image and caption encoder. Messina et al. [38,39] introduce a transformer-based network architecture solely trained for the ICR task. Since then, several transformer-based methods have been introduced [8,22,28,31,35]; some of them combine the image and caption encoder into one unified architecture.…”
Section: Related Work 2.1 Image-Caption Retrieval (mentioning)
confidence: 99%
“…Both the images and captions are mapped into a shared latent space by two encoders, which correspond to the two modalities. These encoders are typically optimized with contrastive loss functions [4,11,12,22,27,29,34,38,39,54,55,59]. Shortcut learning.…”
Section: Introduction (mentioning)
confidence: 99%
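The statement above describes the standard dual-encoder setup in which two separate encoders map images and captions into a shared latent space and are trained with a contrastive objective. Below is a minimal sketch of one such objective, a symmetric InfoNCE-style loss over an in-batch similarity matrix; the function name and the use of in-batch negatives are illustrative assumptions rather than the specific losses used in the cited works:

import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, caption_emb, temperature=0.07):
    # image_emb, caption_emb: (batch, dim) outputs of the two encoders, where
    # image_emb[i] and caption_emb[i] form the matching (positive) pair.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = image_emb @ caption_emb.t() / temperature   # all image-caption similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # retrieve the caption given the image
    loss_t2i = F.cross_entropy(logits.t(), targets)      # retrieve the image given the caption
    return (loss_i2t + loss_t2i) / 2

Each row of the similarity matrix is treated as a classification over captions (and each column over images), so matching pairs on the diagonal are pulled together while all other in-batch pairs act as negatives.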
“…Most existing multi-modal pre-training models, especially those with the single-tower architecture [20,36,26,39,9,11,41,22,7,14], take an assumption that there exists strong semantic correlation between the input image-text pair. With this strong assumption, the interaction between image-text pairs can thus be modeled with cross-modal transformers.…”
Section: Introduction (mentioning)
confidence: 99%
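In contrast to the dual-encoder setup, the single-tower architecture mentioned above feeds both modalities through one cross-modal transformer, so that self-attention directly models the interaction between an image and its paired text. The class below is a generic, hypothetical sketch of that pattern (dimensions, vocabulary size, and layer counts are arbitrary placeholders), not the architecture of any specific cited model:

import torch
import torch.nn as nn

class SingleTowerEncoder(nn.Module):
    # Illustrative single-tower (cross-modal) transformer: image-region features and word
    # embeddings share one token sequence, so every layer attends across modalities.
    def __init__(self, region_dim=2048, vocab_size=30522, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # project visual region features
        self.word_emb = nn.Embedding(vocab_size, d_model)  # embed caption token ids
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, region_feats, token_ids):
        # region_feats: (batch, n_regions, region_dim); token_ids: (batch, n_words)
        tokens = torch.cat([self.region_proj(region_feats), self.word_emb(token_ids)], dim=1)
        return self.encoder(tokens)  # joint contextualized image-text representations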
“…Initially developed for solving natural language processing tasks, it found its way into the computer vision world, capturing the interest of the whole community. These Transformer-based architectures already proved their effectiveness in many image and video processing tasks [23,24,11,7,2]. The Transformer's success is mainly due to the power of the self-attention mechanism, which can relate every visual token with all the others, creating a powerful relational understanding pipeline.…”
Section: Introduction (mentioning)
confidence: 99%
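The self-attention mechanism credited above with relating every visual token to all the others can be summarized by the standard scaled dot-product formulation. The snippet below is a generic single-head sketch with hypothetical projection matrices, included only to make the mechanism concrete:

import torch.nn.functional as F

def self_attention(tokens, w_q, w_k, w_v):
    # tokens: (batch, n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # every token attends to every other token
    weights = F.softmax(scores, dim=-1)
    return weights @ v  # attention-weighted mixture of all token values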