2021
DOI: 10.1609/aaai.v35i2.16209
Similarity Reasoning and Filtration for Image-Text Matching

Abstract: Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, the vector-based similarity representations…
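The abstract's two-step recipe, learning vector-based similarity representations and then filtering them with attention, can be pictured with a minimal sketch. The module below is an illustration under assumed dimensions and layers (the sigmoid gate, the 256-d similarity vectors, and the scalar scoring head are all placeholders), not the authors' SGRAF implementation:

```python
import torch
import torch.nn as nn

class AttentionFiltrationSketch(nn.Module):
    """Minimal sketch of attention-based filtration over alignment
    similarity vectors; dimensions and layers are illustrative."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(sim_dim, 1), nn.Sigmoid())
        self.score = nn.Linear(sim_dim, 1)

    def forward(self, sim_vecs):
        # sim_vecs: (n_alignments, sim_dim) similarity representations.
        w = self.gate(sim_vecs)                 # gate each alignment into [0, 1]
        w = w / (w.sum(dim=0) + 1e-8)           # normalize weights over alignments
        pooled = (w * sim_vecs).sum(dim=0)      # aggregate the filtered alignments
        return torch.sigmoid(self.score(pooled))  # scalar matching score
```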

Cited by 200 publications (89 citation statements). References 40 publications.
“…Text-and-Image Matching: The cosine-similarity-based attention alignment proposed by SCAN (Lee et al., 2018) is the most widely used in text-and-image matching (Chen and Luo, 2020; Diao et al., 2021; Dong et al., 2021). They applied text-to-image (t2i) and image-to-text (i2t) attention in two separate variants to filter the cross-modal relevant representations for later image-sentence matching.…”
Section: Related Work
Mentioning confidence: 99%
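The quoted passage refers to SCAN's cosine-similarity attention alignment. A minimal sketch of the text-to-image (t2i) variant it describes is below, assuming region and word features of a shared dimension; the temperature value and the mean pooling over words are illustrative choices rather than SCAN's exact configuration:

```python
import torch
import torch.nn.functional as F

def t2i_attention_score(regions, words, temperature=9.0):
    # regions: (n_regions, d) image region features
    # words:   (n_words, d) word features
    # temperature plays the role of SCAN's inverse-temperature lambda;
    # the value 9.0 is an illustrative placeholder.
    sim = F.normalize(words, dim=-1) @ F.normalize(regions, dim=-1).t()
    sim = sim.clamp(min=0)                      # keep only positive alignments
    attn = F.softmax(temperature * sim, dim=1)  # attend over regions per word
    context = attn @ regions                    # (n_words, d) attended visual context
    word_scores = F.cosine_similarity(words, context, dim=-1)
    return word_scores.mean()                   # pooled image-sentence similarity
```

The i2t variant mentioned in the quote is the mirror image: each region attends over words, and scores are pooled over regions instead.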
“…In order to justify the superiority of our unified loss over state-of-the-art image-text retrieval models, we conduct experiments on VSE++, BFAN [27] and SGRAF [11] by replacing only the loss functions.…”
Section: Image-Text Retrieval Without Pre-training
Mentioning confidence: 99%
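For context on what gets swapped out in such an experiment, the max-hinge (hardest-negative) triplet loss popularized by VSE++ is the usual baseline objective in these models. A hedged sketch, assuming a (B, B) similarity matrix with matched pairs on the diagonal and an illustrative margin:

```python
import torch

def max_hinge_triplet_loss(scores, margin=0.2):
    # scores: (B, B) image-sentence similarity matrix; scores[i, j] is
    # the similarity of image i and sentence j, with matched pairs on
    # the diagonal. margin=0.2 is an illustrative value.
    B = scores.size(0)
    pos = scores.diag().view(B, 1)
    cost_s = (margin + scores - pos).clamp(min=0)       # negatives per image
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # negatives per sentence
    mask = torch.eye(B, dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0.0)
    cost_im = cost_im.masked_fill(mask, 0.0)
    # VSE++ keeps only the hardest negative in each direction.
    return cost_s.max(dim=1).values.sum() + cost_im.max(dim=0).values.sum()
```

A study like the one quoted keeps each model's architecture fixed and replaces only a function like this with the proposed unified loss.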
“…Despite the difference in image types, these comparisons can be achieved by analyzing the human-detectable details in the clothes, such as clothing category, color, pattern, prints on the clothes, and so on. Most current retrieval solutions [5], [6], [7], [8], [9], [10] incorporate deep learning models that convert images into vector representations, so that the query image's embedding can be compared against the embeddings of all images in the list and the closest one can be returned. For that, triplet loss is the most widely used comparative loss technique.…”
Section: Introduction
Mentioning confidence: 99%
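The embedding-and-compare pipeline this quote describes reduces to a cosine-similarity ranking over precomputed vectors. A minimal sketch, with hypothetical names and assuming the embeddings already exist:

```python
import torch
import torch.nn.functional as F

def retrieve_nearest(query_emb, gallery_embs, k=5):
    # query_emb: (d,) embedding of the query image;
    # gallery_embs: (N, d) precomputed embeddings of the image list.
    # Function and variable names are hypothetical.
    q = F.normalize(query_emb, dim=0)     # unit-length query
    g = F.normalize(gallery_embs, dim=1)  # unit-length gallery rows
    sims = g @ q                          # (N,) cosine similarity to the query
    return sims.topk(k).indices           # indices of the k closest images
```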