2020
DOI: 10.1007/978-3-030-58601-0_33

Adaptive Offline Quintuplet Loss for Image-Text Matching

Cited by 56 publications (30 citation statements)
References 26 publications
“…We choose the latest work in the past two years as baseline methods for comparison with Global Relation-aware Attention Network (GRAN). These include SCAN [19], ACMNet [5], CASC [6], DP-RNN [9], MMCA [39], CAAN [44], IMRAM [7], AAMEL [38], SMAN [16], M3A-Net [15], which use cross-related methods; SGM [35], Guo et al [12], which use GCN [18]; Polynomial Loss [37], AMF [26], Chen et al [8], which introduce new loss; and TERAN [24], which uses transformer. R@K (K = 1, 5, 10) is adopted to evaluate the cross-modal retrieval performance of all methods.…”
Section: Baseline Methods and Evaluation Metrics (mentioning)
confidence: 99%
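R@K here denotes Recall@K: the fraction of queries whose ground-truth match is ranked among the top K retrieved candidates. Below is a minimal sketch of how it can be computed from a similarity matrix, assuming one ground-truth match per query sitting on the diagonal; the function name and the single-match setup are illustrative assumptions, not code from the cited papers.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K over an (N, N) similarity matrix.

    sim[i, j] scores query i against candidate j; the correct candidate for
    query i is assumed to be candidate i (one ground-truth match per query).
    """
    n = sim.shape[0]
    ranks = np.empty(n, dtype=int)
    for i in range(n):
        order = np.argsort(-sim[i])            # candidates sorted by descending similarity
        ranks[i] = np.where(order == i)[0][0]  # rank of the ground-truth candidate
    return {k: float(np.mean(ranks < k)) for k in ks}
```

In common image-text retrieval benchmarks each image typically has several matching captions, so real evaluation code counts a hit if any of them lands in the top K; the sketch keeps the single-match case for brevity.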
“…Our model modifies the conventional triplet [15] network architecture with multi-instance inputs and defines a custom loss function. There have been previous efforts in using generic n-tuple inputs [24,25], and a variety of loss functions such as contrastive loss [26], triplet-center loss [27], lifted loss [28], histogram loss [29], multi-similarity loss [30] and circle loss [31] have been explored. While we share with these models the general intention of designing an objective function that assigns larger weights to informative inputs, our work differs with its focus on introducing different notions of similarity rather than just improving the pair selection strategy.…”
Section: Related Work (mentioning)
“…We start by formally introducing the standard contrastive learning framework commonly used in previous works (Lee et al., 2018; Chen et al., 2020b)…”
Section: Contrastive Learning (mentioning)
confidence: 99%
“…L_{i−t} corresponds to image-to-text retrieval, while L_{t−i} corresponds to text-to-image retrieval (or image search). Common negative sampling strategies include selecting all the negatives (Huang et al., 2017), selecting hard negatives with the highest similarity scores in the mini-batch (Faghri et al., 2018), and selecting hard negatives from the whole training data (Chen et al., 2020b). Minimizing the margin-based triplet loss will make positive image-text pairs closer to each other than other negative samples in the joint embedding space.…”
Section: Contrastive Learning (mentioning)
confidence: 99%
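As a concrete illustration of the margin-based triplet loss and the in-batch hard-negative strategy described above (Faghri et al., 2018), here is a minimal PyTorch-style sketch. The function name, the margin value, and the assumption that positive pairs sit on the diagonal of the similarity matrix are illustrative choices, not code from any of the cited papers.

```python
import torch

def triplet_loss_hard_negative(sim, margin=0.2):
    """Bidirectional margin-based triplet loss over a (B, B) similarity matrix.

    sim[i, j] is the similarity between image i and text j; matching pairs are
    assumed to lie on the diagonal, and all off-diagonal entries are negatives.
    """
    pos = sim.diag().view(-1, 1)                       # s(i, t_i), one positive per image
    cost_i2t = (margin + sim - pos).clamp(min=0)       # image-to-text margin violations
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)   # text-to-image margin violations
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0.0)         # ignore the positive pair itself
    cost_t2i = cost_t2i.masked_fill(mask, 0.0)
    # hardest in-batch negative: keep only the largest violation per query
    return cost_i2t.max(dim=1)[0].mean() + cost_t2i.max(dim=0)[0].mean()
```

Summing all off-diagonal costs instead of taking the per-query maximum corresponds to the "all negatives" strategy (Huang et al., 2017), while mining hard negatives over the whole training data rather than the mini-batch is the strategy attributed above to Chen et al. (2020b).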