SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

Jandial, Surgan; Badjatiya, Pinkesh; Chawla, Pranit; Chopra, Ayush; Sarkar, Mausoom; Krishnamurthy, Balaji

doi:10.1109/wacv51458.2022.00067

Cited by 18 publications

(3 citation statements)

References 32 publications


“…To evaluate our model, we chose three real-world datasets: Fashion200K (Han et al 2017), Shoes (Guo et al 2018), and FashionIQ (Wu et al 2021). We compare our DWC with many SOTA MMIR methods, such as TIRG (Vo et al 2019), JAMMAL (Zhang et al 2020), LBF (Hosseinzadeh and Wang 2020), JVSM (Chen and Bazzani 2020), SynthTripletGAN (Tautkute and Trzcinski 2021), VAL (Chen, Gong, and Bazzani 2020), DCNet (Kim et al 2021), JPM (Yang et al 2021b), DATIR (Gu et al 2021), ComposeAE (Anwaar, Labintcev, and Kleinsteuber 2021), CoSMo (Lee, Kim, and Han 2021), CLVC-Net (Wen et al 2021), ARTEMIS (Delmas et al 2022), SAC (Jandial et al 2022), GA (Huang et al 2022), CIRPLANT (Liu et al 2021), Combiner w/ CLIP (Baldrati et al 2022b), and FashionVLP (Goenka et al 2022), where the methods in italic are based on VLP models.…”

Section: Methods (mentioning)

confidence: 99%

Dynamic Weighted Combiner for Mixed-Modal Image Retrieval

Huang, Zhang, et al. 2024

AAAI


Mixed-Modal Image Retrieval (MMIR), a flexible search paradigm, has attracted wide attention. However, previous approaches achieve limited performance because two critical factors are seriously overlooked. 1) The image and text modalities contribute differently but are incorrectly treated equally. 2) Web datasets from diverse real-world scenarios contain inherent labeling noise in the text describing users' intentions, which gives rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle these challenges, with three merits. First, we propose an Editable Modality De-equalizer (EMD) that takes into account the contribution disparity between modalities and contains two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noise and data bias, we propose a dynamic soft-similarity label generator (SSG) to implicitly improve noisy supervision. Finally, to bridge modality gaps and facilitate similarity learning, we propose a CLIP-based mutual enhancement module alternately trained by a mixed-modality contrastive loss. Extensive experiments verify that our proposed model significantly outperforms state-of-the-art methods on real-world datasets. The source code is available at https://github.com/fuxianghuang1/DWC.


“…Generally, there are two families of works on image retrieval with text feedback based on whether using the pre-trained model. The first line of works mainly studies how to properly combine the features of the two modalities [1,3,13,41]. Content-Style Modulation (CosMo) [18] proposes a new image-based compositor containing two independent modulators.…”

Section: Composed Image Retrieval With Text Feedback (mentioning)

confidence: 99%

Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

Chen, Zheng, Ji, et al. 2022

Preprint


We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. In an attempt to fill this gap, we introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval by considering the multi-grained uncertainty. The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively. Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) The uncertainty modeling simulates the multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Based on the uncertainty modeling, we further introduce uncertainty regularization to adapt the matching objective according to the fluctuation range. Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage, and thus improves the recall rate. On the three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline, respectively.
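The Recall@50 figures quoted in this abstract follow the usual retrieval convention: a query counts as a hit when its target appears among the top K ranked candidates. The following is a minimal sketch of that metric, assuming a single ground-truth target per query. The names `recall_at_k`, `ranked_ids`, and `target_ids` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                target_ids: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target appears in the top-k ranked candidates.

    Assumes exactly one ground-truth target per query (an assumption of
    this sketch, not a detail taken from the abstract).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("one ranked list is required per target")
    if not target_ids:
        return 0.0
    # A query is a hit when its target id is among the first k candidates.
    hits = sum(1 for ranking, target in zip(ranked_ids, target_ids)
               if target in ranking[:k])
    return hits / len(target_ids)
```

Calling `recall_at_k(rankings, targets, 50)` on a model's ranked candidate lists would give the kind of score the abstract compares against its baseline.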


“…Multi-label lazy learning approach was given based on its k nearest neighbors, and maximum a posteriori (MAP) principle was utilized to determine its category. Jandial et al (2022) gave a novel semantic attention composition framework for text-conditioned image retrieval including semantic feature attention and semantic feature modification. However, this method can only retrieve 3D model by tags, and cannot retrieve 3D model based on its content.…”

Section: Introduction (mentioning)

confidence: 99%

3D model retrieval based on interactive attention CNN and multiple features

Jia, Zhang 2023

PeerJ Computer Science


3D (three-dimensional) models are widely applied in daily life, in areas such as mechanical manufacturing, games, biochemistry, art, and virtual reality. With the exponential growth of 3D models on the web and in model libraries, there is an increasing need to retrieve the desired model accurately from a freehand sketch. Researchers are focusing on applying machine learning technology to 3D model retrieval. In this article, we combine semantic features, shape distribution features, and gist features to retrieve 3D models based on interactive attention convolutional neural networks (CNN). The purpose is to improve the accuracy of 3D model retrieval. Firstly, 2D (two-dimensional) views are extracted from the 3D model at six different angles and converted into line drawings. Secondly, an interactive attention module is embedded into the CNN to extract semantic features, which adds data interaction between two CNN layers. The interactive attention CNN extracts effective features from 2D views. The gist algorithm and the 2D shape distribution (SD) algorithm are used to extract global features. Thirdly, Euclidean distance is adopted to calculate the similarity of the semantic features, the similarity of the gist features, and the similarity of the shape distribution features between the sketch and a 2D view. Then, the weighted sum of the three similarities is used to compute the similarity between the sketch and the 2D view for retrieving the 3D model. This addresses the low retrieval accuracy caused by poor extraction of semantic features. Nearest neighbor (NN), first tier (FT), second tier (ST), F-measure (E(F)), and discounted cumulated gain (DCG) are used to evaluate the performance of 3D model retrieval. Experiments are conducted on ModelNet40, and the results show that the proposed method is better than others. The proposed method is feasible in 3D model retrieval.
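The scoring step described in this abstract, a weighted sum of three Euclidean-distance-based similarities between a sketch and a 2D view, can be illustrated roughly as follows. This is a sketch under stated assumptions: the abstract does not say how a distance is turned into a similarity or which weights are used, so the `1 / (1 + d)` mapping, the example weights, and all function and key names here are hypothetical.

```python
import math
from typing import Dict, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map Euclidean distance to a similarity in (0, 1].

    The 1 / (1 + d) mapping is an assumption of this sketch.
    """
    return 1.0 / (1.0 + math.dist(a, b))


def combined_similarity(sketch: Dict[str, Sequence[float]],
                        view: Dict[str, Sequence[float]],
                        weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature similarities between a sketch and a view."""
    return sum(w * similarity(sketch[name], view[name])
               for name, w in weights.items())


# Illustrative feature vectors and weights (hypothetical values).
sketch = {"semantic": [0.0, 0.0], "gist": [1.0, 1.0], "sd": [0.0]}
view = {"semantic": [3.0, 4.0], "gist": [1.0, 1.0], "sd": [1.0]}
weights = {"semantic": 0.5, "gist": 0.3, "sd": 0.2}
score = combined_similarity(sketch, view, weights)
```

Ranking all 2D views by `score` and returning the 3D models they belong to would correspond to the retrieval step the abstract describes.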
