Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval

Yang, Yuchen; Wang, Min; Zhou, Wengang; Li, Houqiang

doi:10.1145/3474085.3475483

Cited by 16 publications

(8 citation statements)

References 37 publications


“…Subsequently, more and more works focus on this CTI-IR task. The previous works [4,5,12,15,18,19,34,36,39,41] can be categorized into two types. The first type of works [5,15,19,34,36] mainly focus on the multi-modal fusion between image and text queries.…”

Section: Related Work, 2.1 Image Retrieval (mentioning)

confidence: 99%

“…Previous approaches [5,12,15,18,19,34,36,39] for this task can be categorized into two types. The first type of works [5,15,19,34,36] mainly focus on designing complex components for the multi-modal fusion between text and image queries.…”

Section: Introduction (mentioning)

confidence: 99%

“…Wen et al [36] propose to combine local-wise and global-wise composition modules for both local and global modification demands. The second type of works [12,18,39] focus on enhancing the semantic embedding space by combining the image&text-to-image matching and image&image-to-text matching with multi-task learning. For example, Yang et al [39] propose an auxiliary module to align the difference between reference and target images with the modification text by joint prediction.…”

Section: Introduction (mentioning)

confidence: 99%

“…The second type of works [12,18,39] focus on enhancing the semantic embedding space by combining the image&text-to-image matching and image&image-to-text matching with multi-task learning. For example, Yang et al [39] propose an auxiliary module to align the difference between reference and target images with the modification text by joint prediction. However, due to the scarcity of supervised data which needs to be in the triplet format as <reference-image, modification-text, target-image>, and the complexity of CTI-IR task which requires both the semantic space learning for target retrieval and cross-modal fusion between hybrid-modality queries, it is hard to effectively learn the complex knowledge together and thus the existing methods in both two types achieve marginally satisfactory retrieval results, with only about 30% of queries retrieving the correct image in the top-10 rank.…”

Section: Introduction (mentioning)

confidence: 99%


Progressive Learning for Image Retrieval with Hybrid-Modality Queries

Zhao, Song, Jin. 2022. Preprint.

Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities. For example, a target product image is searched using a reference product image along with text about changing certain attributes of the reference image as the query. It is a more challenging image retrieval task that requires both semantic space learning and cross-modal fusion. Previous approaches that attempt to deal with both aspects achieve unsatisfactory performance. In this paper, we decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries. We first leverage the semantic embedding space for open-domain image-text retrieval, and then transfer the learned knowledge to the fashion domain with fashion-related pre-training tasks. Finally, we enhance the pre-trained model from single-query to hybrid-modality query for the CTI-IR task. Furthermore, as the contribution of each individual modality in the hybrid-modality query varies for different retrieval scenarios, we propose a self-supervised adaptive weighting strategy to dynamically determine the importance of image and text in the hybrid-modality query for better retrieval. Extensive experiments show that our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.

CCS Concepts: Information systems → Image search.
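The abstract reports its gains as the mean of Recall@K. As a rough illustration only (this is not the authors' evaluation code), Recall@K can be read as the fraction of queries whose target image appears among the K highest-scoring candidates. The function name `recall_at_k`, the score layout, and the toy numbers below are all assumptions made for this sketch.

```python
def recall_at_k(scores, target_idx, k):
    """Fraction of queries whose target is among the k best-scoring candidates.

    scores: list of per-query score lists, one score per candidate image
            (hypothetical layout; higher means more similar).
    target_idx: index of the correct candidate for each query.
    """
    hits = 0
    for row, target in zip(scores, target_idx):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy data (made up): 3 queries, 4 candidate images each.
scores = [
    [0.9, 0.1, 0.3, 0.2],  # target 0 is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # target 2 is ranked 2nd
    [0.7, 0.6, 0.5, 0.1],  # target 3 is ranked 4th
]
targets = [0, 2, 3]

r1 = recall_at_k(scores, targets, 1)
r2 = recall_at_k(scores, targets, 2)
mean_recall = (r1 + r2) / 2  # mean over the chosen K values
```

Which K values are averaged depends on the benchmark protocol; the two values used here are picked only to keep the toy example small.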


“…A compositor plays a fundamental role to integrate the textual information with the imagery modality. TGR compositors have been proposed based on various techniques, such as gating mechanism [49], hierarchical attention [7,23,12,20], graph neural network [54,44], joint learning [6,27,44,52,55], ensemble learning [50], style-content modification [29,5] and vision & language pre-training [32].…”

Section: Related Work (mentioning)

confidence: 99%

UIGR: Unified Interactive Garment Retrieval

Han, He, Zhang, et al. 2022. Preprint.

Interactive garment retrieval (IGR) aims to retrieve a target garment image based on a reference garment image along with user feedback on what to change on the reference garment. Two IGR tasks have been studied extensively: text-guided garment retrieval (TGR) and visually compatible garment retrieval (VCR). The user feedback for the former indicates what semantic attributes to change with the garment category preserved, while the category is the only thing to be changed explicitly for the latter, with an implicit requirement on style preservation. Despite the similarity between these two tasks and the practical need for an efficient system tackling both, they have never been unified and modeled jointly. In this paper, we propose a Unified Interactive Garment Retrieval (UIGR) framework to unify TGR and VCR. To this end, we first contribute a large-scale benchmark suited for both problems. We further propose a strong baseline architecture to integrate TGR and VCR in one model. Extensive experiments suggest that unifying two tasks in one framework is not only more efficient by requiring a single model only, it also leads to better performance. Code and datasets are available at GitHub.


EENet: embedding enhancement network for compositional image-text retrieval using generated text

Hur, Park. 2023. Multimed Tools Appl.
