2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2021
DOI: 10.1109/cvprw53098.2021.00448
Leveraging Style and Content features for Text Conditioned Image Retrieval

Cited by 8 publications (6 citation statements)
References 7 publications
“…Anwaar et al (2021) use an autoencoder-based model to map the reference and the target images into the same complex space and learn the text modifier representation as a transformation in this space. Lee et al (2021) and Chawla et al (2021) both propose to disentangle the multi-modal information into content and style. resort to images' descriptive texts as side information to train a joint visual-semantic space, training a TIRG model on top.…”
Section: Related Work
confidence: 99%
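The "transformation in complex space" idea in the excerpt above can be illustrated with a minimal sketch: the image embedding is viewed as a vector of complex numbers, and the text modifier supplies per-dimension rotation angles. This is only a toy illustration of the general idea, not the cited model; the function name and shapes are assumptions.

```python
import numpy as np

def complex_compose(ref_emb, text_angles):
    """Rotate a reference image embedding, viewed as complex numbers,
    by per-dimension angles derived from the text modifier (a sketch)."""
    d = ref_emb.shape[0] // 2
    z = ref_emb[:d] + 1j * ref_emb[d:]        # view the embedding as complex
    z_rot = z * np.exp(1j * text_angles)      # text acts as a rotation
    return np.concatenate([z_rot.real, z_rot.imag])

rng = np.random.default_rng(0)
ref = rng.standard_normal(8)                  # toy 8-dim image embedding
angles = rng.standard_normal(4)               # toy per-dimension angles
out = complex_compose(ref, angles)
```

A rotation leaves the per-dimension complex magnitude unchanged, so the transformation modifies the embedding's "direction" while preserving its scale.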
“…In contrast to most methods described above, ARTEMIS does not compose modalities into a joint global feature for the query (Vo et al, 2019; Lee et al, 2021), does not compute costly cross-attention involving the target image (Hosseinzadeh & Wang, 2020; Chawla et al, 2021), and does not extract multi-level visual representations. Instead, it leverages the textual modifier in simple attention mechanisms to weight the dimensions of the visual representation, emphasizing the characteristics on which the matching should focus.…”
Section: Related Work
confidence: 99%
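The dimension-weighting idea described above can be sketched in a few lines: the text embedding produces per-dimension weights that emphasize text-relevant visual dimensions before scoring reference against target. This is a rough illustration of attention-as-dimension-weighting, not the actual ARTEMIS model; all names and shapes here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_match(ref_img, tgt_img, text_emb, W):
    """Score a (reference, text) query against a target image by letting
    the text choose which visual dimensions matter (toy sketch)."""
    a = sigmoid(W @ text_emb)                 # per-dimension weights from text
    q = a * ref_img                           # emphasize text-relevant dims
    t = a * tgt_img
    return float(q @ t / (np.linalg.norm(q) * np.linalg.norm(t) + 1e-8))

rng = np.random.default_rng(1)
d_img, d_txt = 6, 4
W = rng.standard_normal((d_img, d_txt))       # toy projection text -> weights
score = weighted_match(rng.standard_normal(d_img),
                       rng.standard_normal(d_img),
                       rng.standard_normal(d_txt), W)
```

Because the score is a cosine similarity over the reweighted features, it stays in [-1, 1]; no joint query feature and no cross-attention over the target are needed.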
“…A compositor plays a fundamental role in integrating the textual information with the imagery modality. TGR compositors have been proposed based on various techniques, such as gating mechanism [49], hierarchical attention [7,23,12,20], graph neural network [54,44], joint learning [6,27,44,52,55], ensemble learning [50], style-content modification [29,5] and vision & language pre-training [32].…”
Section: Related Work
confidence: 99%
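Of the compositor families listed above, the gating mechanism is the simplest to sketch: a learned sigmoid gate decides how much of the image feature to keep, while a residual term injects the text modification. This is a minimal illustration in the spirit of gated residual composition, not the exact model of [49]; the weight matrices and dimensions are assumptions.

```python
import numpy as np

def gated_compose(img, txt, Wg, Wr):
    """Gated residual compositor sketch: gate * img + residual(img, txt)."""
    xt = np.concatenate([img, txt])           # fuse image and text features
    gate = 1.0 / (1.0 + np.exp(-(Wg @ xt)))   # sigmoid gate over image dims
    residual = Wr @ xt                        # text-driven modification
    return gate * img + residual

rng = np.random.default_rng(2)
d_img, d_txt = 5, 3
Wg = rng.standard_normal((d_img, d_img + d_txt))
Wr = rng.standard_normal((d_img, d_img + d_txt))
composed = gated_compose(rng.standard_normal(d_img),
                         rng.standard_normal(d_txt), Wg, Wr)
```

The gate keeps the composed feature anchored to the reference image, which is useful when the text modifier describes only a small change.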
“…The next issue is the application of the image search algorithm. Image search is a fundamental task playing a significant role in the success of a wide variety of frameworks and applications [8]. An important method to compare semantic similarity between text and images is CLIP (Contrastive Language-Image Pre-Training).…”
Section: Introduction
confidence: 99%
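The text-image comparison CLIP performs reduces to scaled cosine similarity between L2-normalized embeddings. A minimal sketch of that scoring step, with precomputed vectors standing in for the actual CLIP text and image encoders (the temperature value is an assumption):

```python
import numpy as np

def clip_style_similarity(text_embs, image_embs, temperature=0.07):
    """CLIP-style scoring: L2-normalize both sides, then take scaled
    cosine similarities between every text and every image."""
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    return (t @ v.T) / temperature            # rows: texts, cols: images

texts = np.array([[1.0, 0.0], [0.0, 1.0]])    # stand-in text embeddings
images = np.array([[1.0, 0.0], [0.5, 0.5]])   # stand-in image embeddings
sims = clip_style_similarity(texts, images)
```

For retrieval, each text row of `sims` is ranked (or softmaxed) over the image columns; the aligned pair scores highest.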