2022
DOI: 10.1007/978-3-031-19836-6_10

Adaptive Fine-Grained Sketch-Based Image Retrieval

Cited by 15 publications (13 citation statements)
References 42 publications
“…Furthermore, sketch traits like style diversity [52], data scarcity [5], and redundancy of sketch strokes [6] were addressed in favor of retrieval. Towards generalising to novel classes, while [42] modelled a universal manifold of prototypical visual sketch traits embedding sketch and photo, [8] adapted to new classes via a few supporting sketch-photo pairs. In this paper, we aim to address the problem of zero-shot cross-category FG-SBIR, leveraging the zero-shot potential of a foundation model like CLIP [46].…”
Section: Related Work
confidence: 99%
“…In particular, along with the triplet loss, we impose a classification loss on the sketch/photo joint-embedding space. For this, instead of the usual auxiliary $N_s$-class FC-layer-based classification head [8,16,19], we use CLIP's text encoder, which is already enriched with semantic-visual associations, to compute the classification objective. Following [24], we construct a set of handcrafted prompt templates like 'a photo of a [category]' to obtain a list of classification weight vectors $\{t_j\}_{j=1}^{N_s}$ from CLIP's text encoder, where the '[category]' token is filled with a specific class name from a list of $N_s$ seen classes.…”
Section: Prompt Learning for ZS-SBIR
confidence: 99%
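
To make the quoted recipe concrete, here is a minimal sketch (not the authors' released code) of building classification weights from CLIP's text encoder via handcrafted prompts, and combining the resulting cross-entropy objective with a triplet loss over the joint embedding space. The openai/CLIP package is assumed; the class list, temperature, margin, and loss weight `lam` are illustrative placeholders.

```python
# Minimal sketch: CLIP-text classification weights from handcrafted prompts,
# combined with a triplet loss on a joint sketch/photo embedding space.
# Class names, `temperature`, `margin`, and `lam` are illustrative assumptions.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

seen_classes = ["cat", "shoe", "chair"]  # hypothetical list of N_s seen classes
prompts = [f"a photo of a {c}" for c in seen_classes]  # handcrafted template

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    t = model.encode_text(tokens).float()   # {t_j}, shape (N_s, d)
    t = F.normalize(t, dim=-1)              # unit-norm classification weights

def classification_loss(embeddings, labels, temperature=0.01):
    """Cross-entropy over cosine similarities to the CLIP text weights.

    `embeddings`: sketch/photo features from the joint space, shape (B, d),
    with d matching CLIP's text dimension. `labels`: indices into seen_classes.
    """
    z = F.normalize(embeddings, dim=-1)
    logits = z @ t.T / temperature          # (B, N_s) similarity logits
    return F.cross_entropy(logits, labels)

def total_loss(s, p_pos, p_neg, labels, margin=0.2, lam=1.0):
    """Triplet loss over (sketch, matching photo, non-matching photo),
    plus the CLIP-text classification loss on both modalities."""
    trip = F.triplet_margin_loss(s, p_pos, p_neg, margin=margin)
    cls = classification_loss(torch.cat([s, p_pos]), labels.repeat(2))
    return trip + lam * cls
```

In this sketch the text-derived weight vectors stay frozen, so the classification head carries CLIP's semantic structure rather than parameters fitted only to the seen classes, which is what the quoted passage leverages for the zero-shot setting.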