2022
DOI: 10.1007/978-3-031-19833-5_29

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

Cited by 82 publications (39 citation statements)
References 31 publications
“…Fine-tuning the CLIP model or using advanced language-image models (e.g., [LLXH22, LSG*21, ZZF*22]) is promising. Obtaining enough chart-description pairs for training is also necessary, even though the process is time-consuming and resource-intensive.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Compared to CLIP-based few-shot adaptation methods, including linear-probe CLIP [4], CoOp [5], WiSE-FT [48], and Tip-Adapter-F [66], our TEG combines the strengths of linear-probe CLIP for domain transfer with the guidance information from embedded text features, utilizing the proposed text-guided classifier to achieve superior performance in theme recognition. In addition, the auxiliary classification loss also helps the learning of visual concepts and stabilizes the network training.…”
Section: Methods (mentioning)
Confidence: 99%
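The TEG architecture itself is not spelled out in this excerpt; the following is only a generic sketch of the idea it describes, i.e. fusing a learnable linear probe with frozen CLIP text embeddings as guidance. The class name, blend weight, and tensor shapes are illustrative assumptions, not the cited paper's implementation.

import torch
import torch.nn as nn

class TextGuidedClassifier(nn.Module):
    # Hypothetical sketch: combine a learnable linear probe (as in
    # linear-probe CLIP) with fixed, L2-normalized CLIP text embeddings.
    def __init__(self, dim, num_classes, text_features, fuse_weight=0.5):
        super().__init__()
        self.linear_probe = nn.Linear(dim, num_classes)  # trained on few shots
        # Frozen text embeddings, one per class: shape (num_classes, dim).
        self.register_buffer("text_features", text_features)
        self.fuse_weight = fuse_weight  # illustrative blending coefficient

    def forward(self, image_features):
        # image_features: (N, dim) L2-normalized CLIP image features.
        probe_logits = self.linear_probe(image_features)        # (N, C)
        text_logits = image_features @ self.text_features.t()   # (N, C)
        return probe_logits + self.fuse_weight * text_logits

The auxiliary classification loss the excerpt mentions would be applied to these logits during training; it is omitted from this sketch.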
“…A contrastive learning-based image and NLP model developed by OpenAI, CLIP combines unsupervised learning methods on large-scale image and text corpora, aiming to embed text and images into the same space so that similarities can be compared across modalities. Tip-Adapter (Zhang, Wei, et al., 2022): a lightweight and scalable model adaptation technique based on CLIP, which enables rapid adaptation to new tasks or domains without retraining the entire model.…”
Section: Methods (mentioning)
Confidence: 99%
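For concreteness, the training-free cache mechanism Tip-Adapter builds on top of CLIP can be sketched as follows. The function mirrors the construction described in the paper, a key-value cache of few-shot features and one-hot labels blended with zero-shot CLIP logits, but the variable names, the logit scale of 100, and the hyperparameter values here are illustrative.

import torch

def tip_adapter_logits(test_feat, text_weights, cache_keys, cache_values,
                       alpha=1.0, beta=5.5):
    # test_feat:    (N, D) L2-normalized CLIP image features of test images
    # text_weights: (D, C) L2-normalized CLIP text embeddings, one per class
    # cache_keys:   (D, K) L2-normalized features of the K few-shot images
    # cache_values: (K, C) one-hot labels of the K few-shot images

    # Zero-shot CLIP logits: scaled cosine similarity to the class prompts.
    clip_logits = 100.0 * test_feat @ text_weights                     # (N, C)

    # Affinity between each test feature and every cached few-shot feature,
    # turned into non-negative weights sharpened by beta.
    affinity = test_feat @ cache_keys                                  # (N, K)
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values  # (N, C)

    # Blend the cache prediction with the zero-shot prior via alpha.
    return clip_logits + alpha * cache_logits

Because every quantity is a pre-computed CLIP feature, no gradient updates are required, which is the "training-free" property the excerpt refers to; the Tip-Adapter-F variant additionally fine-tunes the cache keys.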