2022
DOI: 10.48550/arxiv.2210.03114
Preprint

CLIP model is an Efficient Continual Learner

Abstract: The continual learning setting aims to learn new tasks over time without forgetting the previous ones. The literature reports several significant efforts to tackle this problem with limited or no access to previous task data. Among such efforts, typical solutions offer sophisticated techniques involving memory replay, knowledge distillation, model regularization, and dynamic network expansion. The resulting methods have a retraining cost at each learning task, dedicated memory requirements, and setting-specific…
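The paper's central claim lends itself to a short illustration: because CLIP matches images against text prompts, a frozen model can "learn" a new task simply by extending the prompt set, with no retraining, replay memory, or network expansion. Below is a minimal sketch of this zero-shot continual evaluation, assuming OpenAI's `clip` package (installable from github.com/openai/CLIP); the task class lists and image path are illustrative, not taken from the paper.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The model stays frozen; "learning" a task only registers its class names.
seen_classes = []

def learn_task(new_classes):
    """Continual step: accumulate the new task's labels, no weight updates."""
    seen_classes.extend(new_classes)

@torch.no_grad()
def classify(image_path):
    """Zero-shot prediction over every class seen so far."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize(
        [f"a photo of a {c}" for c in seen_classes]).to(device)
    img_f = model.encode_image(image)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    sims = (img_f @ txt_f.T).squeeze(0)
    return seen_classes[sims.argmax().item()]

learn_task(["airplane", "automobile"])  # task 1
learn_task(["bird", "cat"])             # task 2
# classify("frame.jpg") now scores all four classes seen across tasks.
```

Because no parameters change between tasks, there is nothing to forget; the trade-off is that accuracy is bounded by what CLIP's pretraining already covers.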

Cited by 1 publication (1 citation statement) | References 29 publications
"…Here, $V^B$ denotes the video dataset for task $B$, $T^B$ denotes the corresponding label set, $f_{\theta_V}^B$ denotes the visual encoder, and $f_{\theta_T}^B$ denotes the text encoder. It is worth noting that the text encoder is usually frozen during training [44, 45], thus the fine-tuning stage primarily concentrates on the optimization of the visual encoder for adaptation to the video domain. For the sake of brevity, the superscript will be omitted in the subsequent paragraphs.…"
Section: Preliminary: Video Action Recognition Using CLIP (citation type: mentioning)
confidence: 99%
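To make the frozen-text-encoder setup described in the citing paper concrete, here is a minimal sketch of that fine-tuning configuration, assuming OpenAI's `clip` package and plain PyTorch; the mean-pooling of frame features over time and the training hyperparameters are illustrative assumptions, not details from either paper.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.float()  # train in fp32 for simplicity (clip loads fp16 on CUDA)

# Freeze every text-side component; only the visual encoder stays trainable.
for module in (model.transformer, model.token_embedding, model.ln_final):
    module.requires_grad_(False)
model.positional_embedding.requires_grad = False
model.text_projection.requires_grad = False

optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def train_step(frames, text_tokens, labels):
    """One step: frames (B, T, 3, H, W), text_tokens from clip.tokenize."""
    B, T = frames.shape[:2]
    # Encode each frame, then mean-pool over time as a simple video feature.
    img_f = model.encode_image(frames.flatten(0, 1)).view(B, T, -1).mean(1)
    with torch.no_grad():
        txt_f = model.encode_text(text_tokens)  # frozen text encoder
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * img_f @ txt_f.T
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the text encoder frozen preserves CLIP's language embedding space, so class-name prompts remain comparable across tasks while the visual side adapts to video frames.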