2022
DOI: 10.1007/978-3-031-19833-5_23

Frozen CLIP Models are Efficient Video Learners

Cited by 58 publications (15 citation statements)
References 27 publications
“…Hence, it is better to optimize diffusion models to enable the inverse back-projection of noise sequences into a latent, time-continuous video space aligned with human perception [55,32,76]. To achieve this aim, previous work incorporates the power of large language models (LLMs) into the design of video generation [75], while other approaches effectively capture information using a frozen CLIP encoder [33]. However, they primarily consider semantic correlations between frames and do not adequately address dense correlations (e.g., patches, key points) across frames, which are crucial for frame continuity at a finer granularity.…”
Section: 3
confidence: 99%
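The frozen-CLIP approach cited as [33] above (the paper this page indexes) builds video representations from a fixed CLIP image encoder rather than fine-tuning it. Below is a minimal sketch of that general idea in PyTorch; the model variant, the openai clip package, and the temporal mean pooling are illustrative assumptions, not the paper's exact design.

# Minimal sketch: per-frame features from a frozen CLIP image encoder.
# Assumes the openai "clip" package and a video tensor of shape (T, 3, 224, 224)
# already preprocessed for CLIP; pooling choice is an assumption.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.eval()

@torch.no_grad()
def encode_video(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, 224, 224) -> a single (D,) video embedding
    feats = model.encode_image(frames.to(device))      # (T, D); the encoder stays frozen
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize per frame
    return feats.mean(dim=0)                           # temporal mean pooling (assumed)

Because the encoder is never updated, only lightweight components on top of the pooled frame features need training, which is what makes this recipe efficient.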
“…Representative works such as CLIP [43] project images and natural language descriptions into a common feature space through two separate encoders for contrastive learning, and achieve significant "zero-shot" transferability by pre-training on hundreds of millions of image-text pairs. Subsequently, these pre-trained models have been extended to various downstream tasks and have shown excellent performance, including image classification [81,80], object detection [48,15], semantic segmentation [63,45], and video understanding [34,22,35]. Inspired by these successes, in this work we present the first simple but efficient framework to leverage the rich semantic knowledge of CLIP for few-shot action recognition.…”
Section: Related Work
confidence: 99%
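The quoted passage describes CLIP's dual-encoder design: separate image and text encoders trained contrastively into a shared space, which enables zero-shot transfer by comparing an image against text prompts. A minimal sketch of that zero-shot classification step follows, assuming the openai clip package; the class names, prompt template, and image path are hypothetical.

# Minimal sketch of CLIP-style zero-shot classification: two separate encoders map
# images and text prompts into a shared space; cosine similarity scores the classes.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["archery", "bowling", "surfing"]                       # hypothetical labels
texts = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)   # hypothetical path

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(texts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)                 # zero-shot class scores

print(dict(zip(class_names, probs[0].tolist())))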
“…Popular image-language models such as CLIP [83] and ALIGN [48] are trained on massive datasets by using web images and alt-text. Similarly, video-language models are catching up and can be categorised into two broad directions: (i) adapting image-language models for videos [8,22,49,50,62,65,71,108,110,119], and (ii) pure video-based models that are learned using large video-text datasets [3,7,26-28,30,57,61,64,67,68,95,117]. Recently, a new paradigm of post-pretraining has emerged where an existing image- or video-language model goes through another stage of self-supervised pretraining on a small amount of video data before it is evaluated on downstream tasks [65,119].…”
Section: Foundational Video-language Models
confidence: 99%
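The post-pretraining paradigm mentioned above amounts to continuing contrastive training of an existing image- or video-language model on a comparatively small video-text corpus before downstream evaluation. The sketch below shows the symmetric contrastive (InfoNCE-style) objective such a stage typically optimizes; the function name, temperature, and batch-pairing setup are assumptions, not any specific paper's recipe.

# Minimal sketch of the contrastive objective used when post-pretraining on paired
# video-text data: each clip is matched to its own caption against all others in the batch.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # video_emb, text_emb: (B, D) L2-normalized embeddings of paired clips and captions
    logits = video_emb @ text_emb.T / temperature         # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +      # video -> text direction
                  F.cross_entropy(logits.T, targets))     # text -> video direction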