2021
DOI: 10.1007/978-3-030-77004-4_1
A Straightforward Framework for Video Retrieval Using CLIP

Cited by 74 publications (43 citation statements)
References 13 publications
“…Like other recent works (Gao et al. 2021), we focus our results on this dataset. One of the reasons this is a good task and dataset for generally testing the value of the SMs approach is that there is already a strong zero-shot baseline, provided by Portillo-Quintero et al. (2021), which uses CLIP by itself but does not use the Socratic method: there is no multimodal exchange, and no LMs are used. Additionally, this task provides a great opportunity to incorporate another type of modality, speech-to-text from audio data.…”
Section: MSR-VTT 1K-A (mentioning)
confidence: 99%
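
As a point of reference for the zero-shot baseline described above, the following is a minimal sketch of retrieval with CLIP alone: frames sampled from each video are encoded with CLIP's image encoder, aggregated into a single video embedding (mean pooling is shown here), and candidate videos are ranked by cosine similarity against the caption embedding. The frame-sampling strategy, the ViT-B/32 backbone, and the helper names (encode_video, rank_videos) are illustrative assumptions, not the exact published configuration.

    # Minimal sketch: zero-shot text-to-video retrieval with CLIP alone.
    # Frame sampling and mean-pooling aggregation are illustrative choices.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def encode_video(frame_paths):
        # Encode sampled frames with CLIP's image encoder and mean-pool them
        # into a single, L2-normalized video embedding.
        frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
        with torch.no_grad():
            feats = model.encode_image(frames)              # (num_frames, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)    # normalize each frame
        video = feats.mean(dim=0)                           # aggregate by mean pooling
        return video / video.norm()

    def rank_videos(query, video_embeddings):
        # Rank precomputed video embeddings by cosine similarity to the caption.
        tokens = clip.tokenize([query]).to(device)
        with torch.no_grad():
            text = model.encode_text(tokens)[0]
        text = text / text.norm()
        scores = video_embeddings @ text                    # cosine similarities
        return scores.argsort(descending=True)

In this sketch, video_embeddings would be a tensor stacking the outputs of encode_video over the gallery; no language model or cross-modal exchange is involved, which is exactly what makes it a zero-shot baseline.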
“…Refining Large-scale Image Models. CLIP's strong visual representation inspired multiple researchers to explore its use for video tasks [9,14,30,37,38,47]. CLIP4CLIP, for instance, proposed a straightforward strategy to fine-tune CLIP for the text-to-video retrieval task [30].…”
Section: Related Work (mentioning)
confidence: 99%
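
To make the quoted fine-tuning strategy concrete, here is a hedged sketch of the symmetric contrastive (InfoNCE-style) objective commonly used when fine-tuning a CLIP-like dual encoder on matched text-video pairs; the temperature value, batch construction, and the stand-in random embeddings are illustrative assumptions rather than the CLIP4CLIP recipe.

    # Hedged sketch: symmetric contrastive loss for fine-tuning a dual encoder
    # on text-video pairs. Matched pairs share the same row index in the batch.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(video_emb, text_emb, temperature=0.05):
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Smoke test with random tensors standing in for CLIP embeddings of
    # mean-pooled frames and tokenized captions.
    loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))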
“…We compare with the state-of-the-art methods that use the same data splits as we have mentioned. In particular, the following eleven published models are included for comparison: W2VV++ [23], Collaborative Experts (CE) [29], Tree-augmented Cross-modal Encoding (TCE) [53], Hierarchical Graph Reasoning (HGR) [8], Sentence Encoder Assembly (SEA) [26], Multi-Modal Transformers (MMT) [16], Dual Encoding (DE) [11], Support-Set Bottlenecks (SSB) [37], Self-Supervised Multi-modal Learning (SSML) [1], CLIP [38] and CLIP with Feature Re-Learning (CLIP-FRL) [6]. Furthermore, we fine-tune CLIP, termed CLIP-FT. After this, we retrain W2VV++ and SEA with the same features as ours and apply LAFFml to substitute for the mean pooling mechanism on the frame-level features.…”
Section: Combined Loss Versus Single Loss (mentioning)
confidence: 99%
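
The substitution of mean pooling mentioned in the passage above can be illustrated with a small attention-weighted pooling module over frame-level features. This is a generic LAFF-style sketch; the module name, layer sizes, and frame count are assumptions, not the published LAFFml implementation.

    # Hedged sketch: attention-weighted pooling as a drop-in replacement for
    # mean pooling over frame-level features.
    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # One scalar score per frame; softmax turns scores into weights.
            self.scorer = nn.Linear(dim, 1)

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, dim)
            weights = torch.softmax(self.scorer(frame_feats), dim=1)  # (batch, num_frames, 1)
            return (weights * frame_feats).sum(dim=1)                 # (batch, dim)

    # Usage: replaces frame_feats.mean(dim=1) in the retrieval head,
    # shown here with 12 sampled frames of 512-d features per video.
    pool = AttentionPooling(dim=512)
    video_emb = pool(torch.randn(2, 12, 512))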