2021
DOI: 10.1007/978-3-030-77004-4_1
A Straightforward Framework for Video Retrieval Using CLIP

Cited by 74 publications (43 citation statements)
References 13 publications
“…Like other recent works (Gao et al. 2021), we focus our results on this dataset. One of the reasons this is a good task and dataset for generally testing the value of the SMs approach is that there is already a strong zero-shot baseline, provided by Portillo-Quintero et al. (2021), which uses CLIP by itself but does not use the Socratic method: there is no multimodal exchange, and no LMs are used. Additionally, this task provides a great opportunity to incorporate another type of modality, speech-to-text from audio data.…”
Section: MSR-VTT 1K-A (mentioning)
confidence: 99%
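
As a point of reference for the zero-shot baseline described above, the following is a minimal sketch of retrieval with CLIP alone: frames sampled from each video are encoded with CLIP's image encoder, aggregated into a single video embedding (mean pooling is shown here), and candidate videos are ranked by cosine similarity against the caption embedding. The frame-sampling strategy, the ViT-B/32 backbone, and the helper names (encode_video, rank_videos) are illustrative assumptions, not the exact published configuration.

    # Minimal sketch: zero-shot text-to-video retrieval with CLIP alone.
    # Frame sampling and mean-pooling aggregation are illustrative choices.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def encode_video(frame_paths):
        # Encode sampled frames with CLIP's image encoder and mean-pool them
        # into a single, L2-normalized video embedding.
        frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
        with torch.no_grad():
            feats = model.encode_image(frames)              # (num_frames, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)    # normalize each frame
        video = feats.mean(dim=0)                           # aggregate by mean pooling
        return video / video.norm()

    def rank_videos(query, video_embeddings):
        # Rank precomputed video embeddings by cosine similarity to the caption.
        tokens = clip.tokenize([query]).to(device)
        with torch.no_grad():
            text = model.encode_text(tokens)[0]
        text = text / text.norm()
        scores = video_embeddings @ text                    # cosine similarities
        return scores.argsort(descending=True)

In this sketch, video_embeddings would be a tensor stacking the outputs of encode_video over the gallery; no language model or cross-modal exchange is involved, which is exactly what makes it a zero-shot baseline.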
“…Refining Large-scale Image Models. CLIP's strong visual representation inspired multiple researchers to explore its use for video tasks [9,14,30,37,38,47]. CLIP4CLIP, for instance, proposed a straightforward strategy to fine-tune CLIP for the text-to-video retrieval task [30].…”
Section: Related Work (mentioning)
confidence: 99%
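
To make the quoted fine-tuning strategy concrete, here is a hedged sketch of the symmetric contrastive (InfoNCE-style) objective commonly used when fine-tuning a CLIP-like dual encoder on matched text-video pairs; the temperature value, batch construction, and the stand-in random embeddings are illustrative assumptions rather than the CLIP4CLIP recipe.

    # Hedged sketch: symmetric contrastive loss for fine-tuning a dual encoder
    # on text-video pairs. Matched pairs share the same row index in the batch.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(video_emb, text_emb, temperature=0.05):
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Smoke test with random tensors standing in for CLIP embeddings of
    # mean-pooled frames and tokenized captions.
    loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))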
“…We compare with the state-of-the-art methods that use the same data splits as we have mentioned. In particular, the following eleven published models are included for comparison: W2VV++ [23], Collaborative Experts (CE) [29], Tree-augmented Cross-modal Encoding (TCE) [53], Hierarchical Graph Reasoning (HGR) [8], Sentence Encoder Assembly (SEA) [26], Multi-Modal Transformers (MMT) [16], Dual Encoding (DE) [11], Support-Set Bottlenecks (SSB) [37], Self-Supervised Multi-modal Learning (SSML) [1], CLIP [38] and CLIP with Feature Re-Learning (CLIP-FRL) [6]. Furthermore, we fine-tune CLIP, termed CLIP-FT. After this, we retrain W2VV++ and SEA with the same features as ours and apply LAFFml to substitute for the mean pooling mechanism on the frame-level features.…”
Section: Combined Loss Versus Single Loss (mentioning)
confidence: 99%
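
The substitution of mean pooling mentioned in the passage above can be illustrated with a small attention-weighted pooling module over frame-level features. This is a generic LAFF-style sketch; the module name, layer sizes, and frame count are assumptions, not the published LAFFml implementation.

    # Hedged sketch: attention-weighted pooling as a drop-in replacement for
    # mean pooling over frame-level features.
    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # One scalar score per frame; softmax turns scores into weights.
            self.scorer = nn.Linear(dim, 1)

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, dim)
            weights = torch.softmax(self.scorer(frame_feats), dim=1)  # (batch, num_frames, 1)
            return (weights * frame_feats).sum(dim=1)                 # (batch, dim)

    # Usage: replaces frame_feats.mean(dim=1) in the retrieval head,
    # shown here with 12 sampled frames of 512-d features per video.
    pool = AttentionPooling(dim=512)
    video_emb = pool(torch.randn(2, 12, 512))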