2022
DOI: 10.48550/arxiv.2212.03191
Preprint

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Cited by 12 publications (26 citation statements)
References: 0 publications
“…As shown in Table 5, we find that current researchers typically improve the zero-shot retrieval performance by pretraining on a vast array of datasets. For example, OMNIVL [70] trains models on 7 more datasets in addition to ImageNet, CLIP4CLIP [9] fine-tunes the CLIP-based model on the HowTo100M-380k dataset, and InternVideo [11] fine-tunes the CLIP-based model on 9 more large datasets in addition to WIT-400M [4]. Instead, Open-VCLIP++ trains CLIP models only on Kinetics-400 and achieves comparable or better results.…”
Section: Methods (mentioning)
confidence: 99%
“…SS-V2 [68], K400 [46] and K710 [69] are action recognition datasets. SC-V denotes the self-collected video dataset in [11]. …adapter modules, while keeping the original parameters of the CLIP model frozen.…”
Section: Comparison With Parameter-efficient Fine-tuning (mentioning)
confidence: 99%
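The excerpt above describes adapter-based, parameter-efficient fine-tuning on top of a frozen CLIP backbone. Below is a minimal, illustrative sketch of that general idea, not the cited paper's exact implementation; the names `AdapterHead` and `FrozenCLIPWithAdapter`, the bottleneck width, and the classification head are assumptions for the example.

```python
# Hypothetical sketch of parameter-efficient fine-tuning: a small bottleneck
# adapter plus classifier trained while the CLIP visual encoder stays frozen.
import torch
import torch.nn as nn


class AdapterHead(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen CLIP features.
        return x + self.up(self.act(self.down(x)))


class FrozenCLIPWithAdapter(nn.Module):
    def __init__(self, clip_visual: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = clip_visual
        for p in self.backbone.parameters():   # freeze all original CLIP weights
            p.requires_grad = False
        self.adapter = AdapterHead(feat_dim)   # only adapter + classifier are trained
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # backbone runs in inference mode
            feats = self.backbone(frames)
        return self.classifier(self.adapter(feats))
```

In this setup, only the adapter and classifier parameters receive gradients, so the memory and storage cost of fine-tuning is a small fraction of updating the full CLIP model.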
“…With the great success of the vision-language pretrained models, some work [1,26] has directly used the VideoQA task as a downstream task on which to fine-tune the pretrained model. The image-language pretrained model [3,11,13,23,31] has more advances than the video-language pretrained model [19,22,24,29,32]. In this paper, our work builds on two of the current state-of-the-art image-language pretrained models [12,15] for entity detection and question answering, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
“…Video foundation models (ViFMs) hold enormous potential to unlock new insights within this vast corpus. While prior work has made great progress towards general video understanding (Xu et al., 2021; Wang et al., 2022c; Yan et al., 2022; Tong et al., 2022; Li et al., 2023b; Wang et al., 2023c), building a truly foundational video model is still an elusive goal. Existing models often struggle to balance appearance-heavy tasks with motion-centric reasoning, falling behind task-specialized models across many benchmarks (Yuan et al., 2023).…”
Section: Introduction (mentioning)
confidence: 99%