2022
DOI: 10.48550/arxiv.2212.03191
Preprint

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Cited by 12 publications (26 citation statements)
References: 0 publications
“…As shown in Table 5, we find that current researchers typically improve the zero-shot retrieval performance by pretraining on a vast array of datasets. For example, OMNIVL [70] trains models on 7 more datasets in addition to ImageNet, CLIP4CLIP [9] fine-tunes the CLIP-based model on the HowTo100M-380k dataset, and InternVideo [11] fine-tunes the CLIP-based model on 9 more large datasets in addition to WIT-400M [4]. Instead, Open-VCLIP++ trains CLIP models only on Kinetics-400 and achieves comparable or better results.…”
Section: Methods (mentioning)
confidence: 99%
“…SS-V2 [68], K400 [46] and K710 [69] are action recognition datasets. SC-V denotes the self-collected video dataset in [11]. …adapter modules, while keeping the original parameters of the CLIP model frozen.…”
Section: Comparison With Parameter-efficient Fine-tuning (mentioning)
confidence: 99%
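The excerpt above describes adapter-based, parameter-efficient fine-tuning on top of a frozen CLIP backbone. Below is a minimal, illustrative sketch of that general idea, not the cited paper's exact implementation; the names `AdapterHead` and `FrozenCLIPWithAdapter`, the bottleneck width, and the classification head are assumptions for the example.

```python
# Hypothetical sketch of parameter-efficient fine-tuning: a small bottleneck
# adapter plus classifier trained while the CLIP visual encoder stays frozen.
import torch
import torch.nn as nn


class AdapterHead(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen CLIP features.
        return x + self.up(self.act(self.down(x)))


class FrozenCLIPWithAdapter(nn.Module):
    def __init__(self, clip_visual: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = clip_visual
        for p in self.backbone.parameters():   # freeze all original CLIP weights
            p.requires_grad = False
        self.adapter = AdapterHead(feat_dim)   # only adapter + classifier are trained
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # backbone runs in inference mode
            feats = self.backbone(frames)
        return self.classifier(self.adapter(feats))
```

In this setup, only the adapter and classifier parameters receive gradients, so the memory and storage cost of fine-tuning is a small fraction of updating the full CLIP model.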
“…With the great success of the vision-language pretrained models, some work [1,26] has directly used the VideoQA task as a downstream task on which to fine-tune the pretrained model. The image-language pretrained model [3,11,13,23,31] has more advances than the video-language pretrained model [19,22,24,29,32]. In this paper, our work builds on two of the current state-of-the-art image-language pretrained models [12,15] for entity detection and question answering, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
“…Video foundation models (ViFMs) hold enormous potential to unlock new insights within this vast corpus. While prior work has made great progress towards general video understanding (Xu et al., 2021; Wang et al., 2022c; Yan et al., 2022; Tong et al., 2022; Li et al., 2023b; Wang et al., 2023c), building a truly foundational video model is still an elusive goal. Existing models often struggle to balance appearance-heavy tasks with motion-centric reasoning, falling behind task-specialized models across many benchmarks (Yuan et al., 2023).…”
Section: Introduction (mentioning)
confidence: 99%