A CLIP-Enhanced Method for Video-Language Understanding
Preprint, 2021
DOI: 10.48550/arxiv.2110.07137

Cited by 4 publications (4 citation statements, published 2022–2023). References 8 publications.
“…2021) adopted contrastive learning to promote the representation learning of the linguistic modality. Recently, large-scale multi-modal pre-trained models (Li et al. 2020; Li, He, and Feng 2021; Zellers et al. 2021; Yang et al. 2022a; Zellers et al. 2022; Yang et al. 2022b) have achieved significant progress in unified representation learning by leveraging the consistency and complementarity of different modalities. In addition, Lin et al. (2021) explored a unified model to learn knowledge of different modalities and used a generative method to solve multi-choice video question answering tasks.…”
Section: Related Work (Video Question Answering)
Mentioning confidence: 99%
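To make the contrastive objective referenced in the statement above concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired video and text embeddings. The function name, shapes, and temperature value are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumes a batch of matched video/text embedding pairs; all names are
# illustrative, not the cited papers' implementations.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares video i with text j.
    logits = v @ t.T / temperature
    # The matched pair for row/column i sits on the diagonal.
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: video-to-text plus text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2

# Random embeddings standing in for encoder outputs.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(video_emb, text_emb))
```

Minimizing this loss pulls each matched video/text pair together while pushing apart all mismatched pairs in the batch, which is the mechanism the quoted related-work passage credits for improved unified representation learning.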
“…How2QA dataset (Li et al. 2020) contains 31.7k video clips. Baseline models include HERO (Li et al. 2020), the 2021 ICCV VALUE winner Craig.Starr (Shin et al. 2021), DUKG (Li, He, and Feng 2021), CLIP (Radford et al. 2021), CLIP+SlowFast, and ResNet+SlowFast. Results on the public test split are listed in Table 5.…”
Section: Video Question and Answering
Mentioning confidence: 99%
“…The first VQA benchmark is How2QA (Li et al., 2020), which contains 31.7k video clips collected from HowTo100M (Miech et al., 2019). The baseline models on How2QA include HERO (Li et al., 2020), the 2021 ICCV VALUE winner Craig.Starr (Shin et al., 2021), DUKG (Li et al., 2021a), CLIP (Radford et al., 2021), CLIP+SlowFast features, and ResNet+SlowFast features (Li et al., 2021b).…”
Section: Video Question and Answering
Mentioning confidence: 99%
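For context on the CLIP-based baselines listed above, below is a minimal sketch of extracting per-frame CLIP image features from a video clip using the Hugging Face transformers CLIP implementation. The stand-in frame list and the choice of the ViT-B/32 checkpoint are illustrative assumptions; the cited systems' actual frame-sampling and feature pipelines are not described here.

```python
# Minimal sketch: per-frame CLIP image features for a video clip.
# Model and preprocessing come from Hugging Face transformers; the
# frame list is a stand-in for a real video decoder's output.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one L2-normalized CLIP embedding per frame: (num_frames, dim)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Stand-in frames (a real pipeline would sample them from the video).
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
clip_feats = encode_frames(frames)  # e.g. shape (4, 512) for ViT-B/32
print(clip_feats.shape)
```

In a CLIP+SlowFast baseline of the kind named above, frame-level features like these would be combined with SlowFast motion features before the question-answering head; the exact fusion scheme is specific to each cited system.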