2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00725
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Cited by 393 publications (313 citation statements). References 41 publications.
“…Although they adopt GloVe [36] embeddings for query, the issues of feature gap are well alleviated. Considering recent advances in video-based vision-language pretraining (e.g., BVET [168], ActBERT [169], ClipBERT [170], and VideoCLIP [171]), dedicated or more effective feature extractors for TSGV are much expected.…”
Section: Effective Feature Extractor(s) (mentioning)
confidence: 99%
“…There have been a series of works on the interaction of computer vision and natural language processing fields, e.g., text-to-image retrieval [45], image caption [50], visual question answering [1], referring segmentation [19] and so on. Among these works, vision-language pre-training has attracted growing attention during the past few years [24,33,37]. As a milestone, Radford et al. devise a large-scale pretraining model, named CLIP [34], which employs a contrastive learning strategy on a huge amount of image-text pairs, and shows impressive transferable ability over 30 classification datasets.…”
Section: Related Work (mentioning)
confidence: 99%
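The contrastive strategy this excerpt attributes to CLIP pairs each image with its own caption and treats every other caption in the batch as a negative. Below is a minimal PyTorch sketch of such a symmetric image-text contrastive loss; the function name, tensor shapes, and the temperature value are illustrative assumptions, not code from the CLIP release.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the image and text encoders.
    The i-th image and i-th text form the positive pair; all other in-batch
    combinations serve as negatives.
    """
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example with a random batch of 8 matched pairs of 256-d embeddings.
loss = clip_style_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```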
“…It falls into a line of works that learn a universal language encoder by pretraining with language modeling objectives. Recently, several attempts [30,33,46,49,13,48,69,29,20,28] have been made which utilize BERTs and Transformers as the backbone for cross-modal tasks. In video-text learning tasks, VideoBERT [48] transforms a video into spoken words paired with a series of images and applies a Transformer to learn joint representations.…”
Section: Transformer For Video-text Learning (mentioning)
confidence: 99%
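The excerpt above describes BERT-style models that feed visual inputs and text tokens to a shared Transformer so that self-attention can relate the two modalities. The following is a hypothetical PyTorch sketch of that general recipe, not the VideoBERT or ClipBERT implementation; the class name, layer sizes, modality-type embedding, and the 8-frame input are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class VideoTextEncoder(nn.Module):
    """Toy cross-modal encoder in the spirit of the BERT-style models above.

    Text token embeddings and per-frame video features are projected to a
    shared width, concatenated into one sequence, and passed through a
    Transformer encoder so self-attention mixes the two modalities.
    """

    def __init__(self, vocab_size=30522, frame_dim=2048, hidden=512,
                 layers=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.frame_proj = nn.Linear(frame_dim, hidden)
        # Learned modality-type embeddings (0 = text, 1 = video).
        self.type_embed = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, frame_feats):
        # token_ids: (batch, text_len); frame_feats: (batch, num_frames, frame_dim)
        text = self.text_embed(token_ids) + self.type_embed.weight[0]
        video = self.frame_proj(frame_feats) + self.type_embed.weight[1]
        joint = torch.cat([text, video], dim=1)   # one multimodal sequence
        return self.encoder(joint)                # contextualized joint features

# Example: 2 clips, 16 text tokens, 8 sampled frames with 2048-d features each.
model = VideoTextEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 8, 2048))
print(out.shape)  # torch.Size([2, 24, 512])
```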