2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00725
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Cited by 393 publications (313 citation statements). References 41 publications.
“…Although they adopt GloVe [36] embeddings for query, the issues of feature gap are well alleviated. Considering recent advances in video-based vision-language pretraining (e.g., BVET [168], ActBERT [169], ClipBERT [170], and VideoCLIP [171]), dedicated or more effective feature extractors for TSGV are much expected.…”
Section: Effective Feature Extractor(s) (mentioning)
confidence: 99%
“…There have been a series of works on the interaction of computer vision and natural language processing fields, e.g., text-to-image retrieval [45], image caption [50], visual question answering [1], referring segmentation [19] and so on. Among these works, vision-language pre-training has attracted growing attention during the past few years [24,33,37]. As a milestone, Radford et al. devise a large-scale pretraining model, named CLIP [34], which employs a contrastive learning strategy on a huge amount of image-text pairs, and shows impressive transferable ability over 30 classification datasets.…”
Section: Related Work (mentioning)
confidence: 99%
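The contrastive strategy this excerpt attributes to CLIP pairs each image with its own caption and treats every other caption in the batch as a negative. Below is a minimal PyTorch sketch of such a symmetric image-text contrastive loss; the function name, tensor shapes, and the temperature value are illustrative assumptions, not code from the CLIP release.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the image and text encoders.
    The i-th image and i-th text form the positive pair; all other in-batch
    combinations serve as negatives.
    """
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example with a random batch of 8 matched pairs of 256-d embeddings.
loss = clip_style_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```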
“…It falls into a line of works that learn a universal language encoder by pretraining with language modeling objectives. Recently, several attempts [30,33,46,49,13,48,69,29,20,28] have been made which utilize BERTs and Transformers as the backbone for cross-modal tasks. In video-text learning tasks, VideoBERT [48] transforms a video into spoken words paired with a series of images and applies a Transformer to learn joint representations.…”
Section: Transformer For Video-text Learning (mentioning)
confidence: 99%
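The excerpt above describes BERT-style models that feed visual inputs and text tokens to a shared Transformer so that self-attention can relate the two modalities. The following is a hypothetical PyTorch sketch of that general recipe, not the VideoBERT or ClipBERT implementation; the class name, layer sizes, modality-type embedding, and the 8-frame input are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class VideoTextEncoder(nn.Module):
    """Toy cross-modal encoder in the spirit of the BERT-style models above.

    Text token embeddings and per-frame video features are projected to a
    shared width, concatenated into one sequence, and passed through a
    Transformer encoder so self-attention mixes the two modalities.
    """

    def __init__(self, vocab_size=30522, frame_dim=2048, hidden=512,
                 layers=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.frame_proj = nn.Linear(frame_dim, hidden)
        # Learned modality-type embeddings (0 = text, 1 = video).
        self.type_embed = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, frame_feats):
        # token_ids: (batch, text_len); frame_feats: (batch, num_frames, frame_dim)
        text = self.text_embed(token_ids) + self.type_embed.weight[0]
        video = self.frame_proj(frame_feats) + self.type_embed.weight[1]
        joint = torch.cat([text, video], dim=1)   # one multimodal sequence
        return self.encoder(joint)                # contextualized joint features

# Example: 2 clips, 16 text tokens, 8 sampled frames with 2048-d features each.
model = VideoTextEncoder()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 8, 2048))
print(out.shape)  # torch.Size([2, 24, 512])
```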