A CLIP-Enhanced Method for Video-Language Understanding
Preprint, 2021
DOI: 10.48550/arxiv.2110.07137

Cited by 4 publications (4 citation statements, published 2022–2023). References 8 publications.
“…2021) adopted contrastive learning to promote the representation learning of the linguistic modality. Recently, large-scale multi-modal pre-trained models (Li et al. 2020; Li, He, and Feng 2021; Zellers et al. 2021; Yang et al. 2022a; Zellers et al. 2022; Yang et al. 2022b) have achieved significant progress in unified representation learning by leveraging the consistency and complementarity of different modalities. In addition, Lin et al. (2021) explored a unified model to learn knowledge of different modalities and used a generative method to solve multi-choice video question answering tasks.…”
Section: Related Work (Video Question Answering)
Mentioning confidence: 99%
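To make the contrastive objective referenced in the statement above concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired video and text embeddings. The function name, shapes, and temperature value are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumes a batch of matched video/text embedding pairs; all names are
# illustrative, not the cited papers' implementations.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares video i with text j.
    logits = v @ t.T / temperature
    # The matched pair for row/column i sits on the diagonal.
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: video-to-text plus text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2

# Random embeddings standing in for encoder outputs.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(video_emb, text_emb))
```

Minimizing this loss pulls each matched video/text pair together while pushing apart all mismatched pairs in the batch, which is the mechanism the quoted related-work passage credits for improved unified representation learning.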
“…How2QA dataset (Li et al. 2020) contains 31.7k video clips. Baseline models include HERO (Li et al. 2020), the 2021 ICCV VALUE winner Craig.Starr (Shin et al. 2021), DUKG (Li, He, and Feng 2021), CLIP (Radford et al. 2021), CLIP+SlowFast, and ResNet+SlowFast. Results on the public test split are listed in Table 5.…”
Section: Video Question and Answering
Mentioning confidence: 99%
“…The first VQA benchmark is How2QA (Li et al., 2020), which contains 31.7k video clips collected from HowTo100M (Miech et al., 2019). The baseline models on How2QA include HERO (Li et al., 2020), the 2021 ICCV VALUE winner Craig.Starr (Shin et al., 2021), DUKG (Li et al., 2021a), CLIP (Radford et al., 2021), CLIP+SlowFast features, and ResNet+SlowFast features (Li et al., 2021b).…”
Section: Video Question and Answering
Mentioning confidence: 99%
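For context on the CLIP-based baselines listed above, below is a minimal sketch of extracting per-frame CLIP image features from a video clip using the Hugging Face transformers CLIP implementation. The stand-in frame list and the choice of the ViT-B/32 checkpoint are illustrative assumptions; the cited systems' actual frame-sampling and feature pipelines are not described here.

```python
# Minimal sketch: per-frame CLIP image features for a video clip.
# Model and preprocessing come from Hugging Face transformers; the
# frame list is a stand-in for a real video decoder's output.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one L2-normalized CLIP embedding per frame: (num_frames, dim)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Stand-in frames (a real pipeline would sample them from the video).
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
clip_feats = encode_frames(frames)  # e.g. shape (4, 512) for ViT-B/32
print(clip_feats.shape)
```

In a CLIP+SlowFast baseline of the kind named above, frame-level features like these would be combined with SlowFast motion features before the question-answering head; the exact fusion scheme is specific to each cited system.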