2023
DOI: 10.48550/arXiv.2301.07868
Preprint
Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

Abstract: State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune the pre-trained model (e.g., CLIP) on specific datasets, which may incur substantial storage costs in practical applications since a separate model must be stored per task. To overcome this issue, we present the premier work on parameter-efficient VTR from the pre-trained model, i.e., only a small number of parameters are tunable while the backbone stays frozen. Towards this goal, we propose a new method dubbed Multimodal Video Adapter.
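The core setup the abstract describes, a frozen backbone with a small set of tunable parameters saved per task, can be sketched in a few lines of PyTorch. The `BottleneckAdapter`, the stand-in backbone, and the checkpoint name below are illustrative assumptions, not the paper's actual Multimodal Video Adapter design:

```python
# Minimal sketch of the storage argument in the abstract: with the backbone
# frozen, only the small tunable module needs to be saved per task.
# The backbone and adapter here are illustrative stand-ins.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Tiny residual bottleneck: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))  # stand-in for CLIP
for p in backbone.parameters():
    p.requires_grad = False          # frozen: identical across all tasks

adapter = BottleneckAdapter(512)     # the only task-specific parameters

# Per task, only the adapter checkpoint is written; the backbone is shared.
torch.save(adapter.state_dict(), "task_A_adapter.pt")

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"tunable params: {trainable:,} vs frozen backbone: {frozen:,}")
```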

Cited by 3 publications (3 citation statements) | References 32 publications (76 reference statements)
“…Recently, (Jiang et al. 2022) introduced a weight-share mechanism and adopted the query-scoring frame-feature reweighting method proposed in (Bain et al. 2022) to boost performance. (Zhang et al. 2023) proposed temporal adaptation and cross-modal interaction modules. Prompt tuning (Lester, Al-Rfou, and Constant 2021) is another parameter-efficient choice, introducing additional learnable parameters at the model's input.…”
Section: Related Work
confidence: 99%
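The prompt-tuning alternative cited in this statement (Lester, Al-Rfou, and Constant 2021) can be illustrated with a minimal sketch: a small bank of learnable vectors is prepended to the input embeddings of a frozen model. The toy encoder and the sizes are assumptions for illustration, not the cited method's exact configuration:

```python
# Hedged sketch of prompt tuning: learnable vectors are prepended to the
# input embeddings while the model itself stays frozen.
import torch
import torch.nn as nn

embed_dim, n_prompts, seq_len = 512, 8, 32

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad = False  # frozen backbone

# The only trainable parameters: a bank of prompt embeddings.
prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

tokens = torch.randn(4, seq_len, embed_dim)               # batch of input embeddings
batch_prompts = prompts.unsqueeze(0).expand(4, -1, -1)    # broadcast over the batch
out = encoder(torch.cat([batch_prompts, tokens], dim=1))  # shape (4, 8+32, 512)
```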
“…Previous video-language modeling methods (Liu et al. 2022; Lei, Berg, and Bansal 2021; Yu et al. 2021) employ pre-trained Transformer models such as the Unified Multimodal Transformer (UMT) (Liu et al. 2022) and Vision-Guided BART (VG-BART) (Yu et al. 2021), and fine-tune all the parameters of these models for every single task. This results in substantial storage overhead, since each task demands storing a separate model (Zhang et al. 2023). Moreover, because of the difficulty of collecting video-language data (Pan et al. 2022), fully fine-tuning these over-parameterized models in low-resource scenarios, where limited training data is available, leads to instability and sub-optimal performance (Jiang et al. 2022; Huang et al. 2023).…”
Section: Introduction
confidence: 99%
“…To address these shortcomings, adapters have been proposed as a parameter-efficient solution for fine-tuning video-language pre-trained Transformers (Jiang et al. 2022; Zhang et al. 2023; Yang et al. 2023; Sung, Cho, and Bansal 2022; Chen et al. 2022). The strategy is to add an additional adaptation module to each layer of the pre-trained network; only these adaptation modules are trained during fine-tuning, improving the parameter-performance trade-off.…”
Section: Introduction
confidence: 99%
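The per-layer adapter strategy this statement describes, a frozen layer followed by a small trainable module, might look like the following sketch. The `AdaptedLayer` wrapper, the stand-in layers, and the hyperparameters are assumptions, not any specific published design:

```python
# Sketch of the per-layer adapter strategy: each frozen layer of a
# pre-trained network is followed by a small trainable bottleneck, and
# the optimizer updates only the adapter parameters.
import torch
import torch.nn as nn

class AdaptedLayer(nn.Module):
    def __init__(self, layer: nn.Module, dim: int, bottleneck: int = 32):
        super().__init__()
        self.layer = layer                        # frozen pre-trained layer
        self.adapter = nn.Sequential(             # trainable bottleneck
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        h = self.layer(x)
        return h + self.adapter(h)                # residual adapter output

dim = 512
layers = [nn.Linear(dim, dim) for _ in range(4)]  # stand-in pre-trained layers
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False                   # backbone stays frozen

model = nn.Sequential(*(AdaptedLayer(layer, dim) for layer in layers))
optim = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)  # fine-tuning updates only the adapters
```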