2023
DOI: 10.48550/arXiv.2301.07868
Preprint
Multimodal Video Adapter for Parameter Efficient Video Text Retrieval

Abstract: State-of-the-art video-text retrieval (VTR) methods usually fully fine-tune the pre-trained model (e.g., CLIP) on specific datasets, which may incur substantial storage costs in practical applications since a separate model must be stored per task. To overcome this issue, we present the premier work on parameter-efficient VTR from the pre-trained model, i.e., only a small number of parameters are tunable while the backbone stays frozen. Towards this goal, we propose a new method dubbed Multimodal Video Adapter.
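The core setup the abstract describes, a frozen backbone with a small set of tunable parameters saved per task, can be sketched in a few lines of PyTorch. The `BottleneckAdapter`, the stand-in backbone, and the checkpoint name below are illustrative assumptions, not the paper's actual Multimodal Video Adapter design:

```python
# Minimal sketch of the storage argument in the abstract: with the backbone
# frozen, only the small tunable module needs to be saved per task.
# The backbone and adapter here are illustrative stand-ins.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Tiny residual bottleneck: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual connection

backbone = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))  # stand-in for CLIP
for p in backbone.parameters():
    p.requires_grad = False          # frozen: identical across all tasks

adapter = BottleneckAdapter(512)     # the only task-specific parameters

# Per task, only the adapter checkpoint is written; the backbone is shared.
torch.save(adapter.state_dict(), "task_A_adapter.pt")

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"tunable params: {trainable:,} vs frozen backbone: {frozen:,}")
```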

Cited by 3 publications (3 citation statements) | References 32 publications (76 reference statements)
“…Recently, (Jiang et al. 2022) introduced a weight-share mechanism and adopted the query-scoring frame-feature reweighting method proposed in (Bain et al. 2022) to boost performance. (Zhang et al. 2023) proposed temporal adaptation and cross-modal interaction modules. Prompt tuning (Lester, Al-Rfou, and Constant 2021) is another parameter-efficient choice, introducing additional learnable parameters at the model's input.…”
Section: Related Work
confidence: 99%
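The prompt-tuning alternative cited in this statement (Lester, Al-Rfou, and Constant 2021) can be illustrated with a minimal sketch: a small bank of learnable vectors is prepended to the input embeddings of a frozen model. The toy encoder and the sizes are assumptions for illustration, not the cited method's exact configuration:

```python
# Hedged sketch of prompt tuning: learnable vectors are prepended to the
# input embeddings while the model itself stays frozen.
import torch
import torch.nn as nn

embed_dim, n_prompts, seq_len = 512, 8, 32

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad = False  # frozen backbone

# The only trainable parameters: a bank of prompt embeddings.
prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

tokens = torch.randn(4, seq_len, embed_dim)               # batch of input embeddings
batch_prompts = prompts.unsqueeze(0).expand(4, -1, -1)    # broadcast over the batch
out = encoder(torch.cat([batch_prompts, tokens], dim=1))  # shape (4, 8+32, 512)
```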
“…Previous video-language modeling methods (Liu et al. 2022; Lei, Berg, and Bansal 2021; Yu et al. 2021) employ pre-trained Transformer models such as the Unified Multimodal Transformer (UMT) (Liu et al. 2022) and Vision-Guided BART (VG-BART) (Yu et al. 2021), and fine-tune all the parameters of these models for every single task. This results in substantial storage overhead, since each task demands storing a separate model (Zhang et al. 2023). Moreover, because of the difficulty of collecting video-language data (Pan et al. 2022), fully fine-tuning these over-parameterized models in low-resource scenarios, where limited training data is available, leads to instability and sub-optimal performance (Jiang et al. 2022; Huang et al. 2023).…”
Section: Introduction
confidence: 99%
“…To address these shortcomings, adapters have been proposed as a parameter-efficient solution for fine-tuning video-language pre-trained Transformers (Jiang et al. 2022; Zhang et al. 2023; Yang et al. 2023; Sung, Cho, and Bansal 2022; Chen et al. 2022). The strategy is to add an additional adaptation module to each layer of the pre-trained network; only these adaptation modules are trained during fine-tuning, improving the parameter-performance trade-off.…”
Section: Introduction
confidence: 99%
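The per-layer adapter strategy this statement describes, a frozen layer followed by a small trainable module, might look like the following sketch. The `AdaptedLayer` wrapper, the stand-in layers, and the hyperparameters are assumptions, not any specific published design:

```python
# Sketch of the per-layer adapter strategy: each frozen layer of a
# pre-trained network is followed by a small trainable bottleneck, and
# the optimizer updates only the adapter parameters.
import torch
import torch.nn as nn

class AdaptedLayer(nn.Module):
    def __init__(self, layer: nn.Module, dim: int, bottleneck: int = 32):
        super().__init__()
        self.layer = layer                        # frozen pre-trained layer
        self.adapter = nn.Sequential(             # trainable bottleneck
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        h = self.layer(x)
        return h + self.adapter(h)                # residual adapter output

dim = 512
layers = [nn.Linear(dim, dim) for _ in range(4)]  # stand-in pre-trained layers
for layer in layers:
    for p in layer.parameters():
        p.requires_grad = False                   # backbone stays frozen

model = nn.Sequential(*(AdaptedLayer(layer, dim) for layer in layers))
optim = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)  # fine-tuning updates only the adapters
```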