2022
DOI: 10.48550/arxiv.2206.13559
Preprint

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Abstract: Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged as a strategy with promising performance. Due to ever-growing model sizes, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) as the pre-trained m…

Cited by 5 publications (21 citation statements)
References 54 publications
“…[6] proposes a simple adapter, AdaptFormer, based on ViTs [9]. Convpass [23] and ST-Adapter [35] exploit the spatial invariance of images and the temporal information of videos, respectively. Adapters are also widely used in NLP [16,18].…”
Section: Parameter-efficient Transfer Learning
confidence: 99%
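The adapter design this quote attributes to ST-Adapter [35] can be made concrete with a short sketch. The block below is a minimal PyTorch illustration, assuming a hidden width of 768, a bottleneck width of 128, a depthwise (3, 1, 1) convolution for temporal mixing, and flattened video tokens from a frozen ViT; the paper's actual layer names, kernel size, and placement inside each transformer block may differ.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Sketch of an ST-Adapter-style bottleneck block (assumed dims/names).

    Projects frame tokens down to a narrow width, applies a depthwise 3D
    convolution across (time, height, width) to inject spatio-temporal
    reasoning, then projects back up. Only these few parameters are
    trained; the frozen image backbone stays untouched.
    """

    def __init__(self, dim: int = 768, bottleneck: int = 128,
                 kernel: tuple = (3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)      # down-projection
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel,
                              padding=tuple(k // 2 for k in kernel),
                              groups=bottleneck)    # depthwise spatio-temporal conv
        self.up = nn.Linear(bottleneck, dim)        # up-projection
        nn.init.zeros_(self.up.weight)              # zero-init: adapter starts
        nn.init.zeros_(self.up.bias)                # as an identity mapping

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (batch, t*h*w, dim) -- flattened video tokens from a frozen ViT
        b, n, d = x.shape
        z = self.down(x)                            # (b, n, bottleneck)
        z = z.transpose(1, 2).reshape(b, -1, t, h, w)
        z = self.conv(z)                            # temporal + spatial mixing
        z = z.flatten(2).transpose(1, 2)            # back to (b, n, bottleneck)
        return x + self.up(z)                       # residual adapter output
```

Because the up-projection is zero-initialized, each adapter begins as an identity mapping and cannot disturb the frozen backbone's features at the start of training.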
“…Afterwards, we adapt previous PETL methods, including AdaptFormer [6], Convpass [23], and ST-Adapter [35], to VTR as introduced in Sec. 5.1.…”
Section: Fine
confidence: 99%
“…Fine-tuning is a prevalent topic in centralized transfer learning, especially in this era of the "Foundation Model" (Bommasani et al., 2022). A significant line of work reduces the number of trainable parameters, i.e., parameter-efficient fine-tuning (PEFT) (Chen et al., 2022a; Pan et al., 2022; Liu et al., 2022). This makes pre-trained models easier to access and use by reducing the memory cost of fine-tuning, since fewer gradients need to be computed.…”
Section: Related Work
confidence: 99%
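A minimal sketch of the PEFT recipe this quote describes: freeze the pre-trained weights so no gradients or optimizer state are kept for them, and hand only the small trainable subset to the optimizer. The torchvision backbone and the replaced classification head below are illustrative assumptions, not the setup of any cited paper.

```python
import torch
import torchvision

# Assumed setup: a pre-trained image backbone (hidden width 768).
backbone = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")

# Freeze every pre-trained weight: no gradients, no optimizer state for them.
for p in backbone.parameters():
    p.requires_grad = False

# Hypothetical: one would attach adapters (see the sketch above) per encoder
# block and train only those; here a fresh head stands in as the trainable part.
backbone.heads = torch.nn.Linear(768, 400)  # e.g., 400 action classes

trainable = [p for p in backbone.parameters() if p.requires_grad]
total = sum(p.numel() for p in backbone.parameters())
print(f"training {sum(p.numel() for p in trainable):,} / {total:,} parameters")

# Only the small trainable subset reaches the optimizer, so the backward pass
# computes and stores far fewer gradients than full fine-tuning would.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```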
“…However, it is hard to obtain a pretrained model as powerful as CLIP in the video domain, owing to the unaffordable computational demands and the difficulty of collecting video-text pairs as large and diverse as image-text data. Instead of directly pursuing video-text pretrained models [17,27], a promising alternative that benefits downstream video tasks is to transfer the knowledge in image-text pretrained models to the video domain, which has attracted increasing attention in recent years [12,13,26,29,30,41].…”
Section: Introduction
confidence: 99%