2022
DOI: 10.48550/arxiv.2205.13535
Preprint

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Abstract: Although pre-trained Vision Transformers (ViTs) have achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens: each model needs to be independently and fully fine-tuned for different tasks, which limits its transferability across domains. To address this challenge, we propose an effective adaptation approach for the Transformer, namely AdaptFormer, which can adapt pre-trained ViTs to many dif…
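
The sketch below illustrates the adaptation idea the abstract describes: keep the pre-trained ViT weights frozen and train only a small bottleneck module attached alongside each block's MLP. This is a minimal sketch, not the authors' released code; the class and parameter names (AdaptMLPBlock, bottleneck_dim, scale) are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): freeze the pre-trained MLP
# and learn only a small parallel bottleneck branch with a scaling factor.
import torch
import torch.nn as nn


class AdaptMLPBlock(nn.Module):
    def __init__(self, frozen_mlp: nn.Module, dim: int,
                 bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.frozen_mlp = frozen_mlp
        for p in self.frozen_mlp.parameters():
            p.requires_grad = False          # pre-trained weights stay fixed
        # lightweight trainable branch: down-projection -> ReLU -> up-projection
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus a scaled, trainable bottleneck path
        return self.frozen_mlp(x) + self.scale * self.up(self.act(self.down(x)))


# Usage: wrap the MLP of a pre-trained block; only the bottleneck parameters
# (roughly 2 * dim * bottleneck_dim per block) are trained.
block = AdaptMLPBlock(nn.Sequential(nn.Linear(768, 3072), nn.GELU(),
                                    nn.Linear(3072, 768)), dim=768)
out = block(torch.randn(1, 197, 768))
```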

Cited by 13 publications (23 citation statements)
References 56 publications (120 reference statements)
“…Early works [39,40] introduce adapters to computer vision. [6] proposes a simple adapter, AdaptFormer, based on ViTs [9]. Convpass [23] and ST-Adapter [35] utilize the spatial invariance and the temporal information of videos, respectively.…”
Section: Parameter-efficient Transfer Learning
confidence: 99%
“…Apart from inferior performance, they also have the following drawbacks, which make them inapplicable for PE-VTR. Some [6,16,23] are designed only for a single modality (image or text) and ignore temporal modeling and/or the interactions between multimodal features. Others bring in a large parameter overhead, thus going against the purpose of PE-VTR [35].…”
Section: Introduction
confidence: 99%
“…Per-task video feature adaptation Existing pretrained ViL models [33,43] are not designed for TAD, so domain adaptation is needed. Given the large model size and scarce labeled training data, we adopt the adapter [4] strategy so that only a fraction of the parameters need to be learned. Concretely, our adapter unit consists of a down-projection linear layer, a non-linear activation function, and an up-projection linear layer, in that order.…”
Section: Multi-modal Prompt Meta-learning
confidence: 99%
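
The quoted description maps directly onto a small bottleneck module. A minimal sketch under that description follows; the GELU activation, the residual connection, and the class name Adapter are assumptions not stated in the quote.

```python
# Minimal sketch of the adapter unit described above: down-projection linear
# layer, non-linear activation, up-projection linear layer, in that order.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)  # down-projection
        self.act = nn.GELU()                        # non-linear activation (assumed GELU)
        self.up = nn.Linear(bottleneck_dim, dim)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumed residual connection, common in adapter-style modules
        return x + self.up(self.act(self.down(x)))
```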
“…Swin Transformer (Liu et al, 2021) computes attention within a local window and adopts shifted windows for cross-window communication. More recently, efficient transfer learning has also been explored for Vision Transformers (Bahng et al, 2022; Jia et al, 2022; Chen et al, 2022a). In this paper, we take the original ViT (Dosovitskiy et al, 2020) as the visual backbone with simple pooling layers, which are used to reduce the computational burden; more advanced structures may bring further gains.…”
Section: Related Work
confidence: 99%