2022
DOI: 10.48550/arxiv.2205.13535
Preprint

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Abstract: Although pre-trained Vision Transformers (ViTs) have achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens: each model needs to be independently and fully fine-tuned for different tasks, which limits its transferability across domains. To address this challenge, we propose an effective adaptation approach for the Transformer, namely AdaptFormer, which can adapt pre-trained ViTs to many dif…
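
The sketch below illustrates the adaptation idea the abstract describes: keep the pre-trained ViT weights frozen and train only a small bottleneck module attached alongside each block's MLP. This is a minimal sketch, not the authors' released code; the class and parameter names (AdaptMLPBlock, bottleneck_dim, scale) are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): freeze the pre-trained MLP
# and learn only a small parallel bottleneck branch with a scaling factor.
import torch
import torch.nn as nn


class AdaptMLPBlock(nn.Module):
    def __init__(self, frozen_mlp: nn.Module, dim: int,
                 bottleneck_dim: int = 64, scale: float = 0.1):
        super().__init__()
        self.frozen_mlp = frozen_mlp
        for p in self.frozen_mlp.parameters():
            p.requires_grad = False          # pre-trained weights stay fixed
        # lightweight trainable branch: down-projection -> ReLU -> up-projection
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus a scaled, trainable bottleneck path
        return self.frozen_mlp(x) + self.scale * self.up(self.act(self.down(x)))


# Usage: wrap the MLP of a pre-trained block; only the bottleneck parameters
# (roughly 2 * dim * bottleneck_dim per block) are trained.
block = AdaptMLPBlock(nn.Sequential(nn.Linear(768, 3072), nn.GELU(),
                                    nn.Linear(3072, 768)), dim=768)
out = block(torch.randn(1, 197, 768))
```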

Cited by 13 publications (23 citation statements)
References 56 publications (120 reference statements)
“…Early works [39,40] introduce adapters to computer vision. [6] proposes a simple adapter, AdaptFormer, based on ViTs [9]. Convpass [23] and ST-Adapter [35] utilize the spatial invariance and the temporal information of videos, respectively.…”
Section: Parameter-efficient Transfer Learning
confidence: 99%
“…Apart from inferior performance, they also have the following drawbacks, which make them inapplicable for PE-VTR. Some [6,16,23] are designed only for a single modality (image or text) and ignore temporal modeling and/or the interactions between multimodal features. Others bring in a large parameter overhead, thus going against the purpose of PE-VTR [35].…”
Section: Introduction
confidence: 99%
“…Per-task video feature adaptation Existing pretrained ViL models [33,43] are not designed for TAD, so domain adaptation is needed. Given the large model size and scarce labeled training data, we adopt the adapter [4] strategy so that only a fraction of the parameters need to be learned. Concretely, our adapter unit consists of a down-projection linear layer, a non-linear activation function, and an up-projection linear layer, in that order.…”
Section: Multi-modal Prompt Meta-learning
confidence: 99%
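
The quoted description maps directly onto a small bottleneck module. A minimal sketch under that description follows; the GELU activation, the residual connection, and the class name Adapter are assumptions not stated in the quote.

```python
# Minimal sketch of the adapter unit described above: down-projection linear
# layer, non-linear activation, up-projection linear layer, in that order.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)  # down-projection
        self.act = nn.GELU()                        # non-linear activation (assumed GELU)
        self.up = nn.Linear(bottleneck_dim, dim)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumed residual connection, common in adapter-style modules
        return x + self.up(self.act(self.down(x)))
```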
“…Swin Transformer (Liu et al, 2021) computes attention within a local window and adopts shifted windows for cross-window communication. More recently, efficient transfer learning has also been explored for Vision Transformers (Bahng et al, 2022; Jia et al, 2022; Chen et al, 2022a). In this paper, we take the original ViT (Dosovitskiy et al, 2020) as the visual backbone with simple pooling layers, which are used to reduce the computational burden; more advanced structures may bring further gains.…”
Section: Related Work
confidence: 99%