“…The strong capability of modeling long-range relations has facilitated the success of Transformers in various vision tasks, including image classification [27,56,54], object detection [10,88,20], semantic/instance segmentation [76], video understanding [7,2,28,51], point cloud modeling [85,35], 3D object recognition [18], and even low-level processing [16,53,74]. Furthermore, Transformers have advanced vision recognition performance through large-scale pretraining [19,60,12,30,37,68,64]. In this situation, given pre-trained Transformer models, which are much larger than the previously prevalent CNN backbones, one open question is how to fine-tune these big vision models so that they can be adapted to more specific downstream tasks.…”