“…Recently, a trend of applying the Transformer architecture [61] to the computer vision domain has emerged [12,19,28,33,36,57,71,83]. The Vision Transformer (ViT), which relies solely on the self-attention mechanism inherited from the Transformer architecture, has set many state-of-the-art (SOTA) records in image classification [6,21,60], object detection [1,3,17,47], tracking [14,46,72], semantic segmentation [15,20,84], depth estimation [40,75], human pose estimation [39], 3D object animation [7], image retrieval [22], and image enhancement [8,44,74]. However, despite these impressive results, ViTs sacrifice model compactness, portability, and trainability in exchange for high accuracy.…”
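The self-attention mechanism that ViTs inherit from the Transformer can be sketched as scaled dot-product attention over a sequence of patch embeddings. The following is a minimal NumPy illustration, not the paper's implementation; the function name, shapes, and random projections are hypothetical and chosen only to show the computation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (n_tokens, d_model) patch embeddings;
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise token affinities, scaled by sqrt of the head dimension.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax turns affinities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors.
    return weights @ v  # (n_tokens, d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # e.g. 4 patches, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Note the quadratic cost in the number of tokens (the `(n_tokens, n_tokens)` score matrix), which is one reason ViTs trade away the lightweight capacity mentioned above.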