2022
DOI: 10.1007/978-3-031-19806-9_27

VSA: Learning Varied-Size Window Attention in Vision Transformers

Cited by 35 publications (14 citation statements) | References 31 publications
“…In all 3 groups, our model consistently outperforms other compared ones. For example, for models in the smallest group (∼2G FLOPs), our BiFormer-T achieves 81.4% top-1 accuracy, 1.4% better than the most competitive QuadTree-b1 [38]. For models in the second group (∼4G FLOPs), BiFormer-S achieves 83.8% top-1 accuracy.…”
Section: Image Classification on ImageNet-1K
confidence: 97%
“…The key observation which motivates our work is that the attentive regions for different queries may differ significantly, according to visualizations of pretrained ViT [15] and DETR [1]. As we achieve the goal of query-adaptive sparsity in a coarse-to-fine manner, our approach shares some similarities with quad-tree attention [38]. Different from quad-tree attention, the goal of our bi-level routing attention is to locate the few most relevant key-value pairs, while quad-tree attention builds a token pyramid and assembles messages from all levels of different granularities.…”
Section: Related Work
confidence: 99%
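
The bi-level routing idea described in the statement above can be sketched in a few lines of PyTorch. The code below is a minimal, illustrative single-head sketch, not the BiFormer authors' released implementation: region summaries are compared first, each query region keeps its top-k most relevant key regions, and full attention then runs only over tokens gathered from those regions. The function name, mean-pooled region summaries, and the region/top-k parameters are all simplifying assumptions.

```python
# Illustrative sketch of bi-level routing attention (not the authors' code).
import torch
import torch.nn.functional as F

def bilevel_routing_attention(q, k, v, num_regions, topk):
    # q, k, v: (B, N, C) token features; N must split evenly into regions.
    B, N, C = q.shape
    R = num_regions
    n = N // R                                  # tokens per region
    qr = q.view(B, R, n, C)
    kr = k.view(B, R, n, C)
    vr = v.view(B, R, n, C)

    # Coarse level: summarize each region by mean pooling.
    q_coarse = qr.mean(dim=2)                   # (B, R, C)
    k_coarse = kr.mean(dim=2)                   # (B, R, C)

    # Route each query region to its top-k most relevant key regions.
    affinity = q_coarse @ k_coarse.transpose(-1, -2)       # (B, R, R)
    idx = affinity.topk(topk, dim=-1).indices              # (B, R, topk)

    # Gather key/value tokens from the selected regions only.
    idx_exp = idx[..., None, None].expand(-1, -1, -1, n, C)
    k_sel = torch.gather(kr[:, None].expand(-1, R, -1, -1, -1), 2, idx_exp)
    v_sel = torch.gather(vr[:, None].expand(-1, R, -1, -1, -1), 2, idx_exp)
    k_sel = k_sel.reshape(B, R, topk * n, C)
    v_sel = v_sel.reshape(B, R, topk * n, C)

    # Fine level: full attention restricted to the routed tokens.
    attn = (qr @ k_sel.transpose(-1, -2)) / C ** 0.5       # (B, R, n, topk*n)
    out = F.softmax(attn, dim=-1) @ v_sel                  # (B, R, n, C)
    return out.reshape(B, N, C)

x = torch.randn(2, 64, 32)                      # 64 tokens in 8 regions of 8
y = bilevel_routing_attention(x, x, x, num_regions=8, topk=2)
print(y.shape)                                  # torch.Size([2, 64, 32])
```

Because each query region attends to only `topk * n` routed tokens instead of all `N`, the fine-level cost scales with the number of kept key-value pairs, which is the query-adaptive sparsity the citing authors contrast with quad-tree attention's level-wise message assembly.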
“…Therefore, we consider integrating the Quadtree attention module into the skip connection. Quadtree attention is an effective Transformer-based attention variant [46], as shown in Figure 2. The module computes attention in a coarse-to-fine manner and is able to capture both long-range dependencies and local interactions, achieving better results with less computation on various vision tasks.…”
Section: Quadtree Attention
confidence: 99%
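
To make the coarse-to-fine computation concrete, here is a hedged two-level sketch of quadtree-style attention. It is a simplification of QuadTree Attention rather than the released implementation: each query attends over 2x2-pooled coarse tokens, refines only its top-k coarse cells at the fine level, and simply sums the messages from the two levels (the published method assembles level-wise messages with learned weighting). The function name and the message-assembly rule are illustrative assumptions.

```python
# Simplified two-level quadtree-style attention (illustrative only).
import torch
import torch.nn.functional as F

def quadtree_attention_2level(q, k, v, H, W, topk=4):
    # q, k, v: (B, H*W, C) tokens on an H x W grid; H and W must be even.
    B, N, C = q.shape
    # Coarse tokens: average-pool keys/values over 2x2 windows.
    k2 = F.avg_pool2d(k.transpose(1, 2).reshape(B, C, H, W), 2)
    v2 = F.avg_pool2d(v.transpose(1, 2).reshape(B, C, H, W), 2)
    k2 = k2.flatten(2).transpose(1, 2)                     # (B, N/4, C)
    v2 = v2.flatten(2).transpose(1, 2)

    # Coarse level: every query attends over all coarse tokens.
    s_coarse = (q @ k2.transpose(-1, -2)) / C ** 0.5       # (B, N, N/4)
    msg = F.softmax(s_coarse, dim=-1) @ v2                 # coarse message

    # Fine level: refine only the top-k coarse cells per query,
    # i.e. the 4 child tokens of each selected 2x2 window.
    idx = s_coarse.topk(topk, dim=-1).indices              # (B, N, topk)
    r, c = idx // (W // 2), idx % (W // 2)                 # coarse row/col
    child = torch.stack([(2 * r + dr) * W + (2 * c + dc)
                         for dr in (0, 1) for dc in (0, 1)], -1)
    child = child.flatten(2)                               # (B, N, 4*topk)

    k_sel = torch.gather(k[:, None].expand(-1, N, -1, -1), 2,
                         child[..., None].expand(-1, -1, -1, C))
    v_sel = torch.gather(v[:, None].expand(-1, N, -1, -1), 2,
                         child[..., None].expand(-1, -1, -1, C))
    s_fine = (q[:, :, None] @ k_sel.transpose(-1, -2)).squeeze(2) / C ** 0.5
    msg = msg + (F.softmax(s_fine, dim=-1)[:, :, None] @ v_sel).squeeze(2)
    return msg                                             # (B, N, C)

x = torch.randn(1, 16 * 16, 32)
print(quadtree_attention_2level(x, x, x, H=16, W=16).shape)  # (1, 256, 32)
```

Skipping all fine-level tokens outside the selected coarse cells is what keeps the per-query cost bounded by `topk`, which is why the citing authors describe the module as cheap enough to place on a skip connection.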
“…DynamicViT [25] devises a lightweight prediction module to estimate the importance score of each token and determine which tokens to prune dynamically. QuadTree Attention [26] builds token pyramids and computes attention according to the attention scores: it skips irrelevant regions at the fine level if their corresponding coarse-level regions are not promising, thereby reducing the computational complexity from quadratic to linear.…”
Section: Dynamic Token Generation
confidence: 99%
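
The lightweight prediction module attributed to DynamicViT can likewise be sketched. The code below is an illustrative approximation, not the authors' implementation: a small MLP scores each token and a hard top-k keeps the highest-scoring ones. DynamicViT itself uses Gumbel-Softmax sampling during training so that pruning stays differentiable; only the inference-style hard selection is shown here, and the class name and keep ratio are assumptions.

```python
# Illustrative DynamicViT-style token pruner (inference-style hard top-k).
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight prediction head: one keep logit per token.
        self.score = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim // 4),
            nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, x):
        # x: (B, N, C) tokens; returns the kept subset (B, N_keep, C).
        B, N, C = x.shape
        logits = self.score(x).squeeze(-1)           # (B, N) importance scores
        n_keep = max(1, int(N * self.keep_ratio))
        idx = logits.topk(n_keep, dim=-1).indices    # hard top-k at inference
        return torch.gather(x, 1, idx[..., None].expand(-1, -1, C))

pruner = TokenPruner(dim=64, keep_ratio=0.5)
tokens = torch.randn(2, 196, 64)
print(pruner(tokens).shape)                          # torch.Size([2, 98, 64])
```

Dropping tokens this way shrinks the sequence fed to every subsequent attention block, which is the complementary route to efficiency that the citing survey contrasts with QuadTree Attention's region skipping.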