2021 | Preprint
DOI: 10.48550/arxiv.2106.02689

RegionViT: Regional-to-Local Attention for Vision Transformers

Abstract: Vision transformer (ViT) has recently shown its strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision tr…
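
To make the regional-to-local idea concrete, here is a minimal sketch assuming one regional token per window of local tokens, with a plain nn.MultiheadAttention standing in for the paper's exact attention modules; the class and argument names are our own illustration, not the authors' code.

```python
# Minimal sketch of regional-to-local attention (illustrative, not the
# authors' implementation). Assumes one regional token per local window.
import torch
import torch.nn as nn

class RegionalToLocalAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional, local):
        # regional: (B, R, C) -- one token per region
        # local:    (B, R, W, C) -- W local tokens attached to each region
        B, R, W, C = local.shape

        # 1) Regional self-attention: exchange global information among
        #    all regional tokens (cheap: only R tokens participate).
        regional, _ = self.regional_attn(regional, regional, regional)

        # 2) Local self-attention: each regional token joins its own window
        #    of local tokens, so global context reaches every local token
        #    without full global attention over the whole image.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+W, C)
        tokens = tokens.reshape(B * R, 1 + W, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.reshape(B, R, 1 + W, C)
        return tokens[:, :, 0], tokens[:, :, 1:]  # updated regional, local
```

In a full model a block like this would sit inside each pyramid stage, with the token windows re-folded into a feature map between stages.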

Cited by 23 publications (41 citation statements)
References 42 publications (74 reference statements)

“…However, it still lacks the connections between distant patches, contradicting the intention of MSA. In contrast, our method follows a local-to-global paradigm, which has achieved much success in vision tasks [7,9,56]. Our method not only preserves a global receptive field at each block but is also efficient in computation, as shown in the above analysis.…”
Section: Discussion (mentioning)
confidence: 95%
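
The computational argument behind this paradigm can be made concrete with a rough token-pair count (our illustrative numbers, not figures from either cited paper): global self-attention compares all N tokens pairwise, while a regional-to-local scheme pays only for one small regional attention plus one windowed attention per region.

```python
# Back-of-the-envelope attention-pair counts (illustrative numbers,
# not from the cited papers): 56x56 local tokens with 7x7 windows.
N = 56 * 56          # local tokens = 3136
W = 7 * 7            # local tokens per window = 49
R = N // W           # regional tokens, one per window = 64

global_pairs = N ** 2                    # vanilla global self-attention
r2l_pairs = R ** 2 + R * (W + 1) ** 2    # regional attn + per-window local attn

print(global_pairs)              # 9834496
print(r2l_pairs)                 # 4096 + 160000 = 164096
print(global_pairs / r2l_pairs)  # ~60x fewer attention pairs
```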
“…Comparison with Image-based ViTs. Our DualFormer can also be linked to several image-based transformers with a local-global stratified design, including RegionViT [7] and Twins-SVT [9]. The major differences between our approach and RegionViT lie in two aspects.…”
Section: Discussion (mentioning)
confidence: 99%
“…We report the performance on the validation subset, and use the mean average precision (AP) as the metric. We evaluate ELSA-Swin in Mask RCNN / Cascade Mask RCNN [2,33], which is a common practice in [6,70,71,79,87]. Following the common training protocol, we apply multi-scale training, scaling the shorter side of the input from 480 to 800 while keeping the longer side no more than 1333.…”
Section: Object Detection on COCO (mentioning)
confidence: 99%
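
The multi-scale training rule quoted above is easy to sketch. The helper below is hypothetical (not from the ELSA codebase or any detection library): it samples a shorter-side target from [480, 800] and caps the longer side at 1333.

```python
# Hypothetical helper illustrating the multi-scale training rule quoted
# above: shorter side sampled in [480, 800], longer side capped at 1333.
import random

def sample_train_size(h: int, w: int,
                      short_range=(480, 800), max_long=1333) -> tuple[int, int]:
    short_target = random.randint(*short_range)
    scale = short_target / min(h, w)
    # Shrink further if the longer side would exceed the cap.
    scale = min(scale, max_long / max(h, w))
    return round(h * scale), round(w * scale)

print(sample_train_size(480, 640))  # e.g. (652, 869) for a 480x640 image
```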
“…As can be seen, ELSA-Swin-T and ELSA-Swin-S (denoted as ELSA-T / ELSA-S) respectively improve the corresponding baselines by 1.9 AP and 1.8 AP in detection, both outperforming other methods within their group. Note that, unlike ViL [87] and RegionViT [6], ELSA-Swin does not modify the macro…”
Section: Object Detection on COCO (mentioning)
confidence: 99%