ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Zhang, Qiming; Xu, Yufei; Zhang, Jing; Tao, Dacheng

doi:10.1007/s11263-022-01739-w

Cited by 94 publications

(28 citation statements)

References 64 publications

“…To entail a fair comparison, we keep the same data augmentation and training settings as the other vision transformers as far as possible. The competitors are all competitive vision transformers, including DeiT [2], PVT [3], T2T-ViT [19], TNT [20], CViT [21], Twins [22], Swin [4], NesT [23], CvT [9], ViL [24], CAT [5], ResT [25], TransCNN [26], Shuffle [27], BoTNet [28], RegionViT [29], ViTAEv2 [30], MPViT [31], ScalableViT [32], DaViT [33], and CoAtNet [34].…”

Section: Methods (mentioning)

confidence: 99%

Couplformer: Rethinking Vision Transformer with Coupling Attention

Lü

Wang

Shen

et al. 2023

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers' performance, i.e., the enlarging self-attention maps and amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. The CrossFormer incorporating with PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code will be available at: https://github.com/cheerss/CrossFormer.

“…A possible reason is that MSAs at the end of each stage act as spatial smoothing and aggregation [20], thus neglecting details unavoidably. To this end, we propose the I-PLDE module, a parallel branch emphasizing local detail on the top of the vertical hybrid design, inspired by the "divide-and-conquer" idea in [30,32]. I-PLDE consists of a 1x1 convolution to match hidden dimension with its parallel branch, three stacked depth-wise convolution layers and a window embedding operation. SiLU is used for non-linear activation following the convention in [30,32].…”

Section: DuDoRNeXt: Towards Hybridizing CNNs and ViTs (mentioning)

confidence: 99%

“…To this end, we propose the I-PLDE module, a parallel branch emphasizing local detail on the top of the vertical hybrid design, inspired by the "divide-and-conquer" idea in [30,32]. I-PLDE consists of a 1x1 convolution to match hidden dimension with its parallel branch, three stacked depth-wise convolution layers and a window embedding operation. SiLU is used for non-linear activation following the convention in [30,32]. The output of I-PLDE F CE i is added after W-MSA for preserving details.…”

Section: DuDoRNeXt: Towards Hybridizing CNNs and ViTs (mentioning)

confidence: 99%

“…As the success of Transformer is now indisputable in computer vision, Transformer has shown great potential for undersampled MRI reconstruction as well [13,33,8,7,19]. Yet performant, ViTs have not fully substituted CNNs as ViTs require a larger amount of training data due to a low inductive bias and have longer training schedules [30,32]. Furthermore, CNNs and ViTs have different emphases.…”

Section: Introduction (mentioning)

confidence: 99%

“…In recent works in decomposing Transformer from the basic theory [20] to empirical network design [18,28], a potential direction for modernizing deep learning models arises: hybridizing CNNs and ViTs. While several work [30,32,20] in computer vision show the effectiveness of hybrid structures, there is no research systematically studying a hybrid model for MRI reconstruction.…”

Section: Introduction (mentioning)

confidence: 99%

DuDoRNeXt: A hybrid model for dual-domain undersampled MRI reconstruction

Gao

Zhou

2023

Preprint

Undersampled MRI reconstruction is crucial for accelerating clinical scanning procedures. Recent deep learning methods for MRI reconstruction adopt CNN or ViT as backbone, and so fail to exploit the complementary properties of CNN and ViT. In this paper, we propose DuDoRNeXt, whose backbone hybridizes CNN and ViT in a domain-specific, intra-stage way. Besides our hybrid vertical layout design, we introduce domain-specific modules for dual-domain reconstruction, namely image-domain parallel local detail enhancement and k-space global initialization. We evaluate different conventions of MRI reconstruction including image-domain, k-space-domain, and dual-domain reconstruction with a reference protocol on the IXI dataset and an in-house multi-contrast dataset. DuDoRNeXt achieves significant improvements over competing deep learning methods.

VSA: Learning Varied-Size Window Attention in Vision Transformers

Zhang

et al. 2022

Lecture Notes in Computer Science
