2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01214
Involution: Inverting the Inherence of Convolution for Visual Recognition

Cited by 219 publications (89 citation statements) · References 31 publications
“…Unlike the recent hybrid architectures (e.g., Hybrid-ViT [14] and BoTNet [45]) that rely on convolutions for feature encoding, Outlooker proposes to use local pair-wise token similarities to encode fine-level features and spatial context into token features and hence is more effective and parameter-efficient. This also makes our model different from the Dynamic Convolution [60] and Involution [34] that generate input-dependent convolution kernels to encode the features.…”
Section: Related Work (mentioning)
confidence: 99%
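The excerpt above points to Involution [34] and Dynamic Convolution [60] generating input-dependent kernels rather than learning fixed filters. Below is a minimal PyTorch-style sketch of that idea; the class name, reduction ratio, and layer layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InvolutionSketch(nn.Module):
    """Sketch of an involution-style layer: a K x K kernel is generated
    from the input at every spatial position (input-dependent), shared
    across the channels of a group, and applied as a local weighted sum."""
    def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # Kernel-generating function: maps each pixel's features to K*K*groups weights.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-position kernels: (B, groups, K*K, H, W)
        kernel = self.span(self.reduce(x)).view(b, self.groups, self.k * self.k, h, w)
        # Local neighbourhoods: (B, groups, C//groups, K*K, H, W)
        patches = self.unfold(x).view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        # Weighted sum over the neighbourhood; the kernel is reused by all channels of a group.
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
```

At each position the generated K×K weights are shared across the channels of a group, which is the inversion of convolution's spatial weight sharing that the excerpt contrasts Outlooker against.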
“…Weight sharing across spatial positions is mainly used in convolution, including normal convolution, depth-wise convolution and point-wise convolution. Weight sharing across channels is adopted in the attention unit [53], its variants [7,8,14,32,35,51,52,55,57,63], and token-mixer MLP in MLP-mixer [49] and ResMLP [50].…”
Section: Related Work (mentioning)
confidence: 99%
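A minimal sketch of the two sharing patterns contrasted in the excerpt above, assuming a token layout of shape (B, N, C) with N = H×W positions; the class names and the Conv1d/Linear choices are illustrative stand-ins, not code from any of the cited papers.

```python
import torch
import torch.nn as nn

class SharedAcrossPositions(nn.Module):
    """Depth-wise-convolution style: one small filter per channel,
    reused at every spatial position."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv1d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)

    def forward(self, x):                 # x: (B, N, C)
        return self.dw(x.transpose(1, 2)).transpose(1, 2)

class SharedAcrossChannels(nn.Module):
    """Token-mixer MLP / attention style: one N x N mixing matrix over
    positions, reused identically for every channel."""
    def __init__(self, num_tokens):
        super().__init__()
        self.mix = nn.Linear(num_tokens, num_tokens)

    def forward(self, x):                 # x: (B, N, C)
        return self.mix(x.transpose(1, 2)).transpose(1, 2)
```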
“…One is to learn homogeneous connection weights, e.g., SENet [26], dynamic convolution [30]. The other is to learn the weights for each region or each position (GENet [25], Lite-HRNet [61], Involution [32]). The attention unit in ViT or local ViT learns dynamic connection weights for each position.…”
Section: Related Work (mentioning)
confidence: 99%
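A hedged sketch of the distinction drawn above: channel weights pooled over all positions (SENet-style, homogeneous across locations) versus weights predicted separately at each position. The module names and the simple 1×1 gate are assumptions for illustration only, not the cited papers' implementations.

```python
import torch
import torch.nn as nn

class ChannelWeights(nn.Module):
    """SE-style: one scalar per channel, pooled over the whole map,
    so every position is re-weighted identically."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        return x * self.gate(x)           # gate shape (B, C, 1, 1)

class PerPositionWeights(nn.Module):
    """Position-wise variant (in the spirit of GENet / Involution):
    a separate weight is predicted at each spatial location."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        return x * self.gate(x)           # gate shape (B, C, H, W)
```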