2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01214
Involution: Inverting the Inherence of Convolution for Visual Recognition

Cited by 219 publications (89 citation statements) · References 31 publications
“…Unlike the recent hybrid architectures (e.g., Hybrid-ViT [14] and BoTNet [45]) that rely on convolutions for feature encoding, Outlooker proposes to use local pair-wise token similarities to encode fine-level features and spatial context into token features and hence is more effective and parameter-efficient. This also makes our model different from the Dynamic Convolution [60] and Involution [34] that generate input-dependent convolution kernels to encode the features.…”
Section: Related Work (mentioning)
confidence: 99%
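The excerpt above points to Involution [34] and Dynamic Convolution [60] generating input-dependent kernels rather than learning fixed filters. Below is a minimal PyTorch-style sketch of that idea; the class name, reduction ratio, and layer layout are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InvolutionSketch(nn.Module):
    """Sketch of an involution-style layer: a K x K kernel is generated
    from the input at every spatial position (input-dependent), shared
    across the channels of a group, and applied as a local weighted sum."""
    def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # Kernel-generating function: maps each pixel's features to K*K*groups weights.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-position kernels: (B, groups, K*K, H, W)
        kernel = self.span(self.reduce(x)).view(b, self.groups, self.k * self.k, h, w)
        # Local neighbourhoods: (B, groups, C//groups, K*K, H, W)
        patches = self.unfold(x).view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        # Weighted sum over the neighbourhood; the kernel is reused by all channels of a group.
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
```

At each position the generated K×K weights are shared across the channels of a group, which is the inversion of convolution's spatial weight sharing that the excerpt contrasts Outlooker against.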
“…Weight sharing across spatial positions is mainly used in convolution, including normal convolution, depth-wise convolution and point-wise convolution. Weight sharing across channels is adopted in the attention unit [53], its variants [7,8,14,32,35,51,52,55,57,63], and token-mixer MLP in MLP-mixer [49] and ResMLP [50].…”
Section: Related Work (mentioning)
confidence: 99%
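A minimal sketch of the two sharing patterns contrasted in the excerpt above, assuming a token layout of shape (B, N, C) with N = H×W positions; the class names and the Conv1d/Linear choices are illustrative stand-ins, not code from any of the cited papers.

```python
import torch
import torch.nn as nn

class SharedAcrossPositions(nn.Module):
    """Depth-wise-convolution style: one small filter per channel,
    reused at every spatial position."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv1d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)

    def forward(self, x):                 # x: (B, N, C)
        return self.dw(x.transpose(1, 2)).transpose(1, 2)

class SharedAcrossChannels(nn.Module):
    """Token-mixer MLP / attention style: one N x N mixing matrix over
    positions, reused identically for every channel."""
    def __init__(self, num_tokens):
        super().__init__()
        self.mix = nn.Linear(num_tokens, num_tokens)

    def forward(self, x):                 # x: (B, N, C)
        return self.mix(x.transpose(1, 2)).transpose(1, 2)
```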
“…One is to learn homogeneous connection weights, e.g., SENet [26], dynamic convolution [30]. The other is to learn the weights for each region or each position (GENet [25], Lite-HRNet [61], Involution [32]). The attention unit in ViT or local ViT learns dynamic connection weights for each position.…”
Section: Related Work (mentioning)
confidence: 99%
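A hedged sketch of the distinction drawn above: channel weights pooled over all positions (SENet-style, homogeneous across locations) versus weights predicted separately at each position. The module names and the simple 1×1 gate are assumptions for illustration only, not the cited papers' implementations.

```python
import torch
import torch.nn as nn

class ChannelWeights(nn.Module):
    """SE-style: one scalar per channel, pooled over the whole map,
    so every position is re-weighted identically."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        return x * self.gate(x)           # gate shape (B, C, 1, 1)

class PerPositionWeights(nn.Module):
    """Position-wise variant (in the spirit of GENet / Involution):
    a separate weight is predicted at each spatial location."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        return x * self.gate(x)           # gate shape (B, C, H, W)
```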