2022
DOI: 10.48550/arxiv.2201.09792
Preprint

Patches Are All You Need?

Cited by 77 publications (114 citation statements)
References 14 publications
“…ConvMixer [90] uses up to 9×9 convolutions to replace the "mixer" component of ViTs [35] or MLPs [87,88]. MetaFormer [108] suggests that a pooling layer is an alternative to self-attention.…”
Section: Concurrent Work (mentioning)
confidence: 99%
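As context for this statement, the block below is a minimal sketch of the ConvMixer-style mixing it describes: a large depthwise convolution mixes spatial locations and a 1×1 pointwise convolution mixes channels, each followed by an activation and BatchNorm. It assumes PyTorch; the dimension and kernel size are illustrative choices, not values taken from the cited papers.

```python
# Minimal sketch of a ConvMixer-style block (PyTorch assumed).
# `dim` and `kernel_size` are illustrative, not prescribed by the cited work.
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim: int = 256, kernel_size: int = 9):
        super().__init__()
        # Large depthwise convolution acts as the spatial "mixer".
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        # Pointwise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.depthwise(x)   # residual around the spatial-mixing step only
        return self.pointwise(x)

x = torch.randn(1, 256, 32, 32)
print(ConvMixerBlock()(x).shape)  # torch.Size([1, 256, 32, 32])
```

Keeping the residual around the depthwise step alone mirrors how token mixing is separated from channel mixing in the ViT and MLP designs the statement compares against.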
“…Every convolution is followed by a ReLU activation and BatchNorm (BN). It has been demonstrated that depthwise separable convolution works best with large-sized convolutions [17]. To avoid the gradient vanishing problem in deeper layers, each DC layer is designed using skip connections.…”
Section: Deeper Convolution Block (mentioning)
confidence: 99%
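For illustration, the sketch below follows the description in this statement: a depthwise separable convolution (large depthwise kernel plus 1×1 pointwise), each convolution followed by ReLU and BatchNorm, wrapped in a skip connection to ease gradient flow in deeper stacks. It assumes PyTorch; the channel count and kernel size are hypothetical rather than taken from the cited work.

```python
# Hedged sketch of the "deeper convolution" block described above.
# Channel count and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

class DeeperConvBlock(nn.Module):
    def __init__(self, channels: int = 64, kernel_size: int = 7):
        super().__init__()
        self.body = nn.Sequential(
            # Depthwise convolution: one filter per channel (groups=channels).
            nn.Conv2d(channels, channels, kernel_size, padding="same", groups=channels),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels),
            # Pointwise (1x1) convolution mixes channels.
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection helps gradients flow through deeper stacks of blocks.
        return x + self.body(x)

x = torch.randn(2, 64, 56, 56)
print(DeeperConvBlock()(x).shape)  # torch.Size([2, 64, 56, 56])
```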
“…This work extracts large context information for matching by leveraging recent advances in Vision Transformers [11,14,29]. Methods leveraging Transformers' ability to model long-term dependencies have outperformed convolutional neural networks in various high-level computer vision tasks [14,43]. Inspired by these, Jiang et al. [26] introduced an attention-based module to resolve occlusions for optical flow estimation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Moreover, POLA can be viewed as a generalization of the per-pixel overlapping attention explored in [19,34]. Compared with the per-pixel variant, POLA enjoys at least three advantages: 1) it consumes less memory, 2) it can be efficiently implemented in existing deep learning platforms, and 3) it arranges features by patch, which may provide better performance as suggested in recent research [14,29,43].…”
Section: Attention in Transformer (mentioning)
confidence: 99%
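To make the "arranges features by patch" point concrete, the sketch below computes self-attention independently inside non-overlapping windows of a feature map, which keeps each attention matrix small compared with per-pixel global attention. It is a generic window-attention illustration under assumed shapes and names, not the POLA module of the cited paper.

```python
# Illustrative patch/window self-attention (not the cited POLA module).
# Shapes and the window size are assumptions for the example.
import torch

def patch_window_attention(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    """x: (B, H, W, C) feature map; H and W assumed divisible by `window`."""
    B, H, W, C = x.shape
    # Rearrange the map into (num_windows, window*window, C) token groups.
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    # Plain scaled dot-product attention within each window (q = k = v = x).
    attn = torch.softmax(x @ x.transpose(-2, -1) / C**0.5, dim=-1)
    out = attn @ x
    # Restore the (B, H, W, C) layout.
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

x = torch.randn(1, 32, 32, 64)
print(patch_window_attention(x).shape)  # torch.Size([1, 32, 32, 64])
```

With this arrangement the attention cost grows with the window area squared rather than the full image area squared, which is the memory advantage the statement alludes to.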