2021
DOI: 10.48550/arxiv.2109.04454
Preprint

ConvMLP: Hierarchical Convolutional MLPs for Vision

Abstract: MLP-based architectures, which consist of a sequence of consecutive multi-layer perceptron blocks, have recently been found to reach results comparable to convolutional and transformer-based methods. However, most adopt spatial MLPs, which take fixed-dimension inputs, making it difficult to apply them to downstream tasks such as object detection and semantic segmentation. Moreover, single-stage designs further limit performance in other computer vision tasks, and fully connected layers bear heavy comp…

Cited by 19 publications (29 citation statements)
References 39 publications
“…CycleMLP (Chen et al., 2021b) takes pseudo-kernels and samples tokens from different spatial locations for mixing. ConvMLP (Li et al., 2021) incorporates convolution layers and a pyramid structure to achieve local token mixing. Hire-MLP (Guo et al., 2021) rearranges tokens across local regions to gain performance and computational efficiency.…”
Section: Related Work
confidence: 99%
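
The local token-mixing pattern quoted above can be made concrete with a small PyTorch sketch. This is a minimal illustration, not the published ConvMLP block: the class name ConvTokenMixer, the depthwise 3x3 kernel, the BatchNorm placement, and the expansion ratio mlp_ratio are all illustrative assumptions.

import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    # Local token mixing: a depthwise 3x3 convolution mixes neighbouring
    # tokens, then 1x1 convolutions act as a per-token channel MLP.
    # Illustrative sketch, not the exact block from any of the cited papers.
    def __init__(self, dim, mlp_ratio=2):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):                # x: (B, C, H, W), any H and W
        x = x + self.conv(self.norm(x))  # spatial (token) mixing, residual
        x = x + self.mlp(x)              # channel mixing, residual
        return x

y = ConvTokenMixer(64)(torch.randn(1, 64, 56, 56))

Because both the depthwise convolution and the 1x1 projections are resolution-agnostic, the block accepts any H x W, which is the downstream-task advantage these statements allude to.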
“…Based on these pioneering studies, concurrent papers [5,11,18,23,25,28,44,56,57] address new issues and potential in MLP-like architectures. VisionPermutator [18] effectively preserves the spatial dimensions of the input tokens by separately processing token representations along the different dimensions.…”
Section: Related Work
confidence: 99%
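
A rough sketch of the dimension-wise processing attributed to VisionPermutator, again in PyTorch. The Permutator class and the plain sum fusion are simplifying assumptions; the actual paper additionally splits channels into segments before mixing along height and width.

import torch
import torch.nn as nn

class Permutator(nn.Module):
    # Mixes information along the height, width, and channel axes
    # separately, then fuses the three branches by summation.
    def __init__(self, h, w, c):
        super().__init__()
        self.mix_h = nn.Linear(h, h)  # acts along the height axis
        self.mix_w = nn.Linear(w, w)  # acts along the width axis
        self.mix_c = nn.Linear(c, c)  # acts along the channel axis

    def forward(self, x):  # x: (B, H, W, C)
        xh = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along H
        xw = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)  # mix along W
        xc = self.mix_c(x)                                          # mix along C
        return xh + xw + xc

y = Permutator(14, 14, 96)(torch.randn(2, 14, 14, 96))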
“…Beyond the well-established realm of CNNs and transformers, MLP-Mixer [43] and Synthesizer [41] propose a new architecture that exclusively uses MLPs. Based on these pioneering studies [41,43], concurrent works [5,18,23,44] have recently been introduced. For instance, ResMLP [44] emphasizes that MLP-like architectures can take inputs of arbitrary length.…”
Section: Introduction
confidence: 99%
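
The arbitrary-length property this statement attributes to ResMLP follows from applying linear layers only over the channel axis. A minimal PyTorch sketch; the channel width and token counts are arbitrary example values.

import torch
import torch.nn as nn

channel_mlp = nn.Linear(96, 96)         # weights depend only on the channel width
short_seq = torch.randn(1, 49, 96)      # 7x7  = 49 tokens
long_seq = torch.randn(1, 196, 96)      # 14x14 = 196 tokens
print(channel_mlp(short_seq).shape)     # torch.Size([1, 49, 96])
print(channel_mlp(long_seq).shape)      # torch.Size([1, 196, 96])

spatial_mlp = nn.Linear(49, 49)         # weights tied to a fixed token count
# spatial_mlp(long_seq.transpose(1, 2)) # would raise: expected 49 features, got 196

The contrast is the point of the statement: a spatial MLP bakes the token count into its weight matrix, while a channel-only MLP works at any resolution.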
“…There is another special variant that uses only channel projection, called ConvMLP [101]. Its authors call it a hierarchical Convolutional MLP, a light-weight, stage-wise co-design of convolution layers and MLPs.…”
Section: Yu et al. from Baidu
confidence: 99%
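
The "hierarchical, stage-wise" aspect mentioned in this last statement can be sketched as a pyramid of stages, each ending in a strided convolution. This is a hypothetical miniature, not ConvMLP's published configuration; the stage helper, channel widths, and depth are illustrative choices.

import torch
import torch.nn as nn

def stage(dim_in, dim_out):
    # One pyramid stage: channel projections (1x1 convs) followed by a
    # strided convolution that halves resolution and widens channels.
    return nn.Sequential(
        nn.Conv2d(dim_in, dim_in, kernel_size=1), nn.GELU(),
        nn.Conv2d(dim_in, dim_in, kernel_size=1),
        nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1),
    )

backbone = nn.Sequential(stage(64, 128), stage(128, 256), stage(256, 512))
feats = backbone(torch.randn(1, 64, 56, 56))
print(feats.shape)  # torch.Size([1, 512, 7, 7]) -- a hierarchical feature map

Each stage emits a feature map at a coarser resolution, which is what lets pyramid-style backbones plug into detection and segmentation heads.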