“…The recent development of vision transformers (ViTs) has revolutionized the computer vision field and set new states of the art in a variety of tasks, such as image classification (Dosovitskiy et al., 2020; Chu et al., 2021), object detection (Carion et al., 2020; Zhu et al., 2020; Dai et al., 2021a;b), and semantic segmentation (Li et al., 2017; Strudel et al., 2021; Zheng et al., 2021; Cheng et al., 2021). The successful structure of alternating spatial mixing and channel mixing in ViTs has also motivated the emergence of high-performance MLP-like deep architectures (Tolstikhin et al., 2021; Tang et al., 2022; Wei et al., 2022) and promoted the evolution of better CNNs (Ding et al., 2022; Guo et al., 2022). In addition to architecture design, an improved training strategy can also greatly boost the performance of a trained deep model (Jiang et al., 2021; Touvron et al., 2022).…”