2022
DOI: 10.48550/arxiv.2203.06108

Active Token Mixer

Abstract: This paper presents ActiveMLP, a general MLP-like backbone for computer vision. The three existing dominant network families, i.e., CNNs, Transformers and MLPs, differ from each other mainly in the ways to fuse contextual information into a given token, leaving the design of more effective token-mixing mechanisms at the core of backbone architecture development. In ActiveMLP, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens …

Cited by 4 publications (5 citation statements)
References 50 publications
“…ActiveMLP 85 dynamically estimates the offset, rather than setting it manually as AS-MLP and CycleMLP do (Figure 6G). It first predicts, per channel, the spatial locations of helpful contextual features along each direction, and then gathers and fuses them.…”
Section: Block of MLP Variants
Confidence: 99%
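The mechanism the statement describes — predicting a per-channel spatial offset for each token and gathering features from the predicted locations before fusing them — can be sketched in numpy. This is a hypothetical simplification for one horizontal direction, not the paper's implementation: `offset_fc_w` and `mix_fc_w` are assumed linear-layer weights, and offsets are rounded and clipped for the gather.

```python
import numpy as np

def active_token_mixer(x, offset_fc_w, mix_fc_w):
    """Hypothetical ATM-style mixer along one spatial direction.

    x: (H, W, C) feature map.
    offset_fc_w: (C, C) assumed weights of a linear layer predicting, per
        channel, a horizontal offset for every token.
    mix_fc_w: (C, C) assumed weights of the channel projection that fuses
        the gathered contextual features.
    """
    H, W, C = x.shape
    # 1) Predict a per-channel horizontal offset for every token.
    offsets = x @ offset_fc_w                          # (H, W, C), real-valued
    offsets = np.clip(np.round(offsets), -(W - 1), W - 1).astype(int)
    # 2) Gather: each channel takes its value from the predicted column.
    cols = np.arange(W)[None, :, None]                 # (1, W, 1)
    src = np.clip(cols + offsets, 0, W - 1)            # (H, W, C) source columns
    rows = np.arange(H)[:, None, None]
    chans = np.arange(C)[None, None, :]
    gathered = x[rows, src, chans]                     # (H, W, C)
    # 3) Fuse the gathered context with a channel projection.
    return gathered @ mix_fc_w
```

With zero predicted offsets and an identity projection, the mixer degenerates to the identity, which makes the contrast with AS-MLP/CycleMLP concrete: there the offsets are fixed by hand, here they are a learned function of the input.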
“…Approaches based on shifting the feature map followed by a channel projection further reduce computational complexity, lowering both the parameter count and the FLOPs. MS-MLP 87 adds some depthwise convolutions and ActiveMLP 85 adds some channel projections, but neither changes the overall complexity. Moreover, the number of weights is decoupled from the image resolution, so resolution no longer constrains these variants.…”
Section: Block of MLP Variants
Confidence: 99%
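The shift-then-project scheme mentioned above can be illustrated with a minimal numpy sketch in the spirit of AS-MLP/CycleMLP (a hypothetical simplification, not either paper's exact block): channel groups are shifted by fixed horizontal offsets, then a single channel projection mixes them. The only learned weights are the C×C projection, so the parameter count is independent of the image resolution.

```python
import numpy as np

def axial_shift_mix(x, w):
    """Sketch of shift-then-project token mixing (hypothetical simplification).

    x: (H, W, C) feature map, with C assumed divisible by 3.
    w: (C, C) channel-projection weights -- the only parameters, so the
       weight count does not depend on H or W.
    """
    H, W, C = x.shape
    shifts = [-1, 0, 1]                    # fixed offsets, one per channel group
    shifted = x.copy()
    for g, s in enumerate(shifts):
        ch = slice(g * C // 3, (g + 1) * C // 3)
        # Shift this channel group horizontally by its fixed offset.
        shifted[:, :, ch] = np.roll(x[:, :, ch], s, axis=1)
    # A plain channel projection then mixes the spatially shifted groups.
    return shifted @ w
```

The contrast with the ATM quote above is that these offsets are constants; ActiveMLP instead predicts them from the input.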
“…ViTs can achieve better accuracy/computation trade-off than conventional CNNs, where one of the working mechanisms is the alternation between spatial mixing (e.g., SA) and channel mixing (e.g., MLP) (Tolstikhin et al., 2021). Based on this, some works have explored different spatial mixing strategies in addition to self-attention, including spatial MLP (Tolstikhin et al., 2021; Tang et al., 2022; Wei et al., 2022) and depth-wise convolution (Ding et al., 2022; Guo et al., 2022). For an image X ∈ R^{H×W×C}, they first perform patch-wise image tokenization to obtain a tokenized image representation Z ∈ R^{N×d}, where N is the number of tokens and d is the number of channels.…”
Section: Generalizing TL-Align Beyond ViTs
Confidence: 99%
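The patch-wise tokenization step X ∈ R^{H×W×C} → Z ∈ R^{N×d} that the statement refers to can be sketched in a few lines of numpy (non-overlapping patches; H and W assumed divisible by the patch size, and the learned linear embedding that usually follows is omitted):

```python
import numpy as np

def patch_tokenize(x, p):
    """Split an (H, W, C) image into non-overlapping p x p patches and
    flatten each patch into one token.

    Returns Z of shape (N, d) with N = (H // p) * (W // p) and d = p * p * C.
    H and W are assumed divisible by p.
    """
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)         # (N, d)
```

After this step, spatial mixing operates across the N token positions while channel mixing operates across the d features of each token, which is the alternation the quote describes.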
“…The recent developments of vision transformers (ViTs) have revolutionized the computer vision field and set new state-of-the-arts in a variety of tasks, such as image classification (Dosovitskiy et al., 2020; Chu et al., 2021), object detection (Carion et al., 2020; Zhu et al., 2020; Dai et al., 2021a;b), and semantic segmentation (Li et al., 2017; Strudel et al., 2021; Zheng et al., 2021; Cheng et al., 2021). The successful structure of alternating spatial mixing and channel mixing in ViTs also motivates the rise of high-performance MLP-like deep architectures (Tolstikhin et al., 2021; Tang et al., 2022; Wei et al., 2022) and promotes the evolution of better CNNs (Ding et al., 2022; Guo et al., 2022). In addition to architecture designs, an improved training strategy can also greatly boost the performance of a trained deep model (Jiang et al., 2021; Touvron et al., 2022).…”
Section: Introduction
Confidence: 99%