2022
DOI: 10.1609/aaai.v36i2.20133

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Abstract: Transformers have sprung up in the field of computer vision. In this work, we explore whether the core self-attention module in the Transformer is the key to achieving excellent performance in image recognition. To this end, we build an attention-free network called sMLPNet based on existing MLP-based vision models. Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module. For 2D image tokens, sMLP applies 1D MLP along the axial directions and the parameters are shared among rows or columns. …
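
The abstract's description of sMLP maps naturally onto a few lines of code. Below is a minimal PyTorch sketch of the axial token mixing it describes; the module name SparseMLP, the (B, H, W, C) token layout, and fusing the identity and two axial branches through a single linear layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    """Axial token mixing: 1D MLPs along height and width, with weights
    shared across rows/columns (a sketch, not the official sMLPNet code)."""

    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        self.mix_w = nn.Linear(w, w)  # one MLP mixes a row; shared by every row and channel
        self.mix_h = nn.Linear(h, h)  # one MLP mixes a column; shared by every column and channel
        self.fuse = nn.Linear(3 * channels, channels)  # assumed fusion of the three branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of image tokens
        x_w = self.mix_w(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # mix along width
        x_h = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along height
        return self.fuse(torch.cat([x, x_w, x_h], dim=-1))           # back to (B, H, W, C)

tokens = torch.randn(2, 14, 14, 96)         # a 14x14 grid of 96-dim tokens
block = SparseMLP(channels=96, h=14, w=14)
print(block(tokens).shape)                  # torch.Size([2, 14, 14, 96])
```

Because each 1D MLP is shared across rows (or columns), the token-mixing weights scale as H^2 + W^2 rather than the (HW)^2 of a full token-mixing MLP, which is the sparsity the title refers to.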


Cited by 41 publications (15 citation statements). References 23 publications.
Citation types: 0 supporting, 13 mentioning, 0 contrasting.

“…Can we find such a PWLNN by explicitly seeking a shallow PWLNN or implicitly regularizing the learning of a PWL-DNN? What are the differences and relations between PWLNNs and other kinds of NNs that address locally-dominant features [195]?…”
Section: Discussion (mentioning)
confidence: 99%
“…The module first extracts the long-term context dependencies of each modality using an LSTM, and then derives the temporal vectors of each modality. After that, the sparse MLP [57] is used to mix the temporal-importance information of the two modalities, yielding an attention vector that carries interaction information. Finally, this attention vector is used to guide multimodal feature fusion.…”
Section: TAMF Module (mentioning)
confidence: 99%
“…2) The temporal features of the two modalities are concatenated to obtain the vector Concat_vector. To let the timing information of the two modalities interact, we mix Concat_vector along its vertical and horizontal directions through the weight sharing and sparse connections of the sparse MLP [57], obtaining the mixed attention vector x_mix.…”
Section: TAMF Module (mentioning)
confidence: 99%
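
For intuition, here is a rough PyTorch sketch of the interaction step the two statements above describe, assuming each modality yields (B, T, D) temporal features (e.g. LSTM outputs). The class name TemporalMix, the stacking along the time axis, and the sigmoid gating at the end are assumptions for illustration, not the cited paper's code.

```python
import torch
import torch.nn as nn

class TemporalMix(nn.Module):
    """Sketch: concatenate two modalities' temporal features, then mix them
    along the time ("vertical") and feature ("horizontal") axes with
    weight-shared 1D linear layers, in the spirit of sMLP [57]."""

    def __init__(self, t: int, d: int):
        super().__init__()
        self.mix_time = nn.Linear(2 * t, 2 * t)  # mixes across the 2T stacked time steps
        self.mix_feat = nn.Linear(d, d)          # mixes across the D feature dimensions

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, T, D) temporal features of the two modalities
        concat = torch.cat([feat_a, feat_b], dim=1)                    # (B, 2T, D)
        mixed = self.mix_time(concat.transpose(1, 2)).transpose(1, 2)  # time mixing
        mixed = self.mix_feat(mixed)                                   # feature mixing
        return torch.sigmoid(mixed)  # x_mix as attention weights (gating is assumed)
```

The returned x_mix would then weight the concatenated features before fusion, as the statements above describe.
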
“…sMLP block. Chuanxin Tang et al. proposed Sparse MLP (sMLP) [21] building on MLP-based vision models, replacing the MLP module in the token-mixing step with a new sMLP module. For a 2D image, sMLP applies a 1D MLP along the image height and width, so the parameters are shared between rows or columns.…”
Section: GPA-TUNet (mentioning)
confidence: 99%
“…To solve these problems, we design a new attention mechanism, GPA, and adopt the Sparse MLP (sMLP) proposed by Chuanxin Tang et al. [21]. We combine GPA with the Transformer as the encoder.…”
Section: Introduction (mentioning)
confidence: 99%