2021
DOI: 10.48550/arxiv.2111.12527
Preprint
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Abstract: Self-attention has become an integral component of recent network architectures, e.g., the Transformer, which dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed a few MLP-like architectures that show great potential. However, the current MLP-like architectures are not good at capturing local details and lack progressive understanding of co…

Cited by 6 publications (16 citation statements)
References 32 publications (62 reference statements)
“…Video Representation Learning. In the past decade, a large number of deep 2D [18,27,39,63,71] and 3D [8,17,21,68,72,77] models have been proposed to extract efficient spatial and temporal representations for video. Recently, inspired by the success of Transformer in NLP field [14,69], visual Transformers [5,12,42,49] are sprung up for video representation.…”
Section: Related Work
confidence: 99%
“…Afterwards, Transformers spring up and make splendid breakthroughs on various vision tasks [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. Most recently, the multi-layer perceptrons (MLPs) based architectures [28,29] have regained their light and been demonstrated capable of achieving stunning results on vision tasks [30,28,31,32,29,33,34]. A situation in which these three families of backbone architectures are contending has been formed.…”
Section: Introduction
confidence: 99%
“…Transformer-based architectures [12,13,14,35,16,15,36,37,17,38,39] perform message passing from other tokens into the query token based on the calculated pairwise attention weights, depending on the affinities between tokens in the embedding space. MLP-based architectures mostly enable information interaction through spatial fully connections across all tokens [28,30,40,34] or across certain tokens selected with hand-crafted rules in a deterministic manner [31,33,41,32,29,42]. However, the fully connection across all tokens makes the network incapable of coping with variable input resolutions, limiting the usage on downstream tasks (e.g., object detection and segmentation).…”
Section: Introduction
confidence: 99%
“…Recent top-performing approaches to solving video understanding tasks are based on supervised learning with a large amount of labeled data for training. Due to the strong data fitting capacity of deep convolutional neural networks, competitive performance can be achieved for recognizing actions in videos (Carreira et al 2017;Zhang et al 2021). One of the key factors for the success may owe to the strong correlation between action class and object/background known as representation bias in (Li et al 2018;Choi et al 2019).…”
Section: Introduction
confidence: 99%