2021
DOI: 10.48550/arxiv.2111.12527
Preprint
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Abstract: Self-attention has become an integral component of recent network architectures, e.g., the Transformer, which dominate major image and video benchmarks, because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed a few MLP-like architectures that show great potential. However, the current MLP-like architectures are not good at capturing local details and lack progressive understanding of co…

Cited by 6 publications (16 citation statements)
References 32 publications (62 reference statements)
“…Video Representation Learning. In the past decade, a large number of deep 2D [18,27,39,63,71] and 3D [8,17,21,68,72,77] models have been proposed to extract efficient spatial and temporal representations for video. Recently, inspired by the success of Transformer in NLP field [14,69], visual Transformers [5,12,42,49] are sprung up for video representation.…”
Section: Related Work
confidence: 99%
“…Afterwards, Transformers spring up and make splendid breakthroughs on various vision tasks [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. Most recently, the multi-layer perceptrons (MLPs) based architectures [28,29] have regained their light and been demonstrated capable of achieving stunning results on vision tasks [30,28,31,32,29,33,34]. A situation in which these three families of backbone architectures are contending has been formed.…”
Section: Introduction
confidence: 99%
“…Transformer-based architectures [12,13,14,35,16,15,36,37,17,38,39] perform message passing from other tokens into the query token based on the calculated pairwise attention weights, depending on the affinities between tokens in the embedding space. MLP-based architectures mostly enable information interaction through spatial fully connections across all tokens [28,30,40,34] or across certain tokens selected with hand-crafted rules in a deterministic manner [31,33,41,32,29,42]. However, the fully connection across all tokens makes the network incapable of coping with variable input resolutions, limiting the usage on downstream tasks (e.g., object detection and segmentation).…”
Section: Introduction
confidence: 99%
“…Recent top-performing approaches to solving video understanding tasks are based on supervised learning with a large amount of labeled data for training. Due to the strong data fitting capacity of deep convolutional neural networks, competitive performance can be achieved for recognizing actions in videos (Carreira et al 2017;Zhang et al 2021). One of the key factors for the success may owe to the strong correlation between action class and object/background known as representation bias in (Li et al 2018;Choi et al 2019).…”
Section: Introduction
confidence: 99%