MAXIM: Multi-Axis MLP for Image Processing

Tu, Zhengzhong; Talebi, Hossein; Zhang, Han; Feng, Yijun; Milanfar, Peyman; Bovik, Alan C.; Li, Yinxiao

doi:10.48550/arxiv.2201.02973

Cited by 8 publications

(9 citation statements)

References 72 publications


“…Tu et al. [130] propose MAXIM, a UNet-shaped hierarchical structure that supports long-range interactions enabled by spatially gated MLPs. MAXIM contains two MLP-based building blocks: a multi-axis-gated MLP and a cross-gating block, both are variants of the gMLP block.…”

Section: Applications of MLP Variants (mentioning)

confidence: 99%

Are we ready for a new paradigm shift? A survey on visual deep MLP

Liu, Tao, et al. 2022. Patterns.


“…Sequential vs. parallel. In our approach, we sequentially stack the multi-axis attention modules following [54,84], while there also exist other models that adopt a parallel design [81,98].…”

Section: Ablation Studies (mentioning)

confidence: 99%

MaxViT: Multi-Axis Vision Transformer

Tu, Talebi, Zhang, et al. 2022. Preprint.

Self Cite


Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. We will make the code and models publicly available.
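The two components named in the abstract, blocked local and dilated global attention, can be illustrated by how they group token positions. The following is a minimal sketch and not the authors' implementation: it assumes that the local part groups tokens into non-overlapping windows and that the global part groups tokens lying on a uniformly strided lattice. The function names, the `block` and `grid` parameters, and the requirement that the height and width divide evenly are all assumptions made for illustration.

```python
from collections import defaultdict


def block_groups(height, width, block):
    """Group token coordinates into non-overlapping block x block windows (local)."""
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // block, j // block)].append((i, j))
    return dict(groups)


def grid_groups(height, width, grid):
    """Group token coordinates on a dilated grid x grid lattice (sparse global).

    Assumes height and width are divisible by `grid`.
    """
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            # Tokens sharing the same offset inside each stride cell form one group,
            # so every group spans the whole image at a fixed dilation.
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return dict(groups)


def attention_pairs(groups):
    """Count query-key pairs if each group runs full attention internally."""
    return sum(len(tokens) ** 2 for tokens in groups.values())


local = block_groups(8, 8, block=4)
dilated = grid_groups(8, 8, grid=4)
print(len(local), len(dilated))   # number of groups of each kind
print(dilated[(0, 0)][:4])        # first tokens of one dilated group
print(attention_pairs(local))     # pair count for blocked attention
```

Under these assumptions every token attends to a fixed number of others (`block * block` or `grid * grid`), so the pair count grows in proportion to the number of tokens rather than its square, which is consistent with the linear complexity the abstract claims.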


“…It also achieves promising results in restoration tasks [11,38,80,43,4,37,18,20,5,89,46,72]. In particular, for video restoration, Cao et al [4] propose the first transformer model for video SR, while Liang et al [37] propose an unified framework for video SR, deblurring and denoising.…”

Section: Vision Transformer (mentioning)

confidence: 99%

Recurrent Video Restoration Transformer with Guided Deformable Attention

Liang, Yi, Xiang, et al. 2022. Preprint.


Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, which would result in different merits and drawbacks. Typically, the former has the advantage of temporal information fusion. However, it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames; however, it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework which can achieve a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, the guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime. The codes are available at https://github.com/JingyunLiang/RVRT.
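The control flow described in the abstract, frames processed jointly inside a clip while clips are chained recurrently, can be sketched as follows. This is an illustrative outline only, not RVRT itself: `split_into_clips`, `restore_video`, and the `update_clip` callback are hypothetical names, frames are stood in for by plain numbers, and the guided deformable attention used for clip-to-clip alignment is not modeled.

```python
def split_into_clips(frames, clip_len):
    """Divide a frame sequence into consecutive clips of at most clip_len frames."""
    return [frames[k:k + clip_len] for k in range(0, len(frames), clip_len)]


def restore_video(frames, clip_len, update_clip):
    """Process clips in order, feeding each clip the previously inferred clip feature.

    update_clip(clip, previous) is a hypothetical callback that updates all
    frames of one clip together and returns the new clip feature (a list).
    """
    previous = None
    restored = []
    for clip in split_into_clips(frames, clip_len):
        previous = update_clip(clip, previous)  # recurrent across clips
        restored.extend(previous)
    return restored


def toy_update(clip, previous):
    """Toy stand-in: blend every frame with the mean of the previous clip feature."""
    if previous is None:
        return list(clip)
    carry = sum(previous) / len(previous)
    return [(frame + carry) / 2 for frame in clip]


print(split_into_clips(list(range(5)), 2))
print(restore_video([0, 0, 4, 4], clip_len=2, update_clip=toy_update))
```

The clip length controls the trade-off the abstract mentions: a clip length of 1 reduces to purely frame-by-frame recurrence, while a clip length equal to the video length processes every frame in parallel.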

