2021
DOI: 10.48550/arxiv.2106.03106

Uformer: A General U-Shaped Transformer for Image Restoration

Abstract: In this paper, we present Uformer, an effective and efficient Transformer-based architecture, in which we build a hierarchical encoder-decoder network using the Transformer block for image restoration. Uformer has two core designs to make it suitable for this task. The first key element is a local-enhanced window Transformer block, where we use non-overlapping window-based self-attention to reduce the computational requirement and employ the depth-wise convolution in the feedforward network to further improve …
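The window-based self-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch of the partition/reverse step (an illustration only, not the authors' code; Uformer additionally applies a depth-wise convolution inside the feed-forward network, which is omitted here):

```python
import numpy as np

def window_partition(x, win):
    # Split an (H, W, C) feature map into non-overlapping (win x win)
    # windows, each flattened to win*win tokens. Attention computed
    # independently per window costs O(H*W * win^2) rather than the
    # O((H*W)^2) of full global self-attention.
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_reverse(windows, win, H, W):
    # Inverse operation: stitch the per-window token sequences back
    # into the (H, W, C) feature map.
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

# A 4x4 map with 2-channel features splits into four 2x2 windows,
# each a sequence of 4 tokens, and round-trips exactly.
x = np.arange(32, dtype=float).reshape(4, 4, 2)
w = window_partition(x, 2)          # shape (4, 4, 2): 4 windows, 4 tokens
x2 = window_reverse(w, 2, 4, 4)     # recovers the original map
```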

Cited by 72 publications (124 citation statements)
References 81 publications
“…Recently, Transformer-based models [38,67,83,90] have achieved promising performance in various vision tasks, such as image recognition [6,14,21,39,50,51,52,75,90] and image restoration [11,40,89]. Some methods have tried to use Transformer for video modelling by extending the attention mechanism to the temporal dimension [2,3,38,53,60].…”
Section: Vision Transformer
confidence: 99%
“…Such a strategy will inevitably cause patch boundary artifacts when applied on larger images using crop-ping [14]. Local-attention based Transformers [51,95] ameliorate this issue, but they are also constrained to have limited sizes of receptive field, or to lose non-locality [23,91], which is a compelling property of Transformers and MLP models relative to hierarchical CNNs.…”
Section: Enhancement
confidence: 99%
“…Advanced components developed for high-level vision tasks have been brought into low-level vision tasks as well. Residual and dense connections [42,93,117,118], the multi-scale feature learning [19,40,95], attention mechanisms [64,89,107,108,118],…”
Section: Related Work
confidence: 99%