2021
DOI: 10.48550/arxiv.2103.15691
Preprint

ViViT: A Video Vision Transformer

Cited by 106 publications (256 citation statements)
References 0 publications
“…Recently, Transformer-based models [38,67,83,90] have achieved promising performance in various vision tasks, such as image recognition [6,14,21,39,50-52,75,90] and image restoration [11,40,89]. Some methods have tried to use Transformers for video modelling by extending the attention mechanism to the temporal dimension [2,3,38,53,60]. However, most of them are designed for visual recognition, which is fundamentally different from restoration tasks.…”
Section: Vision Transformer (mentioning, confidence: 99%)
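This statement describes extending self-attention to the temporal dimension for video. Below is a minimal sketch of one common factorisation of that idea (spatial attention within each frame, then temporal attention across frames); the class name, sizes, and tensor layout are illustrative assumptions, not the exact ViViT design.

# A minimal sketch (not the exact ViViT architecture) of factorised
# space-time attention: attend over patches within each frame, then
# over frames at each spatial location.
import torch
import torch.nn as nn

class FactorisedSpaceTimeAttention(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_patches, dim) token grid for a video clip.
        b, t, n, d = x.shape
        # Spatial attention: sequences are the patches of one frame.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # Temporal attention: sequences are the frames at one location.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

tokens = torch.randn(2, 8, 196, 768)          # 2 clips, 8 frames, 14x14 patches
out = FactorisedSpaceTimeAttention(768)(tokens)
print(out.shape)                               # torch.Size([2, 8, 196, 768])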
“…The elegance of ViT [23] has also motivated similar model designs with simpler global operators such as MLP-Mixer [85], gMLP [53], GFNet [74], and FNet [43], to name a few. Despite successful applications to many high-level tasks [4,23,56,83,87,100], the efficacy of these global models on low-level enhancement and restoration problems has not been studied extensively. The pioneering works on Transformers for low-level vision [9,14] directly applied full self-attention, which only accepts relatively small patches of fixed sizes (e.g., 48×48).…”
Section: Enhancement (mentioning, confidence: 99%)
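As a sketch of the "simpler global operator" idea this statement contrasts with attention, here is a minimal token-mixing block in the MLP-Mixer style: tokens are mixed by a plain MLP applied across the patch dimension. Hidden sizes and names are assumptions for the sketch, not values from the cited papers.

# A hedged sketch of an MLP-Mixer-style block: a global operator that
# replaces self-attention with an MLP across the patch axis.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):  # hypothetical, illustrative sizes
    def __init__(self, num_patches: int, dim: int,
                 token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Mixes information ACROSS patches (global, like attention).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        # Mixes information WITHIN each patch, across channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        y = self.norm1(x).transpose(1, 2)          # (batch, dim, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)  # token mixing + residual
        return x + self.channel_mlp(self.norm2(x)) # channel mixing + residual

x = torch.randn(2, 196, 512)
print(MixerBlock(196, 512)(x).shape)               # torch.Size([2, 196, 512])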
“…Apart from the sequence-to-sequence structure, the efficiency of PVT [39] and Swin Transformer [30] has sparked much interest in exploring Hierarchical Vision Transformers (HVT) [14,22,41,44]. ViT has also been extended to low-level tasks and dense prediction problems [2,6,20]. In particular, concurrent semantic segmentation methods driven by ViT present impressive performance.…”
Section: Vision Transformer (mentioning, confidence: 99%)
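The core step behind the hierarchical transformers this statement mentions (PVT, Swin) is downsampling between stages. A minimal sketch of Swin-style 2x2 patch merging is given below; details (bias-free reduction, normalisation placement) follow the Swin design in spirit, but dimensions are illustrative assumptions.

# A minimal sketch of hierarchical downsampling between transformer
# stages: concatenate each 2x2 window of tokens, then project, halving
# spatial resolution while doubling channel width.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim); height and width must be even.
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 window
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4*dim)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2*dim)

x = torch.randn(2, 56, 56, 96)                   # stage-1 feature map
print(PatchMerging(96)(x).shape)                 # torch.Size([2, 28, 28, 192])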