2022
DOI: 10.48550/arxiv.2201.12288
Preprint

VRT: A Video Restoration Transformer

Abstract: Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding window strategy or a recurrent architecture, which either is restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we prop…
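The sliding-window strategy the abstract contrasts against can be made concrete with a minimal sketch (a hypothetical helper, not code from the paper): each target frame is restored from a short clip of its temporal neighbors, with indices clamped at sequence boundaries (replicate padding).

```python
import numpy as np

def sliding_window_clips(frames: np.ndarray, radius: int = 2) -> np.ndarray:
    """For each frame t, gather the clip [t - radius, t + radius],
    clamping indices at the sequence boundaries (replicate padding).

    frames: (T, H, W, C) array of low-quality frames.
    Returns: (T, 2*radius + 1, H, W, C) array of per-frame input clips.
    """
    T = frames.shape[0]
    # Index matrix: row t holds [t - radius, ..., t + radius].
    idx = np.arange(T)[:, None] + np.arange(-radius, radius + 1)[None, :]
    idx = np.clip(idx, 0, T - 1)  # replicate the first/last frame at the edges
    return frames[idx]
```

A restoration network under this strategy maps each clip to one output frame, which is exactly the frame-by-frame restriction the abstract points out: neighboring clips overlap heavily, yet each output is computed independently.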


Cited by 27 publications (69 citation statements)
References 58 publications
“…Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25]. Due to the impressive performance, Transformer has also been introduced for low-level vision tasks [5,54,37,29,3,62,28,26]. Specifically, [5] develops a ViT-style network and introduces multi-task pre-training for image processing.…”
Section: Vision Transformer
confidence: 99%
“…SwinIR [29] proposes an image restoration Transformer based on [36]. [3,28] introduce Transformer-based networks to video restoration. [26] adopts self-attention mechanism and multirelated-task pre-training strategy to further refresh the state-of-the-art of SR.…”
Section: Vision Transformer
confidence: 99%
“…Attention-based networks, i.e., Transformers, have shown great performance and gained much popularity in various high-level computer vision tasks [7,9,16,34,35,53,55]. Recently, Transformer has also been introduced for low-level vision and tends to learn global interactions to focus on enhancing details and important regions [8,11,31,32,52]. Chen et al [11] were the first to propose using the Transformer-based backbone IPT for various image restoration problems.…”
Section: Related Work
confidence: 99%
“…Further, we apply a pyramid structure to improve the alignment on top of the flow-guided DCN. On the other hand, the self-attention mechanism and Transformer have shown promising performance in most computer vision tasks [31,32,35]. Therefore, to better use the inter-frame information, we incorporate Swin Transformer blocks and groups in our architecture to capture both global and local contexts for long-range dependency modeling [32,35].…”
Section: Introduction
confidence: 99%
“…However, the RNN-based methods inevitably suffer from the vanishing gradient problem and have difficulty in capturing long-range temporal dependencies. Recently, the emerging Transformer model has been applied in image and video restoration tasks (Cai et al, 2021b; Liang et al, 2022; Lin et al, 2022b; Cao et al, 2021; Cai et al, 2022). Nonetheless, the token-based self-attention module has enormous computational and memory cost when restoring long video sequences.…”
Section: Video Restoration
confidence: 99%
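The quadratic cost the last excerpt refers to can be illustrated with a back-of-the-envelope sketch (illustrative only; the count ignores the linear projections and softmax, and the token counts below are assumed, not taken from any cited paper):

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for the two N x N stages of full self-attention:
    scores = Q @ K^T (N*N*d MACs) and out = A @ V (N*N*d MACs).
    The linear Q/K/V/output projections are omitted; they scale only
    linearly in N, so the N*N terms dominate for long sequences.
    """
    return 2 * num_tokens * num_tokens * dim

# Tokens grow linearly with video length, so doubling the number of
# frames quadruples both the attention MACs and the N x N attention map.
short = attention_macs(num_tokens=16 * 64, dim=96)  # e.g. 16 frames, 64 tokens each
long_ = attention_macs(num_tokens=32 * 64, dim=96)  # 32 frames
```

This quadratic growth in the sequence length is why window-based designs (such as the Swin-style blocks mentioned in the excerpts above) restrict attention to local windows rather than attending over all tokens of a long video.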