2022
DOI: 10.48550/arxiv.2203.14186
Preprint

RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution

Cited by 3 publications (6 citation statements)
References 0 publications
“…SwinIR [21] used Swin Transformer to handle the image restoration task and proposed residual Swin Transformer blocks. RSTT [35] built a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model.…”
Section: Vision Transformer
Mentioning, confidence: 99%
“…Referring to [35], we designed a hierarchical U-shaped Transformer named Real-time Video Frame Interpolation Transformer (RVFIT), which spatially expands input video sequences while considering temporal fluency without dividing the model into temporal and spatial super-resolution modules. This design is superior to previous CNN-based frame interpolation methods because of its parallelism in structure, which can accelerate the inference process based on guaranteed performance.…”
Section: Network Overview
Mentioning, confidence: 99%
“…SwinIR [39] used Swin Transformer to handle the image restoration task and proposed residual Swin Transformer blocks. RSTT [40] built a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model.…”
Section: Video Transformer
Mentioning, confidence: 99%
“…It also achieves promising results in restoration tasks [11,38,80,43,4,37,18,20,5,89,46,72]. In particular, for video restoration, Cao et al. [4] propose the first transformer model for video SR, while Liang et al. [37] propose a unified framework for video SR, deblurring and denoising.…”
Section: Vision Transformer
Mentioning, confidence: 99%