2022
DOI: 10.48550/arxiv.2207.08494
Preprint

Rethinking Alignment in Video Super-Resolution Transformers

Abstract: The alignment of adjacent frames is considered an essential operation in video super-resolution (VSR). Advanced VSR models, including the latest VSR Transformers, are generally equipped with well-designed alignment modules. However, the progress of the self-attention mechanism may violate this common sense. In this paper, we rethink the role of alignment in VSR Transformers and make several counter-intuitive observations. Our experiments show that: (i) VSR Transformers can directly utilize multi-frame informat…

Cited by 3 publications (4 citation statements)
References 38 publications
“…The first effort we made in this paper was to introduce the large receptive field design into the attention mechanism. This is in line with other recent design trends using large kernel sizes [16], as well as the design principles of transformers [9,36,44]. We show the advantages of using large kernel convolutions in the attention branch.…”
Section: Introduction (supporting, confidence: 89%)
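As a rough illustration of the "large kernel convolutions in the attention branch" idea quoted above: the sketch below assumes a PyTorch implementation and a VAN-style decomposition (depthwise, depthwise dilated, and pointwise convolutions); the class name, kernel sizes, and channel count are illustrative choices, not the cited paper's exact configuration.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Illustrative large-kernel attention branch (VAN-style decomposition).

    A large receptive field is approximated by stacking a small depthwise conv,
    a depthwise dilated conv, and a 1x1 pointwise conv; the output is used as an
    attention map that reweights the input features elementwise.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 5x5 depthwise conv gathers local context.
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7x7 depthwise conv with dilation 3 (effective 19x19) extends the receptive field.
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9, dilation=3,
                                    groups=channels)
        # 1x1 conv mixes channels to produce the attention map.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x  # reweight the input with the large-receptive-field attention map


if __name__ == "__main__":
    feats = torch.randn(1, 32, 48, 48)
    print(LargeKernelAttention(32)(feats).shape)  # torch.Size([1, 32, 48, 48])
```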
“…Vision Transformers [15] rely on attention mechanisms to achieve excellent performance. Many works have proved that introducing large receptive fields and local windows [9,44] in the attention branch improves the SR effect. However, many advanced design ideas have not been verified in designing the attention mechanism for convolutional lightweight SR networks.…”
Section: Introduction (mentioning, confidence: 99%)
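A minimal sketch of the "local windows in the attention branch" idea, assuming PyTorch; it omits the learned query/key/value projections and positional terms that real window-attention blocks use, and the window size is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    """Toy self-attention restricted to non-overlapping local windows.

    x: (B, H, W, C) feature map; H and W are assumed divisible by `window`.
    Each window of `window x window` pixels attends only to itself, which keeps
    the cost linear in image size while still enlarging the receptive field
    compared with a small fixed-size convolution.
    """
    B, H, W, C = x.shape
    # Partition into (B * num_windows, window*window, C) token groups.
    x = x.view(B, H // window, window, W // window, window, C)
    tokens = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

    # Plain (unprojected) scaled dot-product attention inside each window.
    scores = tokens @ tokens.transpose(-2, -1) / C ** 0.5
    out = F.softmax(scores, dim=-1) @ tokens

    # Reverse the window partition back to (B, H, W, C).
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


if __name__ == "__main__":
    feats = torch.randn(2, 32, 32, 16)
    print(window_self_attention(feats).shape)  # torch.Size([2, 32, 32, 16])
```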
“…Second, we rethink the network structure of RFDN, simplify it and introduce large convolution kernels in the branches, following the design principles of recent studies on large-scale convolution kernels [13]. Some works have shown that introducing large receptive field convolution kernels and local windows [14,15] in the module branches can improve the performance of SR networks, which is also verified in this paper. Third, considering that large-scale convolution kernels bring performance gain at the cost of parameter size, we use depthwise separable convolution to split large convolution kernels, and implement large receptive field convolution operations by depthwise separable convolution and depthwise separable dilated convolution.…”
Supporting (confidence: 66%)
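To illustrate the parameter-size trade-off described in this statement, the sketch below (assuming PyTorch, with an arbitrary channel count and kernel sizes) compares one dense large-kernel convolution with an equal-receptive-field split into depthwise, depthwise dilated, and pointwise convolutions.

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

channels = 48

# One dense 15x15 convolution.
dense = nn.Conv2d(channels, channels, kernel_size=15, padding=7)

# The same 15x15 receptive field built from a 3x3 depthwise conv, a 5x5 depthwise
# conv with dilation 3 (effective 13x13), and a 1x1 pointwise conv.
split = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
    nn.Conv2d(channels, channels, 5, padding=6, dilation=3, groups=channels),
    nn.Conv2d(channels, channels, 1),
)

print(param_count(dense))  # 518,448 parameters
print(param_count(split))  # 4,080 parameters for the same receptive field
```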
“…Vision Transformers treat input pixels as tokens and use self-attention operations to process interactions between these tokens. Inspired by the success of vision Transformers, many attempts have been made to employ Transformers for low-level vision tasks [10,14,15,46,63,68,71,75,78,79]. During the development of these models, the noise pattern used for training is often consistent with the testing one. The factor that determines its denoising performance is the fitting ability of the network, in other words, the ability of the network to overfit to the training noise.…”
Section: Related Work (mentioning, confidence: 99%)
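A minimal illustration of the "pixels as tokens" view mentioned in this statement, assuming PyTorch; the feature-map size and head count are arbitrary, and the projection and normalization layers of a full Transformer block are omitted.

```python
import torch
import torch.nn as nn

# Treat every pixel of a feature map as a token and let multi-head self-attention
# model the pairwise interactions between all tokens.
B, C, H, W = 1, 64, 16, 16
feats = torch.randn(B, C, H, W)

tokens = feats.flatten(2).transpose(1, 2)      # (B, H*W, C): one token per pixel
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)          # global token-to-token attention
out = out.transpose(1, 2).reshape(B, C, H, W)  # back to a feature map
print(out.shape)                               # torch.Size([1, 64, 16, 16])
```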