“…Vision Transformers treat input pixels as tokens and use self-attention operations to model the interactions between these tokens. Inspired by the success of Vision Transformers, many attempts have been made to employ Transformers for low-level vision tasks [10,14,15,46,63,68,71,75,78,79]. During the development of these models, the noise pattern used for training is typically kept consistent with that used for testing. Under this setting, the factor that determines a model's denoising performance is the fitting ability of the network, in other words, how well the network can overfit the training noise.…”
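
To make this conventional setting concrete, below is a minimal PyTorch-style sketch of a denoising pipeline in which the same noise model is used for both training and testing. All names here (`add_gaussian_noise`, `train_step`, `evaluate`, `TRAIN_SIGMA`) are hypothetical and chosen for illustration, not taken from any cited model; the point is only that when training and test noise match, evaluation mainly measures how well the network has fit that one noise distribution.

```python
import torch
import torch.nn as nn

# Fixed Gaussian noise level (sigma = 25/255), used identically at
# train and test time -- the "consistent noise pattern" in the text.
TRAIN_SIGMA = 25.0 / 255.0


def add_gaussian_noise(clean: torch.Tensor, sigma: float) -> torch.Tensor:
    """Synthesize a noisy observation y = x + n, with n ~ N(0, sigma^2)."""
    return clean + sigma * torch.randn_like(clean)


def train_step(model: nn.Module, clean_batch: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One supervised step: the network fits the mapping noisy -> clean."""
    noisy_batch = add_gaussian_noise(clean_batch, TRAIN_SIGMA)
    loss = nn.functional.l1_loss(model(noisy_batch), clean_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def evaluate(model: nn.Module, clean_test: torch.Tensor) -> float:
    """Testing reuses the SAME noise model as training, so the score
    chiefly reflects how well the network has fit (or overfit) this
    particular noise distribution."""
    noisy_test = add_gaussian_noise(clean_test, TRAIN_SIGMA)
    return nn.functional.mse_loss(model(noisy_test), clean_test).item()
```

Under this protocol, a network that overfits the training noise scores well, because the test-time corruption is statistically identical; its performance on an unseen noise type (a different sigma, or a non-Gaussian corruption) is left unexamined.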