Channel Attention Is All You Need for Video Frame Interpolation

Choi, Myungsub; Kim, Heewon; Han, Bohyung; Xu, Ning; Lee, Kyoung Mu

doi:10.1609/aaai.v34i07.6693

Cited by 239 publications

(150 citation statements)

References 31 publications

Supporting

Mentioning: 131

Contrasting


“…Regarding the frame interpolation as a local convolution over the two input frames, Niklaus et al [40], [41] utilized a CNN to learn a spatially-adaptive convolution kernel for each pixel. Choi et al [6] introduced a feature reshaping operation with Pixelshuffle [53] and a channel attention module for motion estimation. Lee et al [24] proposed adaptive collaboration of flows as a new warping module to deal with complex motions.…”

Section: Video Frame Interpolation (mentioning)

confidence: 99%
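The statement above credits the cited paper with a PixelShuffle-based feature reshaping and a channel attention module. The following is a minimal sketch of those two ideas in plain Python, not the cited paper's implementation. It assumes a space-to-depth rearrangement and a squeeze-and-excitation style gate; the function names `pixel_unshuffle` and `channel_attention` and the weight layouts `w1` and `w2` are hypothetical, chosen only for illustration.

```python
import math


def pixel_unshuffle(channel, r):
    """Rearrange one H x W channel into r*r channels of size (H/r) x (W/r).

    Assumption: a space-to-depth layout where output channel dy*r + dx holds
    the pixels at row offset dy and column offset dx inside each r x r cell.
    """
    h, w = len(channel), len(channel[0])
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    return [
        [[channel[y * r + dy][x * r + dx] for x in range(w // r)]
         for y in range(h // r)]
        for dy in range(r)
        for dx in range(r)
    ]


def channel_attention(features, w1, w2):
    """Rescale each channel by a learned gate in (0, 1).

    features: list of C channels, each an H x W list of lists.
    w1: R x C bottleneck weights, w2: C x R expansion weights (hypothetical).
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Bottleneck layer with ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # Expansion layer with sigmoid: one gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, features)]
    return scaled, gates
```

The reshaping trades spatial resolution for channels without discarding pixels, which gives the attention gate more channels to reweight.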

Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution

Xiang, Tian, Zhang, et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

174

210


In this paper, we address the space-time video super-resolution, which aims at generating a high-resolution (HR) slow-motion video from a low-resolution (LR) and low frame rate (LFR) video sequence. A naive method is to decompose it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). Nevertheless, temporal interpolation and spatial upscaling are intra-related in this problem. Two-stage approaches cannot fully make use of this natural property. Besides, state-of-the-art VFI or VSR deep networks usually have a large frame reconstruction module in order to obtain high-quality photo-realistic video frames, which makes the two-stage approaches have large models and thus be relatively time-consuming. To overcome the issues, we present a one-stage space-time video super-resolution framework, which can directly reconstruct an HR slow-motion video sequence from an input LR and LFR video. Instead of reconstructing missing LR intermediate frames as VFI models do, we temporally interpolate LR frame features of the missing LR frames capturing local temporal contexts by a feature temporal interpolation module. Extensive experiments on widely used benchmarks demonstrate that the proposed framework not only achieves better qualitative and quantitative performance on both clean and noisy LR frames but also is several times faster than recent state-of-the-art two-stage networks. The source code is released in https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020.


“…The attention mechanism in CNN understands and perceives images in a way that simulates humans and differentially weights global features to highlight key local features. In recent years, many channel attention [4], [64], [6], [65], [66], [67] mechanisms have been successfully applied to different computer vision tasks such as image classification, semantic segmentation, object detection, and image translation. Hu et al [2].…”

Section: Related Work (I) (mentioning)

confidence: 99%

Lightweight Channel Attention and Multiscale Feature Fusion Discrimination for Remote Sensing Scene Classification

et al. 2021


High-resolution remote sensing image scene classification has attracted widespread attention as a basic earth observation task. Remote sensing scene classification aims to assign specific semantic labels to remote sensing scene images to serve specified applications. Convolutional neural networks are widely used for remote sensing image classification due to their powerful feature extraction capabilities. However, the existing methods have not overcome the difficulties of large-scene remote sensing images of large intraclass diversity and high interclass similarity, resulting in low performance. Therefore, we propose a new remote sensing scene classification method that combines lightweight channel attention and multiscale feature fusion discrimination, called LmNet. First, ResNeXt is used as the backbone; second, a new lightweight channel attention mechanism is constructed to quickly and adaptively learn the salient features of important channels. Furthermore, we designed a multiscale feature fusion discrimination framework, which fully integrates shallow edge feature information and deep semantic information to enhance feature representation capabilities and uses multiscale features for joint discrimination. Finally, a cross-entropy loss function based on label smoothing is built to reduce the influence of interclass similarity on feature representation. In particular, our lightweight channel attention and multiscale feature fusion mechanism can be flexibly embedded in any advanced backbone as a functional module. The experimental results on three large-scale remote sensing scene classification datasets show that compared with the existing advanced methods, our proposed high-efficiency end-to-end scene classification method has reached state-of-the-art. 
Moreover, our method has a weaker dependence on labeled data and provides better generalization performance.

INDEX TERMS: Remote sensing scene classification, convolutional neural network, lightweight channel attention, multiscale feature fusion, label smoothing.
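The abstract above mentions a cross-entropy loss based on label smoothing. As a rough illustration of that general technique (not the LmNet authors' code), the sketch below assumes the common uniform-smoothing form, in which the true class keeps weight 1 - eps and the remaining eps is spread evenly over all K classes. The function name `smoothed_cross_entropy` and the default `eps=0.1` are assumptions made for this example.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    logits: list of K raw class scores; target: index of the true class.
    Assumption: uniform smoothing, q_i = eps / K + (1 - eps) * [i == target].
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    k = len(logits)
    q = [eps / k + (1.0 - eps if i == target else 0.0) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))
```

With `eps=0` this reduces to ordinary cross-entropy; a larger `eps` penalizes overconfident predictions, which is one plausible reading of how smoothing could soften the effect of interclass similarity.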


“…We compare our method with 7 existing methods, including SepConv [17], CtxSyn [31], SoftSplat [32], DAIN [13], BMBC [33], CAIN [34] and RRIN [6] in Table 5. The values marked in red mean best performance, while blue represents second best.…”

Section: Quantitative Evaluation (mentioning)

confidence: 99%

DRVI: Dual Refinement for Video Interpolation

Zhou, Basu, 2021

IEEE Access


The quality of a video clip is considered to be poor if the resolution or the frame rate is low. Video interpolation is thus introduced to enhance video quality and provide a better viewing experience to users. However, there are still some challenges, like the blur caused by motion changes. In this paper, we introduce a dual refinement technique for video interpolation (DRVI). It has three main steps, namely flow refinement, frame synthesis, and Haar refinement. The flow refinement can generate accurate bidirectional flows, which are more suitable for frame interpolation tasks. The Haar refinement uses the Discrete Wavelet Transform (DWT). It can preserve information in different frequency domains and also speed up the learning process. We also add an arbitrary time approximation module to allow multi-frame generation. The number of learnable parameters in our model is much less than existing methods; still, it has excellent performance. Our method is trained on Vimeo90K [1] and tested on three well-known datasets to demonstrate its effectiveness.
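The abstract above states that the Haar refinement uses the Discrete Wavelet Transform to preserve information in different frequency bands. The sketch below shows a single-level, one-dimensional orthonormal Haar transform and its inverse, as a simplified illustration only. DRVI operates on images, so a 2D transform would apply this along rows and then columns; the function names `haar_dwt` and `haar_idwt` are assumptions for this example.

```python
import math


def haar_dwt(signal):
    """Single-level orthonormal Haar DWT of an even-length sequence.

    Returns (approx, detail): low-frequency pairwise averages and
    high-frequency pairwise differences, each scaled by 1/sqrt(2).
    """
    assert len(signal) % 2 == 0, "signal length must be even"
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

Because the transform is invertible, splitting a frame into approximation and detail bands loses nothing, which is consistent with the abstract's claim that information in different frequency domains is preserved.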
