Local feature‐based video captioning with multiple classifier and CARU‐attention

Im, Sio‐Kei; Chan, Ka‐Hou

doi:10.1049/ipr2.13096

Cited by 1 publication

(1 citation statement)

References 68 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Niklaus et al [23] propose adaptive separable convolution, replacing the original 2D convolution kernel with a pair of 1D convolution kernels, which reduces the number of operations and the number of parameters of the model. In video processing tasks, deformable convolution (DConv) has been shown to enhance the flexibility of network encoders [26]. Inspired by DConv [27], Lee et al [7] propose AdaCoF, a model with a learned deformable spatial convolution kernel, which solves the problem of limited degrees of freedom for ordinary convolution kernels.…”

Section: Video Frame Interpolationmentioning

confidence: 99%

Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation

Ning,

Cai,

et al. 2024

Electronics

View full text Add to dashboard Cite

Traditional video frame interpolation methods based on deep convolutional neural networks face challenges in handling large motions. Their performance is limited by the fact that convolutional operations cannot directly integrate the rich temporal and spatial information of inter-frame pixels, and these methods rely heavily on additional inputs such as optical flow to model motion. To address this issue, we develop a novel framework for video frame interpolation that uses Transformer to efficiently model the long-range similarity of inter-frame pixels. Furthermore, to effectively aggregate spatio-temporal features, we design a novel attention mechanism divided into temporal attention and spatial attention. Specifically, spatial attention is used to aggregate intra-frame information, integrating both attention and convolution paradigms through the simple mapping approach. Temporal attention is used to model the similarity of pixels on the timeline. This design achieves parallel processing of these two types of information without extra computational cost, aggregating information in the space–time dimension. In addition, we introduce a context extraction network and multi-scale prediction frame synthesis network to further optimize the performance of the Transformer. Our method and state-of-the-art methods are extensively quantitatively and qualitatively experimented on various benchmark datasets. On the Vimeo90K and UCF101 datasets, our model achieves improvements of 0.09 dB and 0.01 dB in the PSNR metrics over UPR-Net-large, respectively. On the Vimeo90K dataset, our model outperforms FLAVR by 0.07 dB, with only 40.56% of its parameters. The qualitative results show that for complex and large-motion scenes, our method generates sharper and more realistic edges and details.

show abstract