2024
DOI: 10.1049/ipr2.13096
|View full text |Cite
|
Sign up to set email alerts
|

Local feature‐based video captioning with multiple classifier and CARU‐attention

Sio‐Kei Im,
Ka‐Hou Chan

Abstract: Video captioning aims to identify multiple objects and their behaviours in a video event and generate captions for the current scene. This task aims to generate a detailed description of the current video in real‐time using natural language, which requires deep learning to analyze and determine the relationships between interesting objects in the frame sequence. In practice, existing methods typically involve detecting objects in the frame sequence and then generating captions based on features extracted throu… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1

Citation Types

0
0
0

Year Published

2024
2024
2024
2024

Publication Types

Select...
1

Relationship

0
1

Authors

Journals

citations
Cited by 1 publication
(1 citation statement)
references
References 68 publications
0
0
0
Order By: Relevance
“…Niklaus et al [23] propose adaptive separable convolution, replacing the original 2D convolution kernel with a pair of 1D convolution kernels, which reduces the number of operations and the number of parameters of the model. In video processing tasks, deformable convolution (DConv) has been shown to enhance the flexibility of network encoders [26]. Inspired by DConv [27], Lee et al [7] propose AdaCoF, a model with a learned deformable spatial convolution kernel, which solves the problem of limited degrees of freedom for ordinary convolution kernels.…”
Section: Video Frame Interpolationmentioning
confidence: 99%
“…Niklaus et al [23] propose adaptive separable convolution, replacing the original 2D convolution kernel with a pair of 1D convolution kernels, which reduces the number of operations and the number of parameters of the model. In video processing tasks, deformable convolution (DConv) has been shown to enhance the flexibility of network encoders [26]. Inspired by DConv [27], Lee et al [7] propose AdaCoF, a model with a learned deformable spatial convolution kernel, which solves the problem of limited degrees of freedom for ordinary convolution kernels.…”
Section: Video Frame Interpolationmentioning
confidence: 99%