2020
DOI: 10.3390/electronics9071162

Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer

Abstract: Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information included in a video because they lack the ability to remember information changes over time. To address this problem, we propose a novel context-aware video captioning model that generates natural language descriptions based on improved video context understanding. We introduce an external memory, the differentiable neural computer (DNC), to improve video …

Cited by 9 publications (2 citation statements)
References 21 publications
“…Silvio et al. [5] added Inception-ResNet-V2 to extract the motion features of the video and then used a soft-attention LSTM (SA-LSTM) as the decoder. Kim et al. [6] used a CNN to extract 2D features of video frames and then fed the extracted features to a differentiable neural computer (DNC) to learn contextual information. The frame features are input to the DNC in chronological order, so its memory can store contextual information and fully exploit it when generating captions.…”
Section: Video Caption
confidence: 99%
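The citation above describes frame features being written into an external memory in chronological order and later read back to provide context for caption generation. A minimal NumPy sketch of that idea follows; it is an illustration, not the authors' model — the memory size, the least-used write rule, and the cosine-softmax read are all simplifying assumptions standing in for the DNC's full addressing machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(key, memory):
    # content-based addressing: cosine similarity of a key vs. each memory slot
    return (memory @ key) / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical sizes: 8 memory slots, 16-dim features, 10 video frames
slots, dim, frames = 8, 16, 10
memory = np.zeros((slots, dim))
features = rng.standard_normal((frames, dim))  # stand-in for CNN frame features

usage = np.zeros(slots)
for t, feat in enumerate(features):   # frames arrive in chronological order
    w = int(np.argmin(usage))         # crude least-recently-written slot choice
    memory[w] = feat                  # write the frame feature into memory
    usage[w] = t + 1

# read: a query (e.g. from the caption decoder) attends over stored context
query = features[3] + 0.1 * rng.standard_normal(dim)
weights = softmax(cosine(query, memory))
read_vec = weights @ memory           # context vector for caption generation
```

Because frame 3 was written to slot 3 and never overwritten, a query near that feature reads back mostly from that slot — the "remember information changes over time" behavior the passage attributes to the DNC memory.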
“…A concatenated vector is used as the input of the controller. Two output vectors of the controller are generated: 1) a controller output vector and 2) an interface vector. The controller output vector is equal to the output of a hidden layer in the deep learning model [26]. The interface vector determines the memory address accessed at time-step t to perform the read and write operations [27].…”
Section: Structural Overview Of Differentiable Neural Computer
confidence: 99%
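The citation above says a DNC controller emits two vectors per step: an output vector (like an ordinary hidden-layer output) and an interface vector that parameterizes memory access. A minimal sketch of that split, assuming arbitrary illustrative sizes and plain linear projections in place of the real controller network:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical sizes: hidden state H, output vector O, interface vector I
H, O, I = 32, 20, 12
W_out = 0.1 * rng.standard_normal((O, H))  # projects to controller output vector
W_ifc = 0.1 * rng.standard_normal((I, H))  # projects to interface vector

def controller_step(hidden):
    """Split one controller step into the two vectors the passage describes."""
    v = W_out @ hidden   # controller output vector (hidden-layer-style output)
    xi = W_ifc @ hidden  # interface vector: drives memory addressing at step t
    return v, xi

hidden = rng.standard_normal(H)
v, xi = controller_step(hidden)
# In a full DNC, xi would be sliced further into read keys, a write key,
# gates, and strengths that select the memory address to read/write.
```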