2020
DOI: 10.3390/electronics9071162

Context Aware Video Caption Generation with Consecutive Differentiable Neural Computer

Abstract: Recent video captioning models aim at describing all events in a long video. However, their event descriptions do not fully exploit the contextual information included in a video because they lack the ability to remember information changes over time. To address this problem, we propose a novel context-aware video captioning model that generates natural language descriptions based on improved video context understanding. We introduce an external memory, the differentiable neural computer (DNC), to improve video …

Cited by 9 publications (2 citation statements)
References 21 publications
“…Silvio et al. [5] added Inception-ResNet-V2 to extract the motion features of the video and then used a soft-attention LSTM (SA-LSTM) as the decoder. Kim et al. [6] used a CNN to extract 2D features of video frames and then fed the extracted features to a differentiable neural computer (DNC) to learn contextual information. The frame features are input to the DNC in chronological order, so its memory can store contextual information and fully exploit it when generating captions.…”
Section: Video Caption
confidence: 99%
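The citation above describes frame features being written into an external memory in chronological order and later read back to provide context for caption generation. A minimal NumPy sketch of that idea follows; it is an illustration, not the authors' model — the memory size, the least-used write rule, and the cosine-softmax read are all simplifying assumptions standing in for the DNC's full addressing machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(key, memory):
    # content-based addressing: cosine similarity of a key vs. each memory slot
    return (memory @ key) / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical sizes: 8 memory slots, 16-dim features, 10 video frames
slots, dim, frames = 8, 16, 10
memory = np.zeros((slots, dim))
features = rng.standard_normal((frames, dim))  # stand-in for CNN frame features

usage = np.zeros(slots)
for t, feat in enumerate(features):   # frames arrive in chronological order
    w = int(np.argmin(usage))         # crude least-recently-written slot choice
    memory[w] = feat                  # write the frame feature into memory
    usage[w] = t + 1

# read: a query (e.g. from the caption decoder) attends over stored context
query = features[3] + 0.1 * rng.standard_normal(dim)
weights = softmax(cosine(query, memory))
read_vec = weights @ memory           # context vector for caption generation
```

Because frame 3 was written to slot 3 and never overwritten, a query near that feature reads back mostly from that slot — the "remember information changes over time" behavior the passage attributes to the DNC memory.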
“…A concatenated vector is used as the input of the controller. Two output vectors of the controller are generated: 1) a controller output vector and 2) an interface vector. The controller output vector is equal to the output of a hidden layer in the deep learning model [26]. The interface vector determines the memory address accessed at time-step t to perform the read and write operations [27].…”
Section: Structural Overview Of Differentiable Neural Computer
confidence: 99%
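The citation above says a DNC controller emits two vectors per step: an output vector (like an ordinary hidden-layer output) and an interface vector that parameterizes memory access. A minimal sketch of that split, assuming arbitrary illustrative sizes and plain linear projections in place of the real controller network:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical sizes: hidden state H, output vector O, interface vector I
H, O, I = 32, 20, 12
W_out = 0.1 * rng.standard_normal((O, H))  # projects to controller output vector
W_ifc = 0.1 * rng.standard_normal((I, H))  # projects to interface vector

def controller_step(hidden):
    """Split one controller step into the two vectors the passage describes."""
    v = W_out @ hidden   # controller output vector (hidden-layer-style output)
    xi = W_ifc @ hidden  # interface vector: drives memory addressing at step t
    return v, xi

hidden = rng.standard_normal(H)
v, xi = controller_step(hidden)
# In a full DNC, xi would be sliced further into read keys, a write key,
# gates, and strengths that select the memory address to read/write.
```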