Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Video captioning aims to automatically generate natural language sentences describing the content of a video. Although encoder-decoder-based models have achieved promising progress, it is still very challenging to effectively model the linguistic behavior of humans in generating video captions. In this paper, we propose a novel video captioning model by learning from gLobal sEntence and looking AheaD, LEAD for short. Specifically, LEAD consists of two modules: a Vision Module (VM) and a Language Module (LM). The VM is a novel attention network that maps visual features to a high-level language space and models entire sentences explicitly. The LM not only makes effective use of the previously generated sequence when generating the current word, but also looks ahead at the future word. Therefore, based on VM and LM, LEAD can obtain global sentence information and future word information, making video captioning more like a fill-in-the-blank task than word-by-word sentence generation. In addition, we propose an autonomous strategy and a multi-stage training scheme to optimize the model, which can mitigate the problem of information leakage. Extensive experiments show that LEAD outperforms some state-of-the-art methods on MSR-VTT, MSVD and VATEX, demonstrating the effectiveness of the proposed approach in video captioning. We also release the code of our proposed model publicly.

Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting

Fu, Fang, Chen et al. 2024

ACM Trans. Multimedia Comput. Commun. Appl.

Automatic live video commenting is attracting increasing attention due to its significance in narration generation, topic explanation, etc. However, current methods do not consider the diverse sentiments of the generated comments. Sentimental factors are critical in interactive commenting, yet they have received little research attention so far. Thus, in this paper, we propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to achieve diverse video commenting with multiple sentiments and multiple semantics. Specifically, our sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance; its output is then fused with cross-modal features to generate live video comments. Furthermore, a batch attention module is proposed to alleviate the problem of missing sentimental samples caused by data imbalance, which is common in live videos because video popularity varies. Extensive experiments on the Livebot and VideoIC datasets demonstrate that the proposed So-TVAE outperforms state-of-the-art methods in terms of the quality and diversity of generated comments. Related code is available at https://github.com/fufy1024/So-TVAE.

Cited by 4 publications

References 61 publications

A Global-Local Contrastive Learning Framework for Video Captioning

Video Captioning by Learning from Global Sentence and Looking Ahead

Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting
