Shan-Shan Dong scite author profile

Shan-Shan Dong

3 Publications

0 Citation Statements Received

0 Citation Statements Given

Affiliations

Shandong University

Publications

Semantic Enhanced Video Captioning with Multi-feature Fusion

Niu, Dong, Chen, et al. 2023

ACM Trans. Multimedia Comput. Commun. Appl.

Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream for this task and achieved competitive results on public datasets. Usually, these methods leverage different types of features to generate sentences, e.g. semantic information, 2D or 3D features. However, some methods only treat semantic information as a complement of visual representations, and cannot fully exploit it; some of them ignore the relationship between different types of features. In addition, most of them select multiple frames of a video with an equally spaced sampling scheme, resulting in much redundant information. To address these issues, we present a novel video captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion, SEMF for short. It optimizes the use of different types of features from three aspects. First of all, a semantic encoder is designed to enhance meaningful semantic features through a semantic dictionary to boost performance. Secondly, a discrete selection module pays attention to important features and obtains different contexts at different steps to reduce feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features to provide more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF is able to achieve state-of-the-art results.

Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Dong, Niu, Luo, et al. 2023

ACM Trans. Multimedia Comput. Commun. Appl.

Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information are important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between the visual feature space and the semantic embedding space, making the task much more challenging. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme to capture the pairwise interactions between multiple visual modalities and fuse multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with the temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding stage. In this way, the semantic gap between the visual and semantic spaces is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art results in terms of four evaluation metrics.

A multi-layer memory sharing network for video captioning

Niu, Dong, Chen, et al. 2023

Pattern Recognition
