Cross-modal video moment retrieval based on visual-textual relationship alignment

Chen, Joya; Du, Hao; Wu, Yufei; Xu, Tong Bill; Chen, Enhong

doi:10.1360/ssi-2019-0292

Cited by 8 publications

References 44 publications

(79 reference statements)

Detecting Highlighted Video Clips Through Emotion-Enhanced Audio-Visual Cues

et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

Self Cite

Recent years have witnessed growing research interest in video highlight detection. Existing studies mainly focus on detecting highlights in user-generated videos with simple topics, based on visual content. However, relying solely on visual features limits the ability of conventional methods to capture highlights in videos with more complicated semantics, such as movies. Therefore, we propose to mine the emotional information in video sound to enhance highlight detection. Specifically, we design a novel emotion-enhanced framework with multi-stage fusion to detect highlights in complex videos. Along this line, we first extract multi-grained features from the audio waveforms. Then, a tailor-designed intra-modal fusion is applied to the audio features to obtain an emotional representation. Furthermore, a cross-modal fusion is developed to generate a comprehensive representation of each clip by merging the audio emotional representation with visual features. This representation is then used to predict the highlight probability. Finally, extensive experiments on real-world datasets demonstrate the effectiveness of our method.
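
The multi-stage fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion operators, feature dimensions, or classifier, so everything below is an assumption made for illustration: the function names are hypothetical, intra-modal fusion is approximated by a softmax-weighted sum over granularities, cross-modal fusion by plain concatenation, and the highlight probability by a logistic layer with hand-picked weights.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def intra_modal_fusion(audio_feats: Sequence[Vector]) -> Vector:
    """Fuse audio feature vectors, one per granularity, into one vector.

    Assumption: each granularity is scored by the mean of its vector and
    the vectors are combined with softmax weights. The paper's actual
    fusion is not described in the abstract.
    """
    scores = [sum(v) / len(v) for v in audio_feats]
    weights = softmax(scores)
    dim = len(audio_feats[0])
    return [sum(w * v[i] for w, v in zip(weights, audio_feats)) for i in range(dim)]


def cross_modal_fusion(audio_emotion: Vector, visual: Vector) -> Vector:
    """Merge audio and visual features by concatenation (an assumption)."""
    return list(audio_emotion) + list(visual)


def highlight_probability(clip_repr: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Map a clip representation to a probability with a logistic layer."""
    z = sum(w * x for w, x in zip(weights, clip_repr)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy inputs: two audio granularities (2-dim each) and a 3-dim visual feature.
audio = [[0.2, 0.4], [0.6, 0.8]]
visual = [0.1, 0.3, 0.5]
emotion = intra_modal_fusion(audio)
clip = cross_modal_fusion(emotion, visual)
prob = highlight_probability(clip, [0.5] * len(clip))
```

In a trained model the scoring function, fusion weights, and classifier parameters would be learned from data; the fixed values here only show how the three stages connect.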
