2021
DOI: 10.48550/arxiv.2106.11411
Preprint

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Cited by 1 publication (1 citation statement)
References 27 publications
“…Audio-visual (AV) multi-modal approaches have been applied widely in the speech community [6][7][8][9][10][11][12]. The visual information obtained by analyzing lip shapes or facial expressions in the visual modality is more robust than the audio information in complex scenarios.…”
Section: Introduction
Confidence: 99%