Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Hou, Yuanbo; Yu, Zhesong; Liang, Xia; Du, Xingjian; Zhu, Beien; Ma, Zejun; Botteldooren, Dick

doi:10.21437/interspeech.2021-37

Cited by 3 publications

(1 citation statement)

References 23 publications

“…Audio event classification (AEC) performs multi-label classification on an audio clip and aims to identify target events in the audio clip. ASC and AEC-related systems are used in various applications such as medical surveillance [1] and video analysis [2].…”

Section: Introduction (mentioning)

confidence: 99%

Cooperative Scene-Event Modelling for Acoustic Scene Classification

Hou, Kang, Mitchell et al. 2024

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, but not their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework to automatically model the intricate scene-event relation by an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces the confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space, and reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial, even if the information of AE is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated from four models based on either Transformer or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive results on ASC as compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9% and 97.2%, respectively.
