Interspeech 2021
DOI: 10.21437/interspeech.2021-37

Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams

Abstract: Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information from the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, …

Cited by 3 publications (1 citation statement)
References 23 publications
“…Audio event classification (AEC) performs multi-label classification on an audio clip and aims to identify target events in the audio clip. ASC and AEC-related systems are used in various applications such as medical surveillance [1] and video analysis [2].…”
Section: Introduction (mentioning)
confidence: 99%