2021
DOI: 10.48550/arxiv.2106.11411
Preprint

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Cited by 1 publication (1 citation statement)
References 27 publications
“…Audio-visual (AV) multi-modal approaches have been applied widely in the speech community [6][7][8][9][10][11][12]. The visual information obtained by analyzing lip shapes or facial expressions in the visual modality is more robust than the audio information in complex scenarios.…”
Section: Introduction
Confidence: 99%