2023
DOI: 10.1109/access.2023.3267668

Efficient Audiovisual Fusion for Active Speaker Detection

Abstract: Active speaker detection (ASD) refers to detecting the speaking person among the visible human instances in a video. Existing methods widely employ the same audiovisual fusion approach: concatenation. Although this fusion approach is often argued to enhance performance, it must be noted that the two feature modalities do not play equal roles. Concatenation forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that since the concatenation doubles the fused feat…
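As a minimal sketch of the concatenation fusion the abstract refers to, the snippet below joins an audio and a visual feature sequence along the feature axis. The function name `concat_fusion`, the 25-frame length, and the 128-dimensional embeddings are illustrative assumptions, not values from the paper. It only shows that the fused vector's length is the sum of the two input lengths, so it doubles when both modalities have the same size.

```python
import numpy as np

def concat_fusion(audio_feat, visual_feat):
    """Fuse per-frame audio and visual features by concatenating on the last axis."""
    return np.concatenate([audio_feat, visual_feat], axis=-1)

# Hypothetical inputs: 25 frames, 128-dimensional embedding per modality (assumed sizes).
rng = np.random.default_rng(0)
audio = rng.standard_normal((25, 128))
visual = rng.standard_normal((25, 128))

fused = concat_fusion(audio, visual)
print(fused.shape)  # (25, 256)
```

Each half of the fused vector is an untouched copy of one modality, so any interaction between audio and visual features has to be learned by whatever network consumes `fused`.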

References 70 publications