2022
DOI: 10.1007/978-3-031-19836-6_8

End-to-End Active Speaker Detection

Cited by 13 publications (12 citation statements)
References 39 publications
“…Many active speaker detection methods use 3D convolutional neural networks as visual feature encoders [3,18,44,47]. Although 3D convolution can effectively extract the spatio-temporal information of face sequences, it has a large number of model parameters and the computational cost is very expensive.…”
Section: Visual Feature Encoder
mentioning
confidence: 99%
“…For each method, we copy the results from its original paper or calculate from the open-source code. Some studies [3,9,43,46] are not yet open source, so we only estimate the parameters and FLOPs of their audio-visual encoder. The E2E indicates end-to-end.…”
Section: Loss Function
mentioning
confidence: 99%
“…Nowadays, millions of videos are produced every day, and high demand arises for automatic video processing and analysis. To this end, various tasks have emerged, for example, action recognition [19], active speaker detection [2], video-language grounding [41], temporal action localization [26,42]. Among those tasks, temporal action detection in untrimmed videos, in particular, is one of the fundamental yet challenging tasks.…”
mentioning
confidence: 99%