End-to-End Active Speaker Detection

Alcázar, Juan León; Cordes, Moritz; Zhao, Chen; Ghanem, Bernard

doi:10.1007/978-3-031-19836-6_8

Cited by 13 publications

(12 citation statements)

References 39 publications

“…Many active speaker detection methods use 3D convolutional neural networks as visual feature encoders [3,18,44,47]. Although 3D convolution can effectively extract the spatio-temporal information of face sequences, it has a large number of model parameters and the computational cost is very expensive.…”

Section: Visual Feature Encoder (mentioning)

confidence: 99%

“…For each method, we copy the results from its original paper or calculate from the open-source code. Some studies [3,9,43,46] are not yet open source, so we only estimate the parameters and FLOPs of their audio-visual encoder. The E2E indicates end-to-end.…”

Section: Loss Function (mentioning)

confidence: 99%

“…We compare the performance of our framework with other active speaker detection methods [1,2,3,9,18,22,36,43,44,46] on the AVA-ActiveSpeaker validation set, and summarize these results in Tab. 1.…”

Section: Comparison With the State-of-the-art (mentioning)

confidence: 99%

“…With the release of the first large-scale active speaker detection dataset AVA-ActiveSpeaker [32], researchers have made a series of significant progress in this field [15,36,37,39,46] following the rapid development of deep learning for audio-visual tasks [21]. These studies improve the performance of active speaker detection by inputting face sequences of multiple candidates at the same time [1,2,46], extracting visual features with 3D convolutional neural networks [3,18,47], modeling cross-modal information with complex attention modules [9,43,44], etc, which brings higher memory and computation requirements. Therefore, existing works are difficult to be applied in scenarios requiring real-time processing with limited memory and computational resources, such as automatic video editing and live television.…”

Section: Introduction (mentioning)

confidence: 99%

A Light Weight Model for Active Speaker Detection

Liao, Duan, Kanghui, et al. 2023

Preprint

Active speaker detection is a challenging task in audiovisual scenario understanding, which aims to detect who is speaking in one or more speakers scenarios. This task has received extensive attention as it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. The existing studies try to improve performance by inputting multiple candidate information and designing complex models. Although these methods achieved outstanding performance, their high consumption of memory and computational power make them difficult to be applied in resource-limited scenarios. Therefore, we construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23×) and FLOPs (0.6G vs. 2.6G, about 4×). In addition, our framework also performs well on the Columbia dataset showing good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
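The reduction factors quoted in this abstract can be checked by recomputing the ratios from the reported figures. The following is a minimal sketch; the helper name `reduction_factor` is hypothetical, and only the four input numbers come from the abstract.

```python
def reduction_factor(baseline: float, proposed: float) -> float:
    """Return how many times smaller `proposed` is than `baseline`."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return baseline / proposed


# Figures as quoted in the abstract: parameters in millions, FLOPs in G.
param_factor = reduction_factor(22.5, 1.0)
flop_factor = reduction_factor(2.6, 0.6)

print(f"parameters: {param_factor:.1f}x, FLOPs: {flop_factor:.1f}x")
```

The ratios come out at 22.5 and roughly 4.3, which is consistent with the rounded "about 23×" and "about 4×" in the abstract.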

“…Nowadays, millions of videos are produced every day, and high demand arises for automatic video processing and analysis. To this end, various tasks have emerged, for example, action recognition [19], active speaker detection [2], video-language grounding [41], temporal action localization [26,42]. Among those tasks, temporal action detection in untrimmed videos, in particular, is one of the fundamental yet challenging tasks.…”

mentioning

confidence: 99%

SegTAD: Precise Temporal Action Detection via Semantic Segmentation

Zhao, Ramazanova, et al. 2023

Lecture Notes in Computer Science

Self Cite

Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it as a proposal generation-classification problem. However, there are two caveats with this paradigm. First, proposals are not equipped with annotated labels, which have to be empirically compiled, thus the information in the annotations is not necessarily precisely employed in the model training process. Second, there are large variations in the temporal scale of actions, and neglecting this fact may lead to deficient representation in the video features. To address these issues and precisely model TAD, we formulate the task in a novel perspective of semantic segmentation. Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free. We take advantage of them to provide precise supervision so as to mitigate the impact induced by the imprecise proposal labels. We propose a unified framework SegTAD composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN). We evaluate SegTAD on two important large-scale datasets for action detection and it shows competitive performance on both datasets.
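The conversion from detection annotations to per-frame segmentation labels mentioned in this abstract can be illustrated as follows. This is a minimal sketch and not SegTAD's actual code: it assumes each annotation is a hypothetical `(start_frame, end_frame, class_id)` tuple with an exclusive end, and that label 0 is reserved for background.

```python
from typing import List, Tuple


def segments_to_frame_labels(
    segments: List[Tuple[int, int, int]],
    num_frames: int,
    background: int = 0,
) -> List[int]:
    """Expand (start, end, class_id) segments into one label per frame."""
    labels = [background] * num_frames
    for start, end, class_id in segments:
        # Clip to the video length so out-of-range annotations are ignored.
        start = max(0, start)
        end = min(num_frames, end)
        for t in range(start, end):
            labels[t] = class_id
    return labels


# Two annotated actions in a 10-frame video (assumed toy data).
labels = segments_to_frame_labels([(2, 5, 1), (7, 9, 2)], num_frames=10)
print(labels)
```

Because the timeline is one-dimensional, every frame receives a label directly from the existing segment boundaries, so no extra annotation effort is needed.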

AS-Net: active speaker detection using deep audio-visual attention

Radman, Laaksonen 2024

Multimed Tools Appl

Active Speaker Detection (ASD) aims at identifying the active speaker among multiple speakers in a video scene. Previous ASD models often seek audio and visual features from long video clips with a complex 3D Convolutional Neural Network (CNN) architecture. However, models based on 3D CNNs can generate discriminative spatial-temporal features, but this comes at the expense of computational complexity, and they frequently face challenges in detecting active speakers in short video clips. This work proposes the Active Speaker Network (AS-Net) model, a simple yet effective ASD method tailored for detecting active speakers in relatively short video clips without relying on 3D CNNs. Instead, it incorporates the Temporal Shift Module (TSM) into 2D CNNs, facilitating the extraction of dense temporal visual features without the need for additional computations. Moreover, self-attention and cross-attention schemes are introduced to enhance long-term temporal audio-visual synchronization, thereby improving ASD performance. Experimental results demonstrate that AS-Net outperforms state-of-the-art 2D CNN-based methods on the AVA-ActiveSpeaker dataset and remains competitive with the methods utilizing more complex architectures.
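The temporal shift idea referred to in this abstract can be sketched in a few lines. This is an illustrative pure-Python version, not AS-Net's implementation: it assumes per-frame features stored as a list of `T` vectors of `C` values, and the `fold_div` parameter (the fraction of channels shifted in each direction) is an assumed default.

```python
from typing import List


def temporal_shift(frames: List[List[float]], fold_div: int = 8) -> List[List[float]]:
    """Shift one channel group to the previous frame, one to the next,
    and leave the rest in place. Vacated positions are zero-filled."""
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous frame
            else:
                src = t      # remaining channels: unchanged
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# Three frames with four channels each (assumed toy data); fold = 1.
shifted = temporal_shift([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], fold_div=4)
print(shifted)
```

The operation only moves values between neighbouring frames and involves no multiplications, which is why a 2D CNN applied afterwards can mix temporal information without extra arithmetic.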
