Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network

Jung, Jee-weon; Heo, Hee-Soo; Kwon, Youngki; Chung, Joon Son; Lee, Bong-Jin

doi:10.21437/interspeech.2021-149

Cited by 12 publications

(12 citation statements)

References 0 publications

“…The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [35]-[38], clustering methods [11], [13], [39], and overlap assignment methods [22], [40], [41]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”

Section: A Offline Diarization (mentioning)

confidence: 99%

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Horiguchi, Watanabe, García, et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.

A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multilabel classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally, the results of each block are clustered on the basis of the similarity between locally calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a blockwise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
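The block-then-cluster idea in the abstract above can be illustrated with a toy sketch. This is not the EEND-GLA implementation: the greedy cosine-similarity assignment, the `threshold` value, the rule that two attractors from the same block never share a global speaker, and the function names are all assumptions made for illustration, and the attractors are hand-written vectors rather than network outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_local_attractors(blocks, threshold=0.8):
    """Assign a global speaker id to every local attractor.

    blocks: list of blocks, each a list of attractor vectors (toy data).
    Greedy stand-in for the clustering step: an attractor joins the most
    similar existing global speaker if the similarity reaches `threshold`,
    otherwise it starts a new global speaker.
    Assumption: attractors within one block are distinct speakers.
    """
    representatives = []  # first attractor seen for each global speaker
    labels = []
    for block in blocks:
        used = set()  # global ids already taken inside this block
        block_labels = []
        for attractor in block:
            best_id, best_sim = None, threshold
            for gid, rep in enumerate(representatives):
                if gid in used:
                    continue
                sim = cosine(attractor, rep)
                if sim >= best_sim:
                    best_id, best_sim = gid, sim
            if best_id is None:
                best_id = len(representatives)
                representatives.append(attractor)
            used.add(best_id)
            block_labels.append(best_id)
        labels.append(block_labels)
    return labels, len(representatives)


# Three blocks, each limited to two local speakers.
blocks = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.1, 0.9], [0.1, 0.9, 0.0]],
]
labels, n_speakers = cluster_local_attractors(blocks)
print(labels, n_speakers)
```

In this toy input every block contains at most two attractors, yet three global speakers are recovered across the recording, which mirrors the abstract's point that the total speaker count can exceed the per-block limit.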

“…Usually the output frame rate is set to 10 ms, and the output is binary with 1 meaning the presence of overlapped speech and 0 otherwise. Some works combined an overlapped speech detector with a voice activity detector or a speaker counter, thus using more than 2 classes [6,7].…”

Section: Overlapped Speech Detection (mentioning)

confidence: 99%
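As a rough illustration of the frame-level targets described in the statement above, the sketch below turns speaker segments into per-frame labels at a 10 ms step. The segment format, the function name, and the three-class mapping (0 = no speech, 1 = single speaker, 2 = overlap) are assumptions for illustration; the cited works may define their classes differently.

```python
def frame_speaker_counts(segments, duration, frame_shift=0.01):
    """Count distinct active speakers in each frame.

    segments: list of (speaker, start_sec, end_sec) tuples (hypothetical format).
    A frame is attributed to a segment when its index lies in
    [round(start / frame_shift), round(end / frame_shift)).
    """
    n_frames = int(round(duration / frame_shift))
    active = [set() for _ in range(n_frames)]
    for speaker, start, end in segments:
        first = max(0, int(round(start / frame_shift)))
        last = min(n_frames, int(round(end / frame_shift)))
        for i in range(first, last):
            active[i].add(speaker)
    return [len(s) for s in active]


# Two speakers whose turns overlap between 0.03 s and 0.05 s.
segments = [("A", 0.00, 0.05), ("B", 0.03, 0.08)]
counts = frame_speaker_counts(segments, duration=0.10)

# Binary OSD target: 1 when at least two speakers are active.
binary = [1 if c >= 2 else 0 for c in counts]

# Three-class target (assumed mapping): 0 = no speech, 1 = one speaker, 2 = overlap.
three_class = [min(c, 2) for c in counts]

print(binary)
print(three_class)
```

The three-class target carries the voice activity decision and the overlap decision in a single label, which is one way to read "combining an overlapped speech detector with a voice activity detector".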

“…Following this trend, OSD systems based on convolutional layers are becoming frequent [13], granting results as good as the one obtained with recurrent layers, with smaller training duration. Some OSD systems combine recurrent and convolutional layers to improve performances [6]. Finally, the Temporal Convoluted Network (TCN) originally developed for sequence modelling [14] have been adapted for speaker counting in overlapped speech [15].…”

Section: Overlapped Speech Detection (mentioning)

confidence: 99%

Overlapped speech and gender detection with WavLM pre-trained features

Lebourdais, Tahon, Laurent, et al. 2022

Interspeech 2022

This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to speakers' gender, and to identify when at least two speakers speak at the same time. We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data, to build overlapped speech detection (OSD) and gender detection (GD) systems. In this study, we use two different corpora. The DIHARD III corpus is well suited to the OSD task but lacks gender information. The ALLIES corpus fits the project's application context. Our best OSD system is a Temporal Convolutional Network (TCN) with WavLM pre-trained features as input, which reaches a new state-of-the-art F1-score performance on DIHARD. A neural GD is trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data, and obtains an accuracy of 94.9%. This work opens new perspectives for human science researchers regarding the differences of representation between women and men in French media.

“…Speaker diarization 1) Offline diarization: The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”

Section: Related Work (mentioning)

confidence: 99%