Interspeech 2021
DOI: 10.21437/interspeech.2021-149

Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network

Cited by 12 publications (12 citation statements)
“…The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [35]-[38], clustering methods [11], [13], [39], and overlap assignment methods [22], [40], [41]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
Section: A. Offline Diarization (mentioning)
confidence: 99%
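
The cascade described in the statement above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the cited papers: the energy-threshold SAD, the mean-feature "embedding", the greedy cosine clustering, and all function names and thresholds are hypothetical stand-ins for the learned components real systems use. The optional overlap-handling step is left out.

```python
import math

def detect_speech(energies, threshold):
    """Step 1 (SAD): return [start, end) frame ranges with energy above threshold."""
    segments, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

def embed(features, segment):
    """Step 2: hypothetical embedding, the mean feature vector of the segment."""
    start, end = segment
    dim = len(features[0])
    return [sum(features[t][d] for t in range(start, end)) / (end - start)
            for d in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def cluster(embeddings, min_similarity):
    """Step 3: greedy clustering. Each cluster is represented by its first
    member, and the number of clusters (speakers) is not fixed in advance."""
    representatives, labels = [], []
    for emb in embeddings:
        sims = [cosine(emb, rep) for rep in representatives]
        if sims and max(sims) >= min_similarity:
            labels.append(sims.index(max(sims)))
        else:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels

def diarize(energies, features, threshold=0.5, min_similarity=0.8):
    """Run steps 1-3 and return (start_frame, end_frame, speaker_id) tuples."""
    segments = detect_speech(energies, threshold)
    labels = cluster([embed(features, seg) for seg in segments], min_similarity)
    return [(start, end, spk) for (start, end), spk in zip(segments, labels)]

energies = [0.9, 0.9, 0.1, 0.8, 0.8, 0.1, 0.9]
features = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [0, 0], [1, 0.1]]
result = diarize(energies, features)
```

Because the clustering is unsupervised, the number of speaker labels in `result` depends only on the data and the similarity threshold, which mirrors the flexibility the statement attributes to the cascaded approach.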
“…Usually the output frame rate is set to 10 ms, and the output is binary with 1 meaning the presence of overlapped speech and 0 otherwise. Some works combined an overlapped speech detector with a voice activity detector or a speaker counter, thus using more than 2 classes [6,7].…”
Section: Overlapped Speech Detection (mentioning)
confidence: 99%
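
As a rough illustration of the frame-level targets this statement describes, the sketch below turns per-speaker activity segments into labels on a 10 ms grid. The three-class mapping (0 = non-speech, 1 = single speaker, 2 = overlap) is an assumed convention for illustration, not necessarily the one used in the cited works, and the function names are hypothetical.

```python
def frame_labels(segments, duration, hop=0.01):
    """Label each frame of length `hop` seconds:
    0 = non-speech, 1 = one active speaker, 2 = overlapped speech.

    `segments` holds (start_sec, end_sec) pairs, one per speaker turn.
    """
    n_frames = int(round(duration / hop))
    counts = [0] * n_frames
    for start, end in segments:
        first = int(round(start / hop))
        last = min(int(round(end / hop)), n_frames)
        for t in range(first, last):
            counts[t] += 1
    # Any frame with two or more active speakers falls into the overlap class.
    return [min(count, 2) for count in counts]

def to_binary_overlap(labels):
    """Collapse three-class labels to the binary target: 1 = overlap, 0 = otherwise."""
    return [1 if label == 2 else 0 for label in labels]

# Two turns that overlap between 0.02 s and 0.03 s in a 0.06 s recording.
labels = frame_labels([(0.00, 0.03), (0.02, 0.05)], duration=0.06)
binary = to_binary_overlap(labels)
```

The three-class labels carry the voice-activity decision and the overlap decision in a single output, while `to_binary_overlap` recovers the conventional two-class target.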
“…Following this trend, OSD systems based on convolutional layers are becoming frequent [13], granting results as good as the one obtained with recurrent layers, with smaller training duration. Some OSD systems combine recurrent and convolutional layers to improve performances [6]. Finally, the Temporal Convoluted Network (TCN) originally developed for sequence modelling [14] have been adapted for speaker counting in overlapped speech [15].…”
Section: Overlapped Speech Detection (mentioning)
confidence: 99%
“…Speaker diarization 1) Offline diarization: The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
Section: Related Work (mentioning)
confidence: 99%