Interspeech 2018
DOI: 10.21437/interspeech.2018-1893
Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge

Abstract: We describe in this paper the experiences of the Johns Hopkins University team during the inaugural DIHARD diarization evaluation. This new task provided microphone recordings in a variety of difficult conditions and challenged researchers to fully consider all speaker activity, without the currently typical practices of unscored collars or ignored overlapping speaker segments. This paper explores several key aspects of current state-of-the-art diarization methods, such as training data selection, signal ban…

Cited by 182 publications (195 citation statements) | References 13 publications
“…Then, we explain the transfer learning approach followed to perform SER. It is shown in the literature that i-vectors [17] have been used for speaker diarization [18,19,20,21]. In this work, we only exploit the x-vector model because of its superiority over i-vectors [22] and also because it is easy to adapt for downstream tasks.…”
Section: Our Approach
confidence: 99%
“…To obtain diarization results with online RSAN, we performed power-based voice activity detection (VAD) on the extracted streams, using a threshold value common to one meeting. In the evaluation with the simulated meeting-like data, the performance of online RSAN was compared with a system similar to a top-performing system [7] in the DIHARD-1 challenge [2]. For this, we used the off-the-shelf implementation and model from [20].…”
Section: Implementation Details of Online RSAN
confidence: 99%
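The power-based VAD described in this excerpt can be sketched in a few lines: frame the signal, compute per-frame log power, and mark frames above a fixed threshold as speech. This is a minimal illustration, not the cited system's implementation; the frame sizes and the threshold here are hypothetical and would in practice be tuned per recording.

```python
import numpy as np

def power_vad(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Label frames as speech when their log power exceeds a fixed threshold.

    Minimal power-based VAD sketch: one boolean decision per frame.
    """
    decisions_db = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        power = np.mean(frame ** 2) + 1e-12  # small floor avoids log(0)
        decisions_db.append(10.0 * np.log10(power))
    return np.array(decisions_db) > threshold_db

# Toy input: quiet noise followed by a louder "speech" burst
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(1600),
                      0.5 * rng.standard_normal(1600)])
decisions = power_vad(sig)  # False on the quiet half, True on the loud half
```

Real systems typically add hangover smoothing so short power dips inside speech are not cut out; the excerpt's per-meeting threshold plays the role of `threshold_db` here.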
“…Then, the correct association of speaker identity information among blocks, i.e., the diarization result, is estimated by clustering these features using, e.g., agglomerative hierarchical clustering [7]. Although these conventional algorithms can achieve reasonable diarization performance, the results are not guaranteed to be optimal, because the speaker identity feature extraction and clustering steps are performed independently.…”
Section: Introduction
confidence: 99%
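The clustering step this excerpt refers to can be illustrated with SciPy's agglomerative hierarchical clustering on per-segment speaker embeddings. The embeddings below are synthetic stand-ins (two well-separated Gaussian clusters), not the cited systems' actual features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-segment embeddings: 5 segments each from two "speakers"
rng = np.random.default_rng(0)
spk_a = rng.normal(loc=0.0, scale=0.1, size=(5, 16))
spk_b = rng.normal(loc=3.0, scale=0.1, size=(5, 16))
embeddings = np.vstack([spk_a, spk_b])

# Build the dendrogram with Ward linkage, then cut it into two clusters;
# each segment's cluster label serves as its speaker label.
Z = linkage(embeddings, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

In practice the number of speakers is usually unknown, so the dendrogram is cut at a tuned distance threshold (`criterion="distance"`) rather than at a fixed cluster count.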
“…Speaker diarization, the process of partitioning an input audio stream into homogeneous segments according to speaker identity [1][2][3][4] (often referred to as "who spoke when"), is an important pre-processing step for many speech applications. As shown in Figure 1 left, a standard diarization system [5][6][7][8] consists of four steps. (1) Segmentation: this step removes the non-speech portion of the audio with speech activity detection (SAD), and the speech regions are further cut into short segments.…”
Section: Introduction
confidence: 99%
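The standard four-step pipeline this excerpt outlines (SAD, segmentation, per-segment embedding extraction, clustering) can be sketched as a skeleton where each stage is a pluggable function. The helper names (`sad`, `embed`, `cluster`) are hypothetical placeholders for real models, not any specific system's API:

```python
def diarize(audio, sad, embed, cluster, seg_len=150):
    """Skeleton of a standard 4-step diarization pipeline.

    1. `sad(audio)` returns speech regions as (start, end) sample pairs.
    2. Each region is cut into uniform short segments.
    3. `embed(segment)` extracts one speaker embedding per segment.
    4. `cluster(embeddings)` assigns a speaker label to each segment.
    """
    segments = []
    for start, end in sad(audio):                    # 1. speech activity detection
        for s in range(start, end, seg_len):         # 2. uniform segmentation
            segments.append((s, min(s + seg_len, end)))
    embeddings = [embed(audio[s:e]) for s, e in segments]  # 3. embedding per segment
    labels = cluster(embeddings)                           # 4. speaker clustering
    return list(zip(segments, labels))

# Toy plug-ins, purely to exercise the skeleton: "speakers" are constant levels,
# the "embedding" is the segment mean, and "clustering" thresholds that mean.
audio = [1] * 300 + [5] * 300
result = diarize(audio,
                 sad=lambda a: [(0, len(a))],
                 embed=lambda seg: sum(seg) / len(seg),
                 cluster=lambda embs: [0 if e < 3 else 1 for e in embs])
# result pairs each (start, end) segment with a speaker label
```

Keeping the stages as separate functions mirrors the modularity of real systems, and also the excerpt's earlier caveat: because each stage is optimized independently, the combined result is not guaranteed to be globally optimal.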
“…We refer to the PyTorch implementation of Faster R-CNN in [36]. We also train an x-vector model on single-channel data of Mixer 6 + SRE + SWBD as a fair comparison, but the performance is slightly worse than Kaldi's diarization model (http://kaldi-asr.org/models/m6).…”
confidence: 99%