Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection

Bullock, Latané; Bredin, Hervé; García-Perera, Leibny Paola

doi:10.1109/icassp40776.2020.9053096

Cited by 65 publications

(73 citation statements)

References 18 publications


“…We evaluated two approaches for assigning a second speaker. A heuristic that considers the two closest speakers in time [25] and, based on [26], an approach where the second most-likely speaker of the output of VB-HMM diarization is used to provide the second label, but applied using x-vectors as input frames instead of mel-frequency cepstral coefficients. Given the current pipeline, obtaining the second label is quite straightforward as we simply need to output the two most likely speakers for each frame.…”

Section: Overlapped Speech Handling (mentioning)

confidence: 99%
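
The second-label idea in the citation statement above (output the two most likely speakers for each frame) can be sketched as follows. This is a minimal illustration, not the BUT implementation: the function name `assign_speakers`, the per-frame posterior lists, and the overlap mask are all assumptions introduced for the example, and the overlap mask is assumed to come from a separate overlapped speech detector.

```python
def assign_speakers(posteriors, overlap_mask):
    """Pick the most likely speaker per frame, plus a second one on overlap frames.

    posteriors   -- list of per-frame lists, one probability per speaker (hypothetical input)
    overlap_mask -- list of booleans, True where overlapped speech was detected
    """
    first, second = [], []
    for frame_post, is_overlap in zip(posteriors, overlap_mask):
        # Rank speaker indices by descending posterior for this frame.
        ranked = sorted(range(len(frame_post)), key=lambda s: -frame_post[s])
        first.append(ranked[0])
        # A second label is emitted only on frames flagged as overlap.
        second.append(ranked[1] if is_overlap and len(ranked) > 1 else None)
    return first, second


# Toy example: 3 frames, 3 speakers; frames 1 and 2 are flagged as overlap.
posteriors = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.3, 0.6],
]
overlap = [False, True, True]
first, second = assign_speakers(posteriors, overlap)
```

Frames without detected overlap keep a single label, so the sketch only changes the output where the detector fires.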

Analysis of the BUT Diarization System for VoxConverse Challenge

Landini, Glembek, Matějka, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclustering step based on per-speaker global embeddings and overlapped speech detection and handling. We provide comparisons for each of the steps and share the implementation of the most relevant modules of our system. Our system scored second in the challenge in terms of the primary metric (diarization error rate) and first according to the secondary metric (Jaccard error rate).


“…(3) We show that the proposed diarization method achieves state-of-the-art accuracy, slightly outperforming the previous best result [3] on the AMI Headset Mix corpus.…”

Section: Introduction (mentioning)

confidence: 91%

“…Multi-person speaker diarization: The most common approach to speaker diarization with simultaneous speech is to use an overlapping speech detector; for those segments that contain overlap, the set of speakers can be estimated [3,9,10,11,12]. For the latter step, one approach is to select the top k closest speakers in the embedding space.…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, to test whether x contains speaker a, speaker b, or speakers a&b, we compare $f(x)$ to $\bar{e}_a$, $\bar{e}_b$, $\bar{e}_{ab}$ and output the set that minimizes the distance. Alternatively, if $|T| = k$ is already known, we can pick the k speakers whose enrollments $\bar{e}_s$ are closest to $f(x)$; this is the method used in [3] during overlapping speech.…”

Section: Experiments 1: Multi-person Speaker Identification (Librispeech) (mentioning)

confidence: 99%
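
The second strategy in the statement above (pick the k speakers whose enrollment embeddings are closest to the embedding of the input) can be illustrated with a short sketch. The quoted text does not specify the distance, so Euclidean distance is an assumption here, as are the function names and the toy two-dimensional enrollment vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_closest_speakers(embedding, enrollments, k):
    """Return the set of k speakers whose enrollment embedding is nearest to `embedding`.

    embedding   -- f(x), the embedding of the input segment (hypothetical values below)
    enrollments -- dict mapping speaker id to its enrollment embedding
    k           -- number of simultaneous speakers, assumed known in advance
    """
    ranked = sorted(enrollments, key=lambda s: euclidean(embedding, enrollments[s]))
    return set(ranked[:k])


# Toy enrollments for three speakers and one segment embedding.
enrollments = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
x = [0.6, 0.5]
two_speakers = k_closest_speakers(x, enrollments, k=2)
one_speaker = k_closest_speakers(x, enrollments, k=1)
```

Unlike the compositional approach, which also compares against an embedding for the speaker set a&b, this sketch only ranks single-speaker enrollments and relies on k being supplied.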

“…[15], one of the largest diarization datasets. In the test set, 81% of the speech is non-overlapping, and 19% is overlapping [3].…”

Section: Experiments 2: Speaker Diarization (Ami) (mentioning)

confidence: 99%


Compositional Embedding Models for Speaker Identification and Diarization with Simultaneous Speech From 2+ Speakers

Whitehill 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We propose a new method for speaker diarization that can handle overlapping speech with 2+ people. Our method is based on compositional embeddings [1]: Like standard speaker embedding methods such as x-vector [2], compositional embedding models contain a function f that separates speech from different speakers. In addition, they include a composition function g to compute set-union operations in the embedding space so as to infer the set of speakers within the input audio. In an experiment on multi-person speaker identification using synthesized LibriSpeech data, the proposed method outperforms traditional embedding methods that are only trained to separate single speakers (not speaker sets). In a speaker diarization experiment on the AMI Headset Mix corpus, we achieve state-of-the-art accuracy (DER=22.93%), slightly better than the previous best result (23.82% from [3]).
