Look Who’s Not Talking

Kwon, Youngki; Heo, Hee Soo; Huh, Jaesung; Lee, Bong-Jin; Chung, Joon Son

doi:10.1109/slt48900.2021.9383502

Cited by 5 publications

(5 citation statements)

References 38 publications

“…Therefore, magnitudes can also be utilized for the out-of-distribution detection in the embedding space. This property can possibly be employed as an additional layer of defense to compensate for failures of a voice activity detector, as suggested by [22].…”

Section: Discussion (mentioning)

confidence: 99%

Magnitude-aware Probabilistic Speaker Embeddings

Kuzmin, Fedorov, Sholokhov (2022). Preprint.

Recently, hyperspherical embeddings have established themselves as a dominant technique for face and voice recognition. Specifically, Euclidean space vector embeddings are learned to encode person-specific information in their direction while ignoring the magnitude. However, recent studies have shown that the magnitudes of the embeddings extracted by deep neural networks may indicate the quality of the corresponding inputs. This paper explores the properties of the magnitudes of the embeddings related to quality assessment and out-of-distribution detection. We propose a new probabilistic speaker embedding extractor using the information encoded in the embedding magnitude and leverage it in the speaker verification pipeline. We also propose several quality-aware diarization methods and incorporate the magnitudes in those. Our results indicate significant improvements over magnitude-agnostic baselines both in speaker verification and diarization tasks.
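The abstract above says that identity is encoded in the direction of an embedding while the magnitude may reflect input quality. A minimal sketch of that split follows, assuming NumPy; the helper name `split_by_magnitude`, the toy vectors, and the threshold value are hypothetical, and this is not the probabilistic extractor the authors propose.

```python
import numpy as np

def split_by_magnitude(embeddings, threshold):
    """Separate each embedding into magnitude and direction.

    Returns (norms, unit_vectors, keep_mask), where keep_mask marks
    embeddings whose magnitude reaches the (assumed) threshold.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1)
    # Direction only: divide by the norm, guarding against zero vectors.
    unit_vectors = embeddings / np.maximum(norms, 1e-12)[:, None]
    keep_mask = norms >= threshold
    return norms, unit_vectors, keep_mask

# Toy embeddings; a real system would take these from a speaker encoder.
emb = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 2.0]])
norms, unit, keep = split_by_magnitude(emb, threshold=1.0)
print(norms, keep)
```

In this toy case the second vector points in the same direction as the first but has a much smaller magnitude, so it is the one flagged as low quality.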


“…It is worth mentioning that the magnitudes of speaker embeddings were already successfully applied for the voice activity detection task [22]. However, this work lacks any qualitative analysis of embedding magnitudes properties.…”

Section: Magnitude-aware Embeddings (mentioning)

confidence: 99%

Magnitude-aware Probabilistic Speaker Embeddings

Kuzmin, Fedorov, Sholokhov (2022). Preprint.

“…Therefore, instead of tuning the threshold for each domain data, we adopt clustering with a silhouette coefficient trick. Some studies [10,11,24,25] already composed their clustering-based SD systems using silhouette coefficient, and those systems show superior performance on various datasets without threshold tuning.…”

Section: Initial Clustering Phase (mentioning)

confidence: 99%

“…Speaker diarisation (SD), which segments input audio to short utterances according to speaker identity, is going through a rapid breakthrough [1,2]. Based on the success of recent SD systems [3–12], online SD systems are also being developed [13–20]. In an online SD system, the system should decide the speaker label of a given short segment leveraging only current and past segments, where only a part of past segments are available.…”

Section: Introduction (mentioning)

confidence: 99%

Absolute decision corrupts absolutely: conservative online speaker diarisation

Kwon, Heo, Lee et al. (2022). Preprint.

Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real-time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in realtime. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD II and III datasets, where it is also competitive in AMI and VoxConverse test sets.
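The abstract above mentions silhouette coefficients as a way to estimate the number of speakers. The sketch below illustrates only that general idea: it scores candidate labellings of the same points by mean silhouette and keeps the best one. It assumes NumPy and Euclidean distance; the function names `mean_silhouette` and `pick_labelling` and the toy data are hypothetical, and the checkpoint and centroid buffers described by the authors are not modelled.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    # Pairwise distance matrix.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score 0 by convention
        # a: mean distance to the other members of the own cluster.
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        if denom > 0:
            scores[i] = (b - a) / denom
    return scores.mean()

def pick_labelling(points, candidates):
    """Return the candidate labelling with the highest mean silhouette."""
    return max(candidates, key=lambda labels: mean_silhouette(points, labels))

# Two well-separated pairs: a two-cluster labelling should win.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
best = pick_labelling(points, [[0, 0, 1, 1], [0, 0, 1, 2]])
print(best)
```

Because the score is computed per labelling rather than against a distance threshold, the same selection rule can be reused across datasets without retuning, which is the motivation the citing text gives for this family of methods.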


“…The former "divides-and-conquers" speaker diarisation into several subtasks. The exact configuration differs from system to system, but in general they consist of speech activity detection (SAD), embedding extraction and clustering [2–4]. The latter directly segments audio recordings into homogeneous speaker regions using deep neural networks [5–8].…”

Section: Introduction (mentioning)

confidence: 99%

Disentangled dimensionality reduction for noise-robust speaker diarisation

Kim, Heo, Jung et al. (2021). Preprint. Self-citation.

The objective of this work is to train noise-robust speaker embeddings for speaker diarisation. Speaker embeddings play a crucial role in the performance of diarisation systems, but they often capture spurious information such as noise and reverberation, adversely affecting performance. Our previous work has proposed an autoencoder-based dimensionality reduction module to help remove the spurious information. However, it does not explicitly separate such information and has also been found to be sensitive to hyperparameter values. To this end, we propose two contributions to overcome these issues: (i) a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings; (ii) the use of a speech/non-speech indicator to prevent the speaker code from learning from the background noise. Through a range of experiments conducted on four different datasets, our approach consistently demonstrates state-of-the-art performance among models that do not adopt ensembles.
