Spherediar: An Effective Speaker Diarization System for Meeting Data

Kaseva, Tuomas; Rouhe, Aku; Kurimo, Mikko

doi:10.1109/asru46091.2019.9003967

Cited by 4 publications

(7 citation statements)

References 27 publications

“…We publish a new, end-to-end SLI toolkit for running multiple SLI experiments on multiple datasets, implement seven existing SLI architectures on our toolkit, and run experiments on three SLI datasets. We implement the SphereSpeaker speaker recognition architecture [14] on our toolkit and apply it to SLI for the first time. We release our toolkit online as free open source software.…”

Section: Contributions Of This Paper (mentioning)

confidence: 99%

“…4. SphereSpeaker architecture, that has recently been successful for speaker recognition [14], now applied to SLI.…”

Section: End-to-end Experiments (mentioning)

confidence: 99%

“…See Figure 1 for an example on channel dropout applied on FBANK input. [4,27,6,11], partially used in 2 out of 7 cases [19,14], and not used in [28]. Therefore, we de-…”

Section: End-to-end Experiments (mentioning)

confidence: 99%

“…For model 3, we choose x as the output of the first FC layer after the second BGRU layer. For model 4, x is the L2-normalized output of the SphereSpeaker embedding layer [14]. Classification For all three training and test sets, we feed X to the seven different, trained end-to-end models and collect new training and test sets of language vectors x.…”

Section: Back-end Classifiers (mentioning)

confidence: 99%

“…An alternative way of discovering embedding spaces is to explicitly map the embedded vectors onto a hypersphere by L2-normalization, where the angular distance of embedding vectors imply class similarity. This approach has outperformed i-vector based systems both in SLI [13] and speaker recognition [14].…”

Section: Introduction (mentioning)

confidence: 96%
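The hypersphere mapping mentioned in the statement above can be illustrated with a minimal sketch. It is not taken from either cited paper: the function names `l2_normalize` and `angular_distance` and the toy vectors are assumptions chosen for illustration. Each embedding is scaled to unit length, and the angle between two embeddings is then used as the dissimilarity measure.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each vector (last axis) to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)  # eps guards against division by zero

def angular_distance(a, b):
    """Angle in radians between two embedding vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    cos = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)  # clip rounding overshoot
    return np.arccos(cos)

# Toy embeddings (hypothetical values): rows 0 and 1 point in the same
# direction, row 2 is orthogonal to row 0.
emb = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
unit = l2_normalize(emb)

print(np.linalg.norm(unit, axis=1))      # every row has unit length
print(angular_distance(emb[0], emb[1]))  # close to 0: same direction
print(angular_distance(emb[0], emb[2]))  # close to pi/2: orthogonal
```

After normalization the vector length carries no information, so only the direction of an embedding separates the classes.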

Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets

Lindgren, Jauhiainen, Kurimo

2020

Interspeech 2020

Self Cite

In this paper, we propose a software toolkit for easier end-to-end training of deep learning based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare if the open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides best language embeddings for the backend classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models.
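The abstract does not say how the random frequency channel dropout is implemented. The sketch below shows one plausible reading, under stated assumptions: a whole frequency channel is zeroed across all frames, each channel is dropped independently, and the function name `channel_dropout` and the parameter `drop_prob` are hypothetical rather than taken from the toolkit.

```python
import numpy as np

def channel_dropout(fbank, drop_prob=0.2, rng=None):
    """Zero out randomly chosen frequency channels of a (frames, channels) array.

    Assumption: a dropped channel is zeroed for every frame, and channels
    are dropped independently with probability drop_prob.
    """
    rng = np.random.default_rng() if rng is None else rng
    fbank = np.asarray(fbank, dtype=float)
    keep = rng.random(fbank.shape[1]) >= drop_prob  # one decision per channel
    return fbank * keep[np.newaxis, :], keep

# Synthetic stand-in for FBANK features: 100 frames, 40 channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 40))
augmented, keep = channel_dropout(features, drop_prob=0.25, rng=rng)

print(augmented.shape)           # same shape as the input
print(int((~keep).sum()), "channels zeroed")
```

Dropout of this kind would be applied only during training, which is consistent with the abstract describing it as a way to increase model robustness.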

Multimodal System for Audio Scene Source Counting and Analysis

Nigro, Krishnan

2022

IEEE/ACM Trans. Audio Speech Lang. Process.

This thesis explores audio scene analysis (ASA) for determining the number of active sources in an audio scene, a task that is defined as audio source counting. A first of its kind dataset called SARdB is produced with audio and text modalities, and annotations for the number of speakers and the number of sound events present in an audio recording. For speaker counting, an audio-based ResNet-34 and text-based Bidirectional Long Short-Term Memory (BLSTM) network set a baseline prediction accuracy of 46.03% and 89.57% when considering a margin of error of one speaker, while outperforming various state-of-the-art systems in speaker counting. Another audio-based ResNet-34 model demonstrates the optimal result for sound event counting at 50.55% prediction accuracy and 86.59% accuracy with a margin of error of one sound event. The proposed method for source counting is also shown to perform in real-time with an overall processing time of ∼0.4614s.
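The "margin of error of one" accuracy quoted in the abstract can be read as the fraction of recordings whose predicted count is within one of the true count. The sketch below assumes that reading. The function name `counting_accuracy` and the toy counts are hypothetical and are not drawn from SARdB.

```python
import numpy as np

def counting_accuracy(true_counts, predicted_counts, margin=0):
    """Fraction of predictions whose absolute error is at most `margin`."""
    true_counts = np.asarray(true_counts)
    predicted_counts = np.asarray(predicted_counts)
    return float(np.mean(np.abs(true_counts - predicted_counts) <= margin))

# Toy example (made-up counts, not results from the paper).
true_counts = [1, 2, 3, 4, 2]
predicted_counts = [1, 3, 3, 2, 2]

exact = counting_accuracy(true_counts, predicted_counts, margin=0)
within_one = counting_accuracy(true_counts, predicted_counts, margin=1)
print(exact, within_one)  # 3 of 5 exact, 4 of 5 within one
```

With `margin=0` the measure reduces to ordinary classification accuracy, so the margin-of-one figure can never be lower than the exact figure.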

Conditional Spoken Digit Generation with StyleGAN

Palkama, Juvela, Ilin

2020

Interspeech 2020

This paper adapts a StyleGAN model for speech generation with minimal or no conditioning on text. StyleGAN is a multiscale convolutional GAN capable of hierarchically capturing data structure and latent variation on multiple spatial (or temporal) levels. The model has previously achieved impressive results on facial image generation, and it is appealing to audio applications due to similar multi-level structures present in the data. In this paper, we train a StyleGAN to generate melspectrograms on the Speech Commands dataset, which contains spoken digits uttered by multiple speakers in varying acoustic conditions. In a conditional setting our model is conditioned on the digit identity, while learning the remaining data variation remains an unsupervised task. We compare our model to the current unsupervised state-of-the-art speech synthesis GAN architecture, the WaveGAN, and show that the proposed model outperforms according to numerical measures and subjective evaluation by listening tests.
