Designing an Effective Metric Learning Pipeline for Speaker Diarization

Narayanaswamy, Vivek; Thiagarajan, Jayaraman J.; Song, Huan; Spanias, Andreas

doi:10.1109/icassp.2019.8682255

Cited by 19 publications

(13 citation statements)

References 16 publications

(33 reference statements)

“…Joint modeling methods have been studied in an effort to alleviate the complex preparation process and take into account the dependencies between these models. They include, for example, joint modeling of x-vector extraction and PLDA scoring [16,31] and joint modeling of SAD and speaker embedding [32]. However, the clustering process has remained unchanged because it is an unsupervised process.…”

Section: Clustering-based Methods (mentioning)

confidence: 99%

End-to-End Neural Speaker Diarization with Self-Attention

Fujita

Kanda

Horiguchi

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems: (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for dealing with the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method was even better than the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.

Index Terms: speaker diarization, neural network, end-to-end, self-attention

arXiv:1909.06247v1 [eess.AS], 13 Sep 2019

[Figure: (a) X-vector clustering-based method (SAD, MFCC, x-vector extraction, PLDA scoring, AHC, diarization result; models: SAD neural network, x-vector neural network, same/diff covariance matrices); (b) EEND method (log-mel input, EEND neural network, joint speech activity detection of all speakers, diarization result)]
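
The abstract above contrasts BLSTM, which passes information frame to frame, with self-attention, where every frame is conditioned directly on all other frames. The following is a minimal sketch of that idea only, not the EEND implementation: it assumes a single attention head with identity query/key/value projections (a real model learns these), and the function names and toy frame values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product self-attention (identity projections).

    frames: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[t][u] is how much frame t
    attends to frame u, so each row covers every frame in the recording.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in frames:
        # Compare this frame with every frame, however far apart in time.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        row = softmax(scores)
        weights.append(row)
        # Output is a weighted average of all frames.
        outputs.append([sum(w * f[i] for w, f in zip(row, frames))
                        for i in range(dim)])
    return outputs, weights

# Toy input (assumed values): frames 0, 1 and 3 resemble each other,
# frame 2 does not.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
outputs, weights = self_attention(frames)
```

In this toy input, frame 0 places more weight on the distant but similar frame 3 than on the dissimilar frame 2, which illustrates how attention can relate frames of the same speaker regardless of their distance in time.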

“…Another field where deep metric learning has achieved successful results is the processing of audio signals [50]. The authors in [57] exploited Triplet and Quadruple networks for speaker diarization. They utilized different sampling strategies and margin parameter selection to observe their effect on diarization performance.…”

Section: Deep Metric Learning Problems (mentioning)

confidence: 99%

Deep Metric Learning: A Survey

2019

Metric learning aims to measure the similarity among samples while using an optimal distance metric for learning tasks. Metric learning methods, which generally use a linear projection, are limited in solving real-world problems that exhibit non-linear characteristics. Kernel approaches are utilized in metric learning to address this problem. In recent years, deep metric learning, which provides a better solution for nonlinear data through activation functions, has attracted researchers' attention in many different areas. This article aims to reveal the importance of deep metric learning and the problems dealt with in this field in the light of recent studies. Most existing studies in this field are inspired by Siamese and Triplet networks, which use shared weights to relate samples to one another. The success of these networks is based on their capacity to understand the similarity relationship among samples. Moreover, the sampling strategy, an appropriate distance metric, and the structure of the network are the challenging factors for researchers seeking to improve the performance of the network model. This article is considered to be important, as it is the first comprehensive study in which these factors are systematically analyzed and evaluated as a whole and supported by comparing the quantitative results of the methods.
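
The Triplet networks and margin parameter mentioned in this entry can be illustrated with the standard triplet margin loss. This is a minimal sketch under stated assumptions, not the formulation used in any of the cited papers: it uses plain (non-squared) Euclidean distance, and the function names, toy embeddings, and the margin value of 0.2 are hypothetical choices.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (assumed value).
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)

# Easy triplet: same-speaker embedding is close, other speaker is far.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])

# Hard triplet: the other speaker is closer than the same speaker.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])
```

Easy triplets contribute no gradient, which is one reason the choice of sampling strategy matters: training progresses only on triplets that still violate the margin.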

“…The Coswara project was one of the first publicly available COVID-19 audio datasets and remains unique in its wide variety of sounds collected. Utilizing classical features such as MFCCs [37,38], spectral centroid and mean square energy to train a random forest classifier for the sound classification task, the authors report a test accuracy of 66%. More recently, Imran et al [2] developed tools that utilize CNNs trained with mel spectrograms for cough detection followed by model ensembling to determine whether or not the sample belonged to a COVID-19 patient.…”

Section: Related Work (mentioning)

confidence: 99%

COVID-19 detection using cough sound analysis and deep learning algorithms

Rao

Narayanaswamy

Esposito

et al. 2022

IDT

Self Cite

Reliable and rapid non-invasive testing has become essential for COVID-19 diagnosis and tracking statistics. Recent studies motivate the use of modern machine learning (ML) and deep learning (DL) tools that utilize features of coughing sounds for COVID-19 diagnosis. In this paper, we describe system designs that we developed for COVID-19 cough detection with the long-term objective of embedding them in a testing device. More specifically, we use log-mel spectrogram features extracted from the coughing audio signal and design a series of customized deep learning algorithms to develop fast and automated diagnosis tools for COVID-19 detection. We first explore the use of a deep neural network with fully connected layers. Additionally, we investigate prospects of efficient implementation by examining the impact on the detection performance by pruning the fully connected neural network based on the Lottery Ticket Hypothesis (LTH) optimization process. In general, pruned neural networks have been shown to provide similar performance gains to that of unpruned networks with reduced computational complexity in a variety of signal processing applications. Finally, we investigate the use of convolutional neural network architectures and in particular the VGG-13 architecture which we tune specifically for this application. Our results show that a unique ensembling of the VGG-13 architecture trained using a combination of binary cross entropy and focal losses with data augmentation significantly outperforms the fully connected networks and other recently proposed baselines on the DiCOVA 2021 COVID-19 cough audio dataset. Our customized VGG-13 model achieves an average validation AUROC of 82.23% and a test AUROC of 78.3% at a sensitivity of 80.49%.
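
The abstract above mentions training with a combination of binary cross entropy and focal losses but does not say how the two are combined. The sketch below shows one plausible reading for a single prediction: a weighted sum. The mixing weight, the function names, and the omission of the class-balancing factor from the focal loss are assumptions made for illustration; gamma = 2.0 is a commonly used default for the focusing parameter.

```python
import math

EPS = 1e-12  # keeps log() away from zero

def bce_loss(p, y):
    """Binary cross entropy for predicted probability p and label y in {0, 1}."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def focal_loss(p, y, gamma=2.0):
    """Focal loss without the class-balancing factor.

    p_t is the probability assigned to the true class; the factor
    (1 - p_t) ** gamma shrinks the loss on well-classified examples.
    """
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def combined_loss(p, y, weight=0.5, gamma=2.0):
    """Hypothetical weighted sum of the two losses (weight is an assumption)."""
    return weight * bce_loss(p, y) + (1.0 - weight) * focal_loss(p, y, gamma)

# A confident correct prediction is down-weighted by the focal term.
confident_bce = bce_loss(0.9, 1)
confident_focal = focal_loss(0.9, 1)
```

With gamma set to 0 the focal loss reduces to binary cross entropy, so the focusing parameter controls how strongly easy examples are discounted.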
