Speaker recognition based on deep learning: An overview

Bai, Zhongxin; Zhang, Xiaolei

doi:10.1016/j.neunet.2021.03.004

Cited by 243 publications

(116 citation statements)

References 201 publications


“…Their research served as a short survey of the analytical inquiries and the explications of the speaker recognition domain. Zhongxin Bai et al. [27] review various significant speaker recognition subdomains such as speaker identification, verification, and diarization, focusing on deep-learning-based approaches. Modern and newly published deep-learning-based feature extraction approaches and ASR algorithms are extensively explained in this paper.…”

Section: Reference Year Main Purpose Challenges (mentioning)

confidence: 99%


A Survey of Speaker Recognition: Fundamental Theories, Recognition Methods and Opportunities

et al. 2021


Humans can identify a speaker by listening to their voice, whether over the telephone or on any digital device. To emulate this innate human ability, authentication technologies based on voice biometrics, such as automatic speaker recognition (ASR), have been introduced. An ASR system recognizes speakers by analyzing speech signals and characteristics extracted from speakers' voices. ASR has recently become an active research area as an essential aspect of voice biometrics. Specifically, this literature survey gives a concise introduction to ASR, provides an overview of the general architectures used in speaker recognition technologies, and outlines the past, present, and future research trends in this area. This paper briefly describes all the main aspects of ASR, such as speaker identification, verification, and diarization. Further, the performance of current speaker recognition systems is investigated in this survey, along with their limitations and possible ways of improvement. Finally, a few unsolved challenges of speaker recognition are presented at the close of this survey.


“…In stagewise speaker recognition systems, recognition tasks such as speaker identification, speaker verification, or speaker diarization are processed in two stages: front end and back end [27]. Various algorithms are employed in the front end and back end to complete the speaker recognition task.…”

Section: A Stagewise Speaker Recognition (mentioning)

confidence: 99%

A Survey of Speaker Recognition: Fundamental Theories, Recognition Methods and Opportunities

et al. 2021


“…In the last five years, deep learning methods have been demonstrated to outperform most of the classical speech and speaker recognition systems such as GMM-Universal Background Model (UBM), SVM, and i-vector [32, 33]. However, deep learning systems require huge speech databases to be labeled and trained; these databases also need to include phonetically rich sentences or at least phonetically balanced sentences [31].…”

Section: Related Work (mentioning)

confidence: 99%

A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

Al-Qaderi

Lahamer

Rad

2021

Sensors


We present a new architecture to address the challenges of speaker identification that arise in the interaction of humans with social robots. Though deep learning systems have led to impressive performance in many speech applications, limited speech data at the training stage and short utterances with background noise at the test stage present challenges and are still open problems, as no optimum solution has been reported to date. The proposed design employs a generative model, namely the Gaussian mixture model (GMM), and a discriminative model, the support vector machine (SVM), as classifiers, together with prosodic features and short-term spectral features, to concurrently classify a speaker’s gender and identity. The proposed architecture works in a semi-sequential manner consisting of two stages: the first classifier exploits the prosodic features to determine the speaker’s gender, which in turn is used with the short-term spectral features as input to the second classifier system in order to identify the speaker. The second classifier system employs two types of short-term spectral features, namely mel-frequency cepstral coefficients (MFCC) and gammatone frequency cepstral coefficients (GFCC), as well as gender information, as inputs to two different classifiers (GMM and GMM supervector-based SVM), which in total leads to the construction of four classifiers. The outputs from the second-stage classifiers, namely the GMM-MFCC maximum likelihood classifier (MLC), GMM-GFCC MLC, GMM-MFCC supervector SVM, and GMM-GFCC supervector SVM, are fused at score level by the weighted Borda count approach. The weight factors are computed on the fly via a Mamdani fuzzy inference system whose inputs are the signal-to-noise ratio and the length of the utterance.
Experimental evaluations suggest that the proposed architecture and the fusion framework are promising and can improve the recognition performance of the system in challenging environments where the signal-to-noise ratio is low, and the length of utterance is short; such scenarios often arise in social robot interactions with humans.
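
The score-level fusion described in this abstract can be illustrated with a minimal sketch of a weighted Borda count. This is not the authors' implementation: the function names, the speaker labels, the example scores, and the fixed weights are all hypothetical, and the sketch assumes that a higher score means a better match. In the paper the weights come from a Mamdani fuzzy inference system; here they are simply passed in as numbers.

```python
from typing import Dict, List


def weighted_borda(
    scores_per_classifier: List[Dict[str, float]],
    weights: List[float],
) -> Dict[str, float]:
    """Return the weighted Borda total for each candidate speaker.

    Each classifier ranks the candidates by its own score (assumed:
    higher is better). With n candidates, the top-ranked one earns
    n - 1 points, the next n - 2, and so on down to 0. Points are
    multiplied by that classifier's weight and summed over classifiers.
    """
    if len(scores_per_classifier) != len(weights):
        raise ValueError("need exactly one weight per classifier")

    totals: Dict[str, float] = {}
    for scores, weight in zip(scores_per_classifier, weights):
        # Candidates ordered from best to worst for this classifier.
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for rank, speaker in enumerate(ranked):
            totals[speaker] = totals.get(speaker, 0.0) + weight * (n - 1 - rank)
    return totals


def fuse(
    scores_per_classifier: List[Dict[str, float]],
    weights: List[float],
) -> str:
    """Return the speaker with the highest weighted Borda total."""
    totals = weighted_borda(scores_per_classifier, weights)
    return max(totals, key=totals.get)


# Hypothetical scores from two classifiers over three enrolled speakers.
example_scores = [
    {"spk_a": 0.9, "spk_b": 0.5, "spk_c": 0.1},
    {"spk_a": 0.2, "spk_b": 0.8, "spk_c": 0.3},
]
```

Calling `fuse(example_scores, [0.7, 0.3])` favors the first classifier's ranking, while swapping the weights favors the second. Because only ranks enter the fusion, classifiers whose raw scores live on different scales (for example likelihoods and SVM margins) can be combined without score normalization.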


“…More specifically, a GMM-Universal Background Model (UBM) was used by [3] to predict PD severity in a longitudinal study. Yet the current trend has now shifted to the use of deep neural networks (DNN) [6]. Indeed, many recent performance advancements in speaker recognition and verification tasks are achieved through the use of x-vectors and other similar embedding approaches [73, 10].…”

Section: Introduction (mentioning)

confidence: 99%

Detecting a History of Repetitive Head Impacts from a Short Voice Recording

Tauro

Ravanelli

Droppelmann

2021

Preprint


Repetitive head impacts (RHI) are associated with an increased risk of developing various neurodegenerative disorders, such as Alzheimer's disease (AD), Parkinson's disease (PD), and most notably, chronic traumatic encephalopathy (CTE). While the clinical presentation of AD and PD is well established, CTE can only be diagnosed post-mortem. Therefore, a distinction can be made between the pathologically defined CTE and RHI-related functional or structural brain changes (RHI-BC) which may result in CTE. Unfortunately, there are currently no accepted biomarkers of CTE or RHI-BC, a major hurdle to achieving clinical diagnoses. Interestingly, speech has shown promise as a potential biomarker of both AD and PD, being used to accurately distinguish individuals with AD and PD from those without. Given the overlapping symptoms between CTE, RHI-BC, PD and AD, we aimed to determine if speech could be used to distinguish individuals with a history of RHI from those without. We therefore created the Verus dataset, consisting of 13-second voice recordings from 605 professional fighters (RHI group) and 605 professional athletes in non-contact sports (control group) for a total of 1210 recordings. Using a deep learning approach, we achieved 85% accuracy in detecting individuals with a history of RHI. We then fine-tuned our model trained on the Verus dataset on publicly available AD and PD speech datasets and achieved new state-of-the-art accuracies of 84.99% on the AD dataset and 89% on the PD dataset. Finding a biomarker of CTE and RHI-BC that presents early in disease progression is critical to improve risk management and patient outcome. Our study is the first we are aware of to investigate speech as such a candidate biomarker of RHI-BC.
