Analysis of DNN approaches to speaker identification

Matějka, Pavel; Glembek, Ondřej; Novotný, Ondřej; Plchot, Oldřich; Grézl, František; Burget, Lukáš; Černocký, Jan

doi:10.1109/icassp.2016.7472649

Cited by 66 publications

(72 citation statements)

References 11 publications

“…So, Deep Neural Network (DNN) has also a perfect and successful entry into the field of audio signal processing. DNN was introduced into the field of speaker identification as a successor of Automatic Speech Recognition (ASR) which was a comprehensive success [19].…”

Section: Deep Neural Network

mentioning

confidence: 99%

“…Ali et al. [22] studied the use of features from distinct levels of Deep Belief Network (DBN) to quantize the audio data into vectors of "audio-word counts". Table 6 demonstrates speaker identification performance based on each of GMM-DNN and the three different prior work using the ESD. This table yields 5.8%, 3.9%, and 5.6% improvement rate in speaker identification performance based on the proposed classifier over that based on DNN-BN [19], single DNN [21], and DBN [22], respectively. Hence, GMM-DNN offers a robust and computationally efficient novel classification technique for "speaker identification in emotional environments".…”

mentioning

confidence: 93%

“…This experiment has been conducted to show the relevance of the proposed GMM-DNN as a classifier to enhance speaker identification performance in emotional environments and to compare it with other classifiers in the literature [19], [21], [22]. Matejka et al. [19] studied utilizing Deep Neural Network Bottleneck (DNN-BN) features together with MFCCs in the task of i-vector-based speaker recognition. Richardson et al. [21] presented the application of single DNN for both speaker recognition and language recognition using the "2013 Domain Adaptation Challenge speaker recognition (DAC13)" and the "NIST 2011 Language Recognition Evaluation (LRE11)" benchmarks. Ali et al. [22] studied the use of features from distinct levels of Deep Belief Network (DBN) to quantize the audio data into vectors of "audio-word counts". Table 6 demonstrates speaker identification performance based on each of GMM-DNN and the three different prior work using the ESD.…”

mentioning

confidence: 99%

Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments

Shahin, Nassif, Hamsa. 2018. Neural Comput & Applic

This research is an effort to present an effective approach to enhance text-independent speaker identification performance in emotional talking environments based on a novel classifier called cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). Our current work focuses on proposing, implementing and evaluating a new approach for speaker identification in emotional talking environments based on cascaded Gaussian Mixture Model-Deep Neural Network as a classifier. The results point out that the cascaded GMM-DNN classifier improves speaker identification performance at various emotions using two distinct speech databases: the Emirati speech database (Arabic United Arab Emirates dataset) and the "Speech Under Simulated and Actual Stress (SUSAS)" English dataset. The proposed classifier outperforms classical classifiers such as Multilayer Perceptron (MLP) and Support Vector Machine (SVM) in each dataset. Speaker identification performance that has been attained based on the cascaded GMM-DNN is similar to that acquired from subjective assessment by human listeners.

Keywords: deep neural network; emotional talking environments; Gaussian mixture model; speaker identification.

Introduction

Speaker recognition and its subdivisions, "speaker identification and speaker verification", need to be redefined on the basis of talking environments as the neutral talking environment and the emotional talking environments. Speaker recognition performance faces drastic challenges, especially when the speaker identity passes through the human-computer interface in emotional talking environments. Speaker recognition applications in security systems are widening their base into the banking sector, the customer care sector, and criminal investigation, and can be used as a security control measure to remotely access a server or for access to confidential library files on a server.

The process of "automatic speaker identification and verification" in stressful and emotional talking environments is a challenging area of research [1]. "Speaker identification" comprises two schemes in terms of sets: "closed set" and "open set" speaker identification. When the unknown speaker is presumed to be one among the database of known speakers, it becomes the scheme of a "closed set", while in the scheme of an "open set", the unfamiliar speaker might not necessarily be from the database of familiar speakers. Operational procedure divides speaker identification into "text-dependent", where the same text is uttered by the speaker in the training and testing phases, and "text-independent", where different texts are uttered by the speaker during the training and testing phases [2]. A perfect communication from a speaker depends not only on linguistic statements but also on the emotional aspects of the speaker. Identifying the emotional aspects of the speaker by the machine is still a challenge of the human-machine interface. Speech is always a perfect mix of linguistic notes linked with emotion along with its paralinguistic features. Emotion recognition from ...

“…For speaker-related tasks, uncertainty in features can be represented by several speaker models, among which Vector Quantization (VQ), Gaussian mixture models (GMMs) [1] and i-vector [2] are the most successful examples proposed in the past decades. Recently, deep Neural Networks (DNNs), especially Convolution Neural Networks (CNNs) also have been widely and successfully applied to extract deep features to represent speakers [3], [4], [5].…”

Section: Introduction

mentioning

confidence: 99%

Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification

Song, Zhang, Schuller, et al. 2018. 2018 International Joint Conference on Neural Networks (IJCNN)

The performance of speaker-related systems usually degrades heavily in practical applications largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noisy constraints, it selects noise invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that the NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Meanwhile, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyperparameters) or features, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks.

“…Various neural network-based approaches were proposed in [18], without considering different noise and handset conditions. Furthermore, other researchers have employed deep neural network (DNN) analysis for speaker identification [19]. In [20], the authors selected 100 speakers from the TIMIT and self-collected databases using novel fuzzy vector quantization (NFVQ) techniques to enhance the speaker identification system (SIS).…”

Section: Introduction

mentioning

confidence: 99%

Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects

Al-Kaltakchi, Woo, Dlay, et al. 2017. EURASIP J. Adv. Signal Process.

In this study, a speaker identification system is considered consisting of a feature extraction stage which utilizes both power normalized cepstral coefficients (PNCCs) and Mel frequency cepstral coefficients (MFCC). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712 type handset) upon identification performance. In particular, three NSN types with varying signal to noise ratios (SNRs) were tested corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely, mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; and 120 speakers were selected from each database to yield 3600 speech utterances. As recommendations from the study, mean fusion is found to yield overall best performance in terms of speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is overall best for original database recordings.
