Improving Short Utterance Speaker Recognition by Modeling Speech Unit Classes

Li, Lantian; Wang, Dong; Zhang, Chenhao; Zheng, Thomas Fang

doi:10.1109/taslp.2016.2544660

Cited by 44 publications

(24 citation statements)

References 25 publications


“…373 nearest neighbors, vector quantization [6], hidden Markov model (HMM) [9], Gaussian mixture model (GMM) [10], artificial neural network [4], and deep neural network (DNN) [11]. Of the various classifiers available, in this research we selected GMM as our baseline for speaker recognition.…”

Section: Development Of Quranic Reciter Identification System Using M (mentioning)

confidence: 99%

Development of Quranic Reciter Identification System using MFCC and GMM Classifier

Gunawan, Saleh, Kartiwi (2018), IJECE


Nowadays, many beautiful recitations of the Al-Quran are available. Quranic recitation has its own characteristics, and the problem of identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quran reciter identification system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM). In this paper, a database of five Quranic reciters is developed and used in the training and testing phases. We carefully randomized the database from various surahs in the Quran so that the proposed system is sensitive only to the reciter and not to the recited verses. Around 15 Quranic audio samples from 5 reciters were collected and randomized, of which 10 samples were used for training the GMM and 5 samples were used for testing. Results showed that our proposed system has a 100% recognition rate for the five reciters tested. Even when tested with unknown samples, the proposed system is able to reject them.


“…In several ASR, the clients are hesitant to provide sufficient voice data, especially for testing, in phone banking. In different circumstances, it is profoundly hard to gather adequate speech data, for instance in legal applications [9].…”

Section: A Challenges With Limited Speech Data In ASR (mentioning)

confidence: 99%

“…The recent research advocate if the speech data utilised during testing phase bring down 10% (from 20 sec of speech data to 2 sec of speech data) the performance of ASR degraded abruptly from 6.34% to 23.89% in terms of equal error rate (EER) [9]. In ASR application once testing speech data is less than 2 sec the performance of the system in terms of EER 35% has been reported by Mak et al [10].…”

Section: A Challenges With Limited Speech Data In ASR (mentioning)

confidence: 99%

“…What's more, a scorebased portion determination system has been proposed in [9], which estimates the superiority of every test speech portion taking into account an arrangement of companion models, and scores the test speech with the dependable fragments as it were. A relative EER attenuation of 22% was reported in ASR when the test speech data are shorter than 15 sec.…”

Section: Related Work (mentioning)

confidence: 99%


Denoising and Enhancement of Medical Images Using Wavelets in LabVIEW

Singh, Kumar, Kolluri (2015), IJIGSP


To date, speaker-specific feature extraction and modelling techniques in automatic speaker recognition (ASR) have been designed for a sufficient amount of speech data. Once the speech data is limited, ASR performance degrades drastically. ASR with limited speech data is a highly challenging task because the utterances are short. The main goal of ASR is to decide which of the registered speakers an incoming speaker is. This paper presents a comparison of three techniques for modelling the extracted speaker-specific information: (i) Fuzzy c-means (FCM), (ii) Fuzzy Vector Quantization 2 (FVQ2), and (iii) Novel Fuzzy Vector Quantization (NFVQ). Using these three modelling techniques, we developed a text-independent automatic speaker recognition system that is computationally modest and capable of recognizing a non-cooperative speaker. In this investigation, speaker recognition efficiency is compared for text-independent test and train utterances of less than 2 sec from the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database and a self-collected database. The efficiency of ASR is improved by 1% over the baseline by hiding the outliers and assigning them to their closest codebook vectors. The efficiency of the proposed modelling techniques is 98.8% and 98.1% for the TIMIT and self-collected databases, respectively.


“…Alternatively, several approaches have been proposed that leverage phonetic information to perform content matching. The work in Li et al (2016) proposes a GMM based subregion framework where speaker models are trained for each subregion defined by phonemes. Test utterances are then scored with subregion models.…”

Section: Introduction (mentioning)

confidence: 99%
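The subregion idea in the citation statement above (one speaker model per phoneme-defined class, with each test frame scored against the model of its own class) can be illustrated with a toy sketch. This is not the implementation of Li et al. (2016): it assumes one-dimensional features, a single Gaussian per class instead of a GMM, and frames that already carry a class label. All function names and data values are hypothetical.

```python
import math
from collections import defaultdict


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gaussian(values, var_floor=1e-3):
    """Maximum-likelihood mean and (floored) variance of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, var_floor)


def train_subregion_models(frames):
    """frames: list of (unit_class, feature). Returns {unit_class: (mean, var)}."""
    by_class = defaultdict(list)
    for unit_class, feature in frames:
        by_class[unit_class].append(feature)
    return {c: fit_gaussian(v) for c, v in by_class.items()}


def score_utterance(test_frames, speaker_models, background_models):
    """Average log-likelihood ratio; each frame is scored only by its class model."""
    total, count = 0.0, 0
    for unit_class, feature in test_frames:
        if unit_class not in speaker_models or unit_class not in background_models:
            continue  # class unseen at enrollment: skip the frame
        s_mean, s_var = speaker_models[unit_class]
        b_mean, b_var = background_models[unit_class]
        total += gaussian_logpdf(feature, s_mean, s_var)
        total -= gaussian_logpdf(feature, b_mean, b_var)
        count += 1
    return total / count if count else 0.0


# Hypothetical enrollment data for two speakers and two unit classes.
enroll_a = [("vowel", 1.0), ("vowel", 1.2), ("vowel", 0.8),
            ("nasal", -1.0), ("nasal", -1.2), ("nasal", -0.8)]
enroll_b = [("vowel", 3.0), ("vowel", 3.2), ("vowel", 2.8),
            ("nasal", 1.0), ("nasal", 1.2), ("nasal", 0.8)]

background = train_subregion_models(enroll_a + enroll_b)
model_a = train_subregion_models(enroll_a)
model_b = train_subregion_models(enroll_b)

# A two-frame test utterance that resembles speaker A.
test_utterance = [("vowel", 1.1), ("nasal", -0.9)]
score_a = score_utterance(test_utterance, model_a, background)
score_b = score_utterance(test_utterance, model_b, background)
print(score_a > score_b)
```

Because each frame is compared only with the model of its own class, a short test utterance is never penalized for phonetic content that it does not contain, which is the content-matching motivation the statement describes.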

Deep neural network based i-vector mapping for speaker verification using short utterances

Guo, Qian, et al. (2018), Speech Communication


Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector and probabilistic linear discriminant analysis (PLDA) based systems have become the standard in speaker verification applications, but they are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model (UBM) training methods for i-vector modeling using full-length and short utterance evaluation tasks. The two methods are Gaussian mixture model (GMM) based (denoted I-vector GMM) and deep neural network (DNN) based (denoted as I-vector DNN) methods. The results indicate that the I-vector DNN system outperforms the I-vector GMM system under various durations (from full length to 5 s). However, the performances of both systems degrade significantly as the duration of the utterances decreases. To address this issue, we propose two novel nonlinear mapping methods which train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vector can restore missing information and reduce the variance of the original short-utterance i-vectors. The proposed methods both model the joint representation of short and long utterance i-vectors: the first method trains an autoencoder first using concatenated short and long utterance i-vectors and then uses the pre-trained weights to initialize a supervised regression model from the short to long version; the second method jointly trains the supervised regression model with an autoencoder reconstructing the short utterance i-vector itself. Experimental results using the NIST SRE 2010 dataset show that both methods provide significant improvement and result in a 24.51% relative improvement in Equal Error Rates (EERs) from a baseline system. 
In order to learn a better joint representation, we further investigate the effect of a deep encoder with residual blocks, and the results indicate that the residual network can further improve the EERs of a baseline system by up to 26.47%. Moreover, in order to improve the short i-vector mapping to its long version, an additional vector, which represents the average value of phoneme posteriors across frames, is also added to the input, and results in a 28.43% improvement. When further testing the best-validated models of SRE10 on the Speaker In The Wild (SITW) dataset, the methods result in a 23.12% improvement on arbitrary-duration (1-5 s) short-utterance conditions.
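The improvements quoted in this abstract are relative EER reductions, which are easy to misread as absolute percentage-point drops. The sketch below shows the arithmetic under the usual definition, (baseline − new) / baseline. The abstract does not give the underlying EER values, so the numbers used here are hypothetical and the function name is illustrative.

```python
def relative_eer_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical values: an EER falling from 8.0% to 6.0% is a drop of
# 2 percentage points, but a 25% relative improvement.
improvement = relative_eer_improvement(8.0, 6.0)
print(improvement)
```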
