Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN

Nainan, Sumita; Kulkarni, Vaishali

doi:10.1007/s10772-020-09771-2

Cited by 23 publications

(6 citation statements)

References 43 publications

“…Multiple machine learning-based classifiers, including the GMM, hidden Markov model (HMM) [21], multilayer perceptron (MLP), k-nearest neighbor (k-NN) [22], support vector machine (SVM) [23], and random forest (RF), have been used by many researchers to identify speakers from audio data signals. These classifiers have been extensively used in speech-related applications, including automatic speaker identification and emotion recognition.…”

Section: Classification Methods (mentioning)

confidence: 99%

An Efficient Text-Independent Speaker Identification Using Feature Fusion and Transformer Model

Khan, Jahangir, Alroobaea et al. 2023

Computers, Materials & Continua

Automatic Speaker Identification (ASI) involves the process of distinguishing an audio stream associated with numerous speakers' utterances. Some common aspects, such as the framework difference, overlapping of different sound events, and the presence of various sound sources during recording, make the ASI task much more complex. This research proposes a deep learning model to improve the accuracy of the ASI system and reduce the model training time under limited computation resources. In this research, the performance of the transformer model is investigated. Seven audio features, chromagram, Mel-spectrogram, tonnetz, Mel-Frequency Cepstral Coefficients (MFCCs), delta MFCCs, delta-delta MFCCs and spectral contrast, are extracted from the ELSDSR, CSTR-VCTK, and Ar-DAD datasets. The evaluation of various experiments demonstrates that the best performance was achieved by the proposed transformer model using seven audio features on all datasets. For ELSDSR, CSTR-VCTK, and Ar-DAD, the highest attained accuracies are 0.99, 0.97, and 0.99, respectively. The experimental results reveal that the proposed technique can achieve the best performance for ASI problems.

“…The block diagram for CNN architecture has be explained as in figure 3. Using differential features, along with high level convolution features contain sufficient speaker information and yielded better results for ASR [51]. The convolution layers consist of a set of filters or kernels which moves across the input image information in a specified manner to perform convolution. The computation for a convolution layer in the lth layer is given in equation (9).…”

Section: CNN for Feature Extraction (mentioning)

confidence: 99%

Multimodal Speaker Recognition using voice and lip movement with decision and feature level fusion

Nainan, Kulkarni 2024

Preprint

Self Cite

The speech generation mechanism is fundamentally bimodal in nature. It is an audio and visual representation. Including visual information obtained from the lip movement of a speaker, in addition to the voice, is hence justified for a text-independent automatic speaker recognition system (ASR). Additionally, lip movement information is invariant to acoustic noise perturbation, making the system more robust. Hence, we were motivated to design a dynamic audio-visual speaker recognition system. The objective of this research is to identify a speaker from their voice regardless of the spoken content and strengthen the accuracy of recognition. Classical methods and state-of-the-art neural networks have been employed to accomplish this. The learning model for voice modality was computed by concatenating dynamic features to the handcrafted features, which were further optimized using Fisher score technique, leading to improvement in speaker recognition. Support Vector Machines (SVM) and Convolution Neural Network (CNN) classifiers gave a competitive accuracy of 94.77%. For extracting information from the lip movement, Histogram of Gradient (HOG) feature detector algorithm was implemented on the image frames obtained from the video. Unique lip movements were also computed from the mouth region landmark points of Facial Landmarks. The multimodal framework was accomplished by feature level fusion of voice and lip features with CNN as classifier. The significance of the proposed work lies in the novel use of CNN for speech features. The authors have successfully demonstrated that lip movement features help in liveness detection along with automatic speaker recognition (ASR). The proposed method achieves 91.4% testing accuracy in comparison to the state-of-the-art method.

“…But only ten speakers were used for evaluation, and each utterance contained only one word. Nainan Kulkarni et al [20] evaluated 1D CNN, SVM and GMM based on dynamic MFCC features. The 1D CNN-based model achieved a validation accuracy of about 73.25% on the VidTimit dataset.…”

Section: Literature Review (mentioning)

confidence: 99%

Enhancement in Speaker Identification through Feature Fusion using Advanced Dilated Convolution Neural Network

Pentapati, Sridevi 2023

Int. j. electr. comput. eng. syst. (Online)

There are various challenges in identifying speakers accurately. The extraction of discriminative features is a vital task for accurate speaker identification. Nowadays, speaker identification is widely investigated using deep learning. Complex and noisy speech data affects the performance of Mel Frequency Cepstral Coefficients (MFCC); hence, MFCC fails to represent the speaker characteristics accurately. In this proposed work, a novel text-independent speaker identification system is developed to enhance the performance by fusion of Log-MelSpectrum and excitation features. The excitation information is obtained due to the vibration of vocal folds, and it is represented using Linear Prediction (LP) residual. The various types of features extracted from the excitation are residual phase, sharpness, Energy of Excitation (EoE), and Strength of Excitation (SoE). The extracted features were processed with the dilated convolution neural network (dilated CNN) to fulfill the identification task. The extensive evaluation showed that the fusion of excitation features gives better results than the existing methods. The accuracy reaches 94.12% for 11 complex classes and 91.34% for 80 speakers, and Equal Error Rate (EER) is reduced to 1.16% for the proposed model. The proposed model is tested with the Librispeech corpus using Matlab 2021b tool, outperforming the existing baseline models. The proposed model achieves an accuracy improvement of 1.34% compared to the baseline system.
