A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation

Al-Qaderi, Mohammad K.; Lahamer, Elfituri; Rad, A.B.

doi:10.3390/s21155097

Cited by 11 publications

(17 citation statements)

References 62 publications


“…Feature extraction is accomplished by changing the speech waveform to a form of parametric representation at a relatively lesser data rate for subsequent processing and analysis [ 11 , 12 , 13 , 14 ]. Feature extraction approaches usually yield a multidimensional feature vector for every speech signal.…”

Section: Related Work (mentioning)

confidence: 99%

Improved Feature Parameter Extraction from Speech Signals Using Machine Learning Algorithm

Abdusalomov

Safarov

Rakhimov

et al. 2022

Sensors


Speech recognition refers to the capability of software or hardware to receive a speech signal, identify the speaker’s features in the speech signal, and recognize the speaker thereafter. In general, the speech recognition process involves three main steps: acoustic processing, feature extraction, and classification/recognition. The purpose of feature extraction is to illustrate a speech signal using a predetermined number of signal components. This is because all information in the acoustic signal is excessively cumbersome to handle, and some information is irrelevant in the identification task. This study proposes a machine learning-based approach that performs feature parameter extraction from speech signals to improve the performance of speech recognition applications in real-time smart city environments. Moreover, the principle of mapping a block of main memory to the cache is used efficiently to reduce computing time. The block size of cache memory is a parameter that strongly affects the cache performance. In particular, the implementation of such processes in real-time systems requires a high computation speed. Processing speed plays an important role in speech recognition in real-time systems. It requires the use of modern technologies and fast algorithms that increase the acceleration in extracting the feature parameters from speech signals. Problems with overclocking during the digital processing of speech signals have yet to be completely resolved. The experimental results demonstrate that the proposed method successfully extracts the signal features and achieves seamless classification performance compared to other conventional speech recognition algorithms.



“…It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch linearly. The Mel Filter is approximately a linear frequency spacing below 1 kHz and a logarithmic spacing above 1 kHz [4].…”

Section: Windowing (mentioning)

confidence: 99%
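The near-linear behaviour below 1 kHz and logarithmic behaviour above it, as described in the statement above, can be illustrated with a short sketch. It assumes the widely used conversion mel = 2595 · log10(1 + f/700); the quoted passage does not say which formula the citing paper uses, so the constants and the helper names `hz_to_mel`, `mel_to_hz`, and `mel_filter_centers` are illustrative assumptions, not the authors' implementation.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed formula: 2595*log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_centers(f_min: float, f_max: float, n_filters: int) -> list:
    """Return filter centre frequencies (Hz) that are equally spaced on the mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    # Hypothetical settings: 10 filters covering 0 to 8000 Hz.
    for center in mel_filter_centers(0.0, 8000.0, 10):
        print(round(center, 1))
```

Because the centres are equally spaced in mels, their spacing in Hz widens as frequency increases, which is why a mel filter bank resolves low frequencies more finely than high ones.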

“…In addition to fusion of feature extraction techniques, fusion of different types of classifiers are applied to improve the performance of speaker recognition. In [4] fusion of GMM and SVM is used to develop a speaker recognition system with MFCC and GFCC feature extraction techniques. In the study [9] speaker verification is developed by using fusion of GMM and ANN models with GFCC features.…”

Section: Related Work (mentioning)

confidence: 99%

“…It has important applications in the areas like: Access control, forensic science, surveillance, law enforcement and financial areas for the purpose of identification, detection, segmenting and clustering [2]. Previously, most of the speaker recognition systems are developed using a machine learning classifiers like Gaussian Mixture Model (GMM) [3], Support Vector Machine (SVM) [4], i-vector [5] and Hidden Markov Model (HMM). These classifiers use handcrafted features mainly Mel Frequency Cepstral Coefficient (MFCC) [3], Gammatone Frequency Cepstral Coefficient (GFCC) [6] and Linear Predictive Cepstral Coefficient (LPCC) [7] for speaker enrollment and recognition.…”

Section: Introduction (mentioning)

confidence: 99%


Fusion of Cochleogram and Mel Spectrogram Features for Deep Learning Based Speaker Recognition

Lambamo

Srinivasa

Jifara

2022

Preprint


Speaker recognition has crucial application in forensic science, financial areas, access control, surveillance and law enforcement. The performance of speaker recognition get degraded with the noise, speakers physical and behavioral changes. Fusion of Mel Frequency Cepstral Coefficient (MFCC) and Gammatone Frequency Cepstral Coefficient (GFCC) features are used to improve the performance of machine learning based speaker recognition systems in the noisy condition. Deep learning models, especially Convolutional Neural Network (CNN) and its hybrid approaches outperform machine learning approaches in speaker recognition. Previous CNN based speaker recognition models has used Mel Spectrogram features as an input. Even though, Mel Spectrogram features show better performance compared to the handcrafted features, its performance get degraded with noise and behavioral changes of speaker. In this work, a CNN based speaker recognition model is developed using fusion of Mel Spectrogram and Cochleogram feature as input. The speaker recognition performance of the fusion of Mel Spectrogram and Cochleogram features is compared with the performance of Mel Spectrogram and Cochleogram features without fusing. The train-clean-100 part of the LibriSpeech dataset, which consists of 251 speakers (126 male and 125 female speakers) and 28,539 utterances is used for the experiment of proposed model. CNN model is trained and evaluated for 20 epochs using training and validation data respectively. Proposed speaker recognition model which uses fusion of Mel Spectrogram and Cochleogram as input for CNN has accuracy of 99.56%. Accuracy of CNN based speaker recognition with Mel Spectrogram is 98.15% and Cochleogram features is 97.43%. The results show that fusion of Mel Spectrogram and Cochleogram features improve the performance of speaker recognition.


“…Some well-designed metric learning losses have been exploited to train the entire SV system in an end-to-end fashion, such as triplet loss [ 9 , 11 ], generalized end-to-end (GE2E) loss [ 18 ], and cluster-range loss [ 4 ]. Besides, many studies on robust features [ 20 , 21 ] and hybrid models [ 21 , 22 ] have been conducted to further improve the performance of traditional and DNN-based speaker recognition systems. In recent years, CNN has drawn much attention in this research field.…”

Section: Introduction (mentioning)

confidence: 99%

Attention-Based Temporal-Frequency Aggregation for Speaker Verification

Wang

Feng

et al. 2022

Sensors


Convolutional neural networks (CNNs) have significantly promoted the development of speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component, and it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most of the existing aggregation methods aggregate the extracted features across time and cannot capture the speaker-dependent information contained in the frequency domain. To handle this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that provide more information for utterance-level representation. Meanwhile, two more effective temporal-frequency aggregation methods are proposed in combination with the existing temporal aggregation methods. The two proposed methods can capture the speaker-dependent information contained in both the time domain and frequency domain of frame-level features, thus improving the discriminability of speaker embedding. Besides, a powerful CNN-based SV system is developed and evaluated on the TIMIT and Voxceleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves a superior equal error rate of 5.96% on Voxceleb compared with the state-of-the-art baseline models.
