Noise Robust Speaker Recognition Based on Adaptive Frame Weighting in GMM for i-Vector Extraction

Zhang, Xingyu; Zou, Xia; Sun, Meng; Zheng, Thomas Fang; Jia, Chong; Wang, Yimin

doi:10.1109/access.2019.2901812

Cited by 22 publications

(9 citation statements)

References 29 publications


“…The noise reduction method in this study uses adaptive noise-canceling (ANC) with the least mean square (LMS) algorithm [18], [19]. This method has a simple and reliable structure [20], [21]. The structure of the LMS algorithm is shown in Figure 2(a).…”

Section: Adaptive Noise-canceling (mentioning)

confidence: 99%

Feature extraction with mel scale separation method on noise audio recordings

Huizen

Kurniati

2021

IJEECS


This paper focuses on improving accuracy on noisy audio recordings. For high-quality recordings, feature extraction with the mel frequency cepstral coefficients (MFCC) method produces high accuracy, whereas for low-quality recordings noise lowers the accuracy. Accuracy is improved by investigating the effect of bandwidth on the mel scale. The proposed improvement separates the mel scale into two frequency channels (MFCC dual-channel); the comparison method uses the mel scale bandwidth without separation (MFCC single-channel). Features are analyzed using k-means clustering. The data uses a noise variance of up to -16 dB. At -16 dB noise, the MFCC single-channel method has an accuracy of 47.5%, while the MFCC dual-channel method has a better accuracy of 76.25%. The next test used adaptive noise-canceling (ANC) to reduce noise before extraction: the MFCC single-channel method has an accuracy of 82.5% and the MFCC dual-channel method a better accuracy of 83.75%. On high-quality audio recordings, the MFCC single-channel method has an accuracy of 92.5% and the MFCC dual-channel method a better accuracy of 97.5%. The test results show that mel scale bandwidth affects accuracy, and the MFCC dual-channel method has the higher accuracy.
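The citation statement for this entry refers to adaptive noise-canceling (ANC) with the least mean square (LMS) algorithm. As a rough illustration only, and not the cited authors' implementation, the textbook LMS update can be sketched as follows. The function name `lms_anc`, the parameters `num_taps` and `mu`, and the synthetic signals are all assumptions made for this example.

```python
import numpy as np

def lms_anc(primary, reference, num_taps=8, mu=0.01):
    """Textbook LMS adaptive noise canceler (illustrative sketch).

    primary   : desired signal plus noise
    reference : noise-only signal correlated with the noise in `primary`
    Returns the error signal (the cleaned estimate) and the final filter weights.
    """
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(num_taps)          # adaptive FIR filter weights
    e = np.zeros(len(primary))      # error signal = cleaned output
    for n in range(len(primary)):
        # Most recent reference samples, newest first, zero-padded at the start.
        x = np.zeros(num_taps)
        k = min(num_taps, n + 1)
        x[:k] = reference[n::-1][:k]
        y = w @ x                   # current estimate of the noise in `primary`
        e[n] = primary[n] - y       # subtract the noise estimate
        w += mu * e[n] * x          # LMS weight update
    return e, w

# Synthetic demo: a sine "signal" corrupted by filtered white noise.
rng = np.random.default_rng(0)
num_samples = 4000
signal = np.sin(2 * np.pi * 0.01 * np.arange(num_samples))
noise = rng.standard_normal(num_samples)
noisy = signal + np.convolve(noise, [0.6, 0.3])[:num_samples]
cleaned, weights = lms_anc(noisy, noise, num_taps=4, mu=0.01)
```

The step size `mu` trades convergence speed against steady-state error; the value used here is only a plausible choice for unit-variance reference noise.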


“…From the perspective of the speech recognition model, the application of speech signal can be roughly divided into three categories, including vocal print recognition, speech recognition, and emotion recognition [8]. The classifiers for speech recognition tasks include traditional classifiers and deep learning algorithms, involving HMM, Gaussian Mixture Model (GMM), support vector machine (SVM), and extreme learning machine (ELM) [9][10][11]. At present, the role of acoustic parameters is analyzed in the objective evaluation of artistic vocal, and methods of it are proposed based on error back propagation (BP) and learning vector quantization (LVQ) [12].…”

Section: Introduction (mentioning)

confidence: 99%

Objective Evaluation Method of Broadcasting Vocal Timbre Based on Feature Selection

Lan

et al. 2022

Wireless Communications and Mobile Computing


Broadcasting voice is used to convey ideas and emotions. In the selection process of broadcasting and hosting professionals, the vocal timbre is an important index. The subjective evaluation method is widely used, but the selection results have certain subjectivity and uncertainty. In this paper, an objective evaluation method of broadcasting vocal timbre is proposed. Firstly, the broadcasting vocal timbre database is constructed on Chinese phonetic characteristics. Then, the timbre feature selection strategy is presented based on human vocal mechanism, and the broadcast timbre characteristics are divided into three categories, which include source parameters, vocal tract parameters, and human hearing parameters. Finally, the three models of hidden Markov model (HMM), Gaussian Mixture Model-Universal Background Model (GMM-UBM), and long short-term memory (LSTM) are exploited to evaluate the timbre of the broadcast by extracting timbre features and four timbre feature combinations. The experiments show that the selection of timbre features is scientific and effective. Moreover, the accuracy of the LSTM network using the deep learning algorithm in the objective evaluation of the broadcast timbre is better than the traditional HMM and GMM-UBM, and the proposed method can achieve about 95% accuracy rate in our database.


“…Even though TI-SV is more challenging than TD-SV because of the phonetic variability, TI-SV is more convenient from a user point of view in that the user can speak freely to the system. Over the past decades, the i-vector approach [2] with probabilistic linear discriminant analysis (PLDA) [3] has been widely used for TI-SV [4]-[7]. The i-vector approach learns a low-dimensional representation containing both speaker and channel variability, through which a variable-length utterance can be represented as a fixed-dimensional i-vector.…”

Section: Introduction (mentioning)

confidence: 99%

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

et al. 2020


Speaker verification (SV) has recently attracted considerable research interest due to the growing popularity of virtual assistants. At the same time, there is an increasing requirement for an SV system: it should be robust to short speech segments, especially in noisy and reverberant environments. In this paper, we consider one more important requirement for practical applications: the system should be robust to an audio stream containing long non-speech segments, where a voice activity detection (VAD) is not applied. To meet these two requirements, we introduce feature pyramid module (FPM)-based multiscale aggregation (MSA) and self-adaptive soft VAD (SAS-VAD). We present the FPM-based MSA to deal with short speech segments in noisy and reverberant environments. Also, we use the SAS-VAD to increase the robustness to long non-speech segments. To further improve the robustness to acoustic distortions (i.e., noise and reverberation), we apply a masking-based speech enhancement (SE) method. We combine SV, VAD, and SE models in a unified deep learning framework and jointly train the entire network in an end-to-end manner. To the best of our knowledge, this is the first work combining these three models in a deep learning framework. We conduct experiments on Korean indoor (KID) and VoxCeleb datasets, which are corrupted by noise and reverberation. The results show that the proposed method is effective for SV in the challenging conditions and performs better than the baseline i-vector and deep speaker embedding systems.
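The citation statement for this entry describes the i-vector as a low-dimensional, fixed-dimensional representation of a variable-length utterance. As background only (the symbols below are the conventional ones from the total variability formulation and do not appear on this page), the model is usually written as:

```latex
% Total variability model (standard formulation; notation assumed, not from the page)
% M : utterance-dependent GMM mean supervector
% m : speaker- and channel-independent UBM mean supervector
% T : low-rank total variability matrix
% w : latent factor with a standard normal prior
\[
  M = m + T w, \qquad w \sim \mathcal{N}(0, I)
\]
```

The i-vector is the posterior mean of `w` given the utterance's frames, which is why its dimension is fixed by the rank of `T` regardless of utterance length.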
