2019
DOI: 10.1109/taslp.2019.2940662
Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition

Cited by 79 publications (36 citation statements) | References 41 publications
“…To solve the over-smoothing and speech-information-loss problems at low SNRs, Chai et al. propose a prediction error model based on a generalized Gaussian distribution (GGD) for DNN-based SE [17]. Tu et al. propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise-tracking capability of the improved MCRA and a DNN model of the nonlinear interactions between speech and noise [18]. To leverage long-term contexts for tracking a target speaker, Tan et al. present a novel convolutional neural network (CNN) architecture for monaural SE [19].…”
Section: In the 1980s Ephraim and Malah Proposed the Minimum
Citation type: mentioning (confidence: 99%)
“…Specifically, during training, a smaller value of the loss function does not necessarily yield a lower WER. Discriminative training can alleviate this problem by drawing on solutions from traditional speech recognition systems [41]–[43]. In this paper, the Maximum Mutual Information (MMI) criterion is used for discriminative training.…”
Section: F. Discriminative Training
Citation type: mentioning (confidence: 99%)
“…Deep learning has bridged the gap between what humans perceive and what a computer understands. It has significantly improved speech recognition [33]–[42]. These approaches aim to make the computer think like a human.…”
Section: I
Citation type: mentioning (confidence: 99%)