An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement

Nossier, Soha A.; Wall, Julie; Moniri, M.; Glackin, Cornelius; Cannings, Nigel

doi:10.3390/electronics10010017

Cited by 33 publications

(31 citation statements)

References 107 publications

“…In recent years, the development of deep learning technology has substantially improved the performance of speech processing algorithms such as automatic speech recognition (ASR) [ 1 , 2 ], speech separation [ 3 ], and speech enhancement [ 4 ]. Among them, ASR has been popularly deployed for voice-enabled information retrieval using artificial intelligence (AI) speakers and chatbots [ 5 , 6 , 7 , 8 ].…”

Section: Introduction (mentioning)

confidence: 99%

“…Traditionally, speech enhancement techniques have been developed to enhance speech quality for voice communications equipped with a single-channel microphone or a multi-channel microphone array [ 4 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 ]. To overcome the noise robustness of ASR, the developed speech enhancement algorithm can be used as a front-end of ASR.…”

Section: Introduction (mentioning)

confidence: 99%

“…To overcome the noise robustness of ASR, the developed speech enhancement algorithm can be used as a front-end of ASR. Among various types of speech enhancement algorithms, deep learning-based speech enhancement models achieved a superior performance compared to conventional statistical methods [ 4 ]. In particular, U-Net-based speech enhancement models showed better performance than other neural network architectures [ 28 ].…”

Section: Introduction (mentioning)

confidence: 99%

Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition

Lee

Kim²

2022

Sensors

In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first-step processing of the proposed joint optimization approach trains the front-end model only, and the second-step processing trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first-step processing only supports the goal of the front-end model. However, the first-step processing of the proposed approach works with a new loss function to make the front-end model support the goal of the back-end model. The proposed optimization approach was applied, here, to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based ASR model as a front-end and a back-end, respectively. Then, the performance of the proposed two-step joint optimization approach was evaluated on the LibriSpeech automatic speech recognition (ASR) corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. Consequently, it was shown from the ablation study that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the approach of separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed optimization approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Moreover, it was revealed that the proposed optimization approach achieved a lower average CER and WER, by 0.32% and 0.43%, respectively, than the conventional optimization approach under mismatched noise conditions.

“…The data-driven approach (Zhao et al 2018) of the deep neural network makes it more efficient and is responsive to untrained conditions and unseen noises. In the recent past, the commonly used techniques for supervised speech enhancement (Nossier et al 2021) technique include the mapping in the frequency domain or time-frequency masking. The speech signal is converted from the frequency domain to the time domain.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN

Hepsiba

Justin

2021

Soft Comput

Nowadays, deep neural network has become the prime approach for enhancing speech signals as it yields good results compared to the traditional methods. This paper describes the transformation in the enhanced speech signal by applying the deep convolutional neural network (Deep CNN), which can model nonlinear relationships and compare it with the Wiener filtering method, which is the best technique for speech enhancement among the traditional methods. Denoising is performed in the frequency domain and converted back to the time domain to analyze performance metrics such as speech quality and speech intelligibility. The speech quality is analyzed based on the signal to noise ratio (SNR) and perceptual evaluation of speech quality (PESQ). Speech intelligibility is analyzed by short-time objective intelligibility (STOI). Both the methods evaluated the denoised speech, and the analysis made on the results shows that the SNR of the conventional Wiener filtering method is much improved when compared with Deep CNN. However, the PESQ and STOI of Deep CNN-based enhanced speech outperform the Wiener filtering method. The performance metrics indicate that Deep CNN achieves better results than the conventional technique.

“…Nossier et al [20] have demonstrated a comparative analysis on the basis of three classes including the initially proposed Deep Multi-layer Perceptron (MLP), Convolutional Neural Networks (CNN), and Denoising Autoencoder (DAE). The work carried out investigates the impact of network hyperparameter changes and data arrangement on the performance together with the Lombard effect.…”

Section: Introduction (mentioning)

confidence: 99%

A Fully Connected Deep Neural Network approach with multiple sub-frame consideration and phase recompense for noise suppression

Nisa¹

2021

Preprint

In the speech communication process, the desirable speech needs to be addressed under the influence of noise encountered in diverse environments that degrade the speech quality and intelligibility. In opposition to the unfavorable scenario particularly lowered signal-to-noise ratio, the progress of traditional noise suppressive algorithms is hindered, introducing further distortion in speech, making them non-applicable for real-time applications. In order to reduce the complicacies of current algorithms, a hybrid approach for upgrading the quality together with intelligibility of speech is proposed for dealing with real-world hearing scenario. For improving the intelligibility of speech of interest, multiple sub-frame analysis using over-spectral subtractive factor with phase recompense approach is implemented on the multi-channel noise corrupted speech, yielding approximated speech spectrum, that constitutes the pre-processing stage. The approximated speech spectrum and clean speech spectrum forming the training set are further fed to Fully Connected Layered Deep Neural Network to reduce the mean square error with the incorporation of regression network resulting in improved quality for speech. The proposed hybrid network results in upgraded intelligibility and quality in speech signal with improved SNR measured in terms of Short-Time-Objective-Intelligibility (STOI) score, Perceptual-Evaluation-of-Speech-Quality (PESQ) score, Segmental SNR level, and Mean Square Error (MSE) in contrast to prior noise suppressive algorithms together with less complexity of the hybrid network.
