A new feature set for masking-based monaural speech separation

Pirhosseinloo, Shadi; Brumberg, Jonathan S.

doi:10.1109/acssc.2018.8645469

Cited by 7 publications

(3 citation statements)

References 14 publications


“…In the time domain, it is common to use the original representation of the waveform or use short time frames and extract some features, such as energy, entropy, and the Zero Crossing Rate (ZCR) [18]. While, in the frequency domain, many meaningful features can be extracted, including Short Time Fourier Transform (STFT), Mel-Frequency Cepstral Coefficients (MFCC) [19], Gammatone Frequency (GF), Gammatone Frequency Cepstral Coefficients (GFCC) [20], and Perceptual Linear Prediction (PLP) [21].…”

Section: Data Structure (mentioning)

confidence: 99%

An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement

et al. 2020


Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks are being proposed, showing promising results for improving overall speech perception. The Deep Multilayer Perceptron, Convolutional Neural Networks, and the Denoising Autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types in order to show the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison includes evaluating the performance in terms of the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and, processing time. Further analysis is then provided while using two different approaches. The first approach investigates how the performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect. While the second approach interprets the results by visualizing the spectrogram of the output layer of all the investigated models, and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation is performed for supervised deep learning-based speech enhancement while using SWOC analysis, to discuss the technique’s Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance. This work facilitates the development of better deep neural networks for speech enhancement in the future.
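The citation statement for this entry lists time-domain features computed over short frames. As a minimal sketch (not the cited authors' implementation), two of them, short-time energy and the zero crossing rate, can be computed with NumPy as follows. The function names, frame length, and hop size are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames; trailing samples are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Hypothetical input: 1 s of a 100 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t + 0.1)
tone_frames = frame_signal(tone, frame_len=400, hop=200)
energy = short_time_energy(tone_frames)
zcr = zero_crossing_rate(tone_frames)
```

A 100 Hz tone crosses zero about 200 times per second, so at 8 kHz the per-frame rate should sit near 0.025.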


“…There are other features that can be extracted from the spectrogram, such as the power spectrum, which shows the distribution of the power of the frequency components of the speech; Mel spectrum, which represents the spectrum in the Mel scale; and log power spectrum, in which the log operation is performed to the power spectrum in order to decrease the dynamic range, and ease the training process [18]. Mel-Frequency Cepstral Coefficients (MFCC) is another feature extracted by applying a Discrete Cosine Transform (DCT) to the log-compressed Mel scale power spectrum.…”

Section: A Spectrogram Based T-f Mapping Targets (mentioning)

confidence: 99%

Mapping and Masking Targets Comparison using Different Deep Learning based Speech Enhancement Architectures

Nossier, Wall, Moniri, et al. 2020

2020 International Joint Conference on Neural Networks (IJCNN)


Mapping and Masking targets are both widely used in recent Deep Neural Network (DNN) based supervised speech enhancement. Masking targets are proved to have a positive impact on the intelligibility of the output speech, while mapping targets are found, in other studies, to generate speech with better quality. However, most of the studies are based on comparing the two approaches using the Multilayer Perceptron (MLP) architecture only. With the emergence of new architectures that outperform the MLP, a more generalized comparison is needed between mapping and masking approaches. In this paper, a complete comparison will be conducted between mapping and masking targets using four different DNN based speech enhancement architectures, to work out how the performance of the networks changes with the chosen training target. The results show that there is no perfect training target with respect to all the different speech quality evaluation metrics, and that there is a tradeoff between the denoising process and the intelligibility of the output speech. Furthermore, the generalization ability of the networks was evaluated, and it is concluded that the design of the architecture restricts the choice of the training target, because masking targets result in significant performance degradation for deep convolutional autoencoder architecture.
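The citation statement for this entry names several spectrogram-derived features: the power spectrum, the log power spectrum, the Mel spectrum, and MFCCs obtained by applying a DCT to the log-compressed Mel power spectrum. The sketch below illustrates that chain with NumPy only. It is not the cited paper's code; the filterbank design (triangular filters, the common 2595/700 Mel formula), the Hann window, and all parameter defaults are assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

def spectral_features(frames, sr, n_fft=512, n_mels=26, n_mfcc=13, eps=1e-10):
    """Return (power, log_power, mel, mfcc) for frames of shape (n_frames, frame_len).

    frame_len should not exceed n_fft, otherwise rfft truncates each frame.
    """
    window = np.hanning(frames.shape[1])
    spectrum = np.fft.rfft(frames * window, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2                 # power spectrum
    log_power = np.log(power + eps)               # log compresses the dynamic range
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    mfcc = dct_ii(np.log(mel + eps), n_mfcc)      # DCT of the log Mel power spectrum
    return power, log_power, mel, mfcc
```

The small constant `eps` only guards against taking the log of zero; its value is an assumption, not something the source specifies.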


“…The ideal binary mask (IBM) [3], ideal ratio mask (IRM) [4] were proposed as training targets for masking based supervised speech separation, while target magnitude spectrum (TMS) [5] was used as a training target in mapping based supervised speech separation. Furthermore, recent studies have examined the effect of different input acoustic features (e.g., gammatone based features versus spectral features) [6,7] on supervised speech separation in noisy and reverberant condition.…”

Section: Introduction (mentioning)

confidence: 99%

Monaural Speech Enhancement with Dilated Convolutions

Pirhosseinloo, Brumberg 2019

Interspeech 2019

Self Cite


In this study, we propose a novel dilated convolutional neural network for enhancing speech in noisy and reverberant environments. The proposed model incorporates dilated convolutions for tracking a target speaker through context aggregations, skip connections, and residual learning for mapping-based monaural speech enhancement. The performance of our model was evaluated in a variety of simulated environments having different reverberation times and quantified using two objective measures. Experimental results show that the proposed model outperforms a long short-term memory (LSTM), a gated residual network (GRN) and convolutional recurrent network (CRN) model in terms of objective speech intelligibility and speech quality in noisy and reverberant environments. Compared to LSTM, CRN and GRN, our method has improved generalization to untrained speakers and noise, and has fewer training parameters resulting in greater computational efficiency.
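The citation statement for this entry refers to the ideal binary mask (IBM) and ideal ratio mask (IRM) as masking-based training targets. The following sketch uses their commonly cited definitions, which may differ in detail from those in the cited papers: the IBM thresholds the local SNR of each time-frequency unit, and the IRM is a power ratio raised to an exponent. The 0 dB threshold `lc_db`, the exponent `beta=0.5`, and the function names are assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0, eps=1e-10):
    """1 where the local SNR (in dB) exceeds the threshold lc_db, else 0."""
    snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (snr_db > lc_db).astype(np.float64)

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-10):
    """Soft mask in [0, 1]: (S / (S + N)) ** beta for each time-frequency unit."""
    return (speech_power / (speech_power + noise_power + eps)) ** beta

# Hypothetical 2x2 time-frequency power values for clean speech and noise.
speech = np.array([[4.0, 1.0], [0.0, 9.0]])
noise = np.array([[1.0, 4.0], [1.0, 0.0]])
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

Both masks require separate access to the clean speech and noise powers, which is why they serve as training targets rather than as quantities available at test time.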
