Enhancing the correlation between the quality and intelligibility objective metrics with the subjective scores by shallow feed forward neural network for time–frequency masking speech separation algorithms

Gul, Sania; Khan, Muhammad Salman; Yoma, Néstor Becerra; Shah, Syed Waqar; Sheheryar,

doi:10.1016/j.apacoust.2021.108539

Cited by 5 publications

(4 citation statements)

References 31 publications

“…SRMR is a non-intrusive metric, so no reference signal is required for its estimation [80], whereas the rest of the metrics are intrusive metrics and thus require the clean speech sample as a reference for the performance evaluation [80]. Among these metrics, PESQ and STOI are known to correlate well with the human perception of quality and intelligibility [81]. SRMR metric is commonly used to evaluate speech dereverberation algorithms and reflect the quality and intelligibility of the reverberant speech [57].…”

Section: Methods (mentioning)

confidence: 99%

“…SRMR metric is commonly used to evaluate speech dereverberation algorithms and reflect the quality and intelligibility of the reverberant speech [57]. SDR shows the estimated speech quality by comparing the estimated signal energy with all kinds of distortions [81]. CD measures the similarity between short-time spectra of the estimated and clean speech [81].…”

Section: Methods (mentioning)

confidence: 99%

“…SDR shows the estimated speech quality by comparing the estimated signal energy with all kinds of distortions [81]. CD measures the similarity between short-time spectra of the estimated and clean speech [81].…”

Section: Methods (mentioning)

confidence: 99%
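
The statements above describe SDR as a comparison of signal energy against the energy of all distortions. A minimal sketch of that idea follows, assuming the simplest energy-ratio form (reference energy over residual energy, in dB). This is not necessarily the exact SDR variant used in the citing paper; the function name `simple_sdr` and the `eps` guard are illustrative choices, not taken from the source.

```python
import numpy as np


def simple_sdr(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB (illustrative; not the full BSS Eval definition).

    Everything in `estimate` that differs from `reference` is treated as
    distortion, so interference, noise, and artifacts are lumped together.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if reference.shape != estimate.shape:
        raise ValueError("reference and estimate must have the same shape")

    distortion = reference - estimate
    signal_energy = np.sum(reference ** 2)
    distortion_energy = np.sum(distortion ** 2)
    # eps avoids division by zero and log of zero for a perfect estimate
    return 10.0 * np.log10((signal_energy + eps) / (distortion_energy + eps))


# Usage: a 440 Hz tone whose estimate is scaled by 10 percent
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
estimated = 1.1 * clean
sdr_db = simple_sdr(clean, estimated)
```

Because this form is intrusive, it needs the clean reference signal, which matches the distinction the citing paper draws between SRMR and the other metrics.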

Triple-0: Zero-shot denoising and dereverberation on an end-to-end frozen anechoic speech separation network

Gul, Khan, Ur-Rehman (2024), PLoS ONE

Speech enhancement is crucial for both human and machine listening applications. Over the last decade, the use of deep learning for speech enhancement has resulted in tremendous improvement over the classical signal processing and machine learning methods. However, training a deep neural network is not only time-consuming; it also requires extensive computational resources and a large training dataset. Transfer learning, i.e. using a pretrained network for a new task, comes to the rescue by reducing the amount of training time, computational resources, and the required dataset, but the network still needs to be fine-tuned for the new task. This paper presents a novel method of speech denoising and dereverberation (SD&D) on an end-to-end frozen binaural anechoic speech separation network. The frozen network requires neither any architectural change nor any fine-tuning for the new task, as is usually required for transfer learning. The interaural cues of a source placed inside noisy and echoic surroundings are given as input to this pretrained network to extract the target speech from noise and reverberation. Although the pretrained model used in this paper has never seen noisy reverberant conditions during its training, it performs satisfactorily for zero-shot testing (ZST) under these conditions. This is because the pretrained model has been trained on the direct-path interaural cues of an active source, so it can recognize them even in the presence of echoes and noise. ZST on the same dataset on which the pretrained network was trained (homo-corpus), for the unseen class of interference, has shown considerable improvement over the weighted prediction error (WPE) algorithm in terms of four objective speech quality and intelligibility metrics. The proposed model also offers performance similar to that of a deep learning SD&D algorithm for this dataset under varying conditions of noise and reverberation. Similarly, ZST on a different dataset has provided an improvement in intelligibility and quality almost equivalent to that provided by the WPE algorithm.

“…Also, this model is restricted for anechoic conditions. This problem was resolved in [113] by using SONET with Expectation Maximization (EM) (a machine learning algorithm), which outperforms its constituent systems, both under anechoic and reverberant conditions, as indicated by the results of subjective listening tests in [114]. The most interesting fact about the SSS model in [113] is that it uses the anechoic pre-trained model 'SONET', without any need for retraining, to tackle the echoes.…”

Section: VIII. Speech Source Separation (SSS) (mentioning)

confidence: 99%

A Survey of Audio Enhancement Algorithms for Music, Speech, Bioacoustics, Biomedical, Industrial, and Environmental Sounds by Image U-Net

Gul, Khan (2023), IEEE Access. Self-citation.

The recent surge in the use of Deep Neural Networks (DNNs) has also made its mark in the field of Audio Enhancement (AE), providing much better quality than the classical methods. Although there are dedicated audio processing DNNs, many recent AE models have utilized U-Net: a DNN based on the Convolutional Neural Network (CNN), fundamentally developed for image segmentation. It is found that the useful features hidden in the time domain are highlighted when the audio signal is converted to a spectrogram, which can be treated as an image. In this article, we will review the recent work utilizing U-Nets for different AE applications. Unlike other published reviews, this review focuses entirely on AE techniques based on image U-Nets. We will discuss the need for AE, U-Net comparison to other DNNs, the benefits of converting the audio to 2D, input representations that are useful for different AE applications, the architecture of vanilla U-Net and the pre-trained models, variations in vanilla architecture incorporated in different AE models, and the state-of-the-art AE algorithms based on U-Net in various applications. Apart from speech and music, this article discusses a wide range of audio signals, e.g. environmental, biomedical, bioacoustics, and industrial sounds, not covered collectively in a single article in previously published studies. The article ends with a discussion of colored spectrograms in future AE applications.

Index Terms: CNNs; image processing deep neural networks; pre-trained networks; spectrogram; U-Net.

LaSNet: An end-to-end network based on steering vector filter for sound source localization and separation

Yang, Zhang, et al. (2023), Applied Acoustics
