Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers

Tu, Yan-Hui; Du, Jun; Xu, Yong; Dai, Li-Rong; Lee, Chin‐Hui

doi:10.1109/iscslp.2014.6936615

Cited by 56 publications

(32 citation statements)

References 19 publications

“…Contrary to [2,9], where DNNs are trained on raw spectral features, we train the DNN on SNMF activation coefficients. Hence, to evaluate the influence of the input features of the DNN, we introduce a variant of our framework denoted (DNN-SNMF-Spec), where the DNN is learned on spectral features to predict activation coefficients, and uses the modified cost function computed on signal reconstruction.…”

Section: Methods (mentioning)

confidence: 99%

“…The activation coefficients are then used as input features of the DNN, instead of raw spectral coefficients as in [9] or the log spectrum in [2]. For each frame of noisy speech (at index position t), we build a large vector composed of the concatenation of the activation coefficients of the speech $\hat{h}_{S,t}$ and noise $\hat{h}_{N,t}$ vectors extracted on each frame of an analysis window of width (2K + 1) frames centred on the t-th frame.…”

Section: Feature Extraction Using Supervised SNMF (mentioning)

confidence: 99%
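
The context-window construction described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the function name `stack_context`, the array shapes, and the edge handling (repeating the first and last frame at the utterance boundaries) are all assumptions, since the quoted text does not specify them.

```python
import numpy as np


def stack_context(acts_speech, acts_noise, K):
    """Build one input vector per frame from a (2K + 1)-frame window.

    acts_speech: array of shape (T, R_s), speech activations per frame.
    acts_noise:  array of shape (T, R_n), noise activations per frame.
    Returns an array of shape (T, (2K + 1) * (R_s + R_n)).
    """
    # Per-frame feature: speech and noise activations side by side.
    feats = np.concatenate([acts_speech, acts_noise], axis=1)
    # Assumption: pad by repeating the boundary frames so that every
    # frame t has K neighbours on each side.
    padded = np.pad(feats, ((K, K), (0, 0)), mode="edge")
    n_frames = feats.shape[0]
    # Row t of `padded` corresponds to original frame t - K, so the slice
    # [t, t + 2K + 1) is the window centred on original frame t.
    windows = [padded[t:t + 2 * K + 1].reshape(-1) for t in range(n_frames)]
    return np.stack(windows)


if __name__ == "__main__":
    speech = np.random.rand(100, 40)  # hypothetical: 100 frames, 40 speech atoms
    noise = np.random.rand(100, 20)   # hypothetical: 20 noise atoms
    x = stack_context(speech, noise, K=3)
    print(x.shape)
```

With these hypothetical sizes, each frame yields a vector of length (2·3 + 1)·(40 + 20) = 420, which would then serve as one DNN input.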

“…This topic has been studied for more than 50 years, and has produced successful approaches, especially statistics-based methods [1] able to efficiently reduce the contribution of noise in degraded signals as long as the stationary noise assumption is respected. More recently, several works based on machine learning algorithms such as Sparse Non-negative Matrix Factorization (SNMF) and Deep Neural Networks (DNN) have achieved significant improvements for non-stationary noises [2,3].…”

Section: Introduction (mentioning)

confidence: 99%

“…DNN-based SE [2,10] relies on the ability of Deep Neural Networks to estimate complex non-linear functions used to directly map log spectral features of noisy speech into corresponding clean speech signals and therefore may be more efficient in separating noise and speech in case of overlapping sub-domains. Temporal dependencies of speech are usually considered by extracting features on sliding context windows.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our proposal has been evaluated both for the tasks of Speech Enhancement and Automatic Speech Recognition (ASR) using several objective metrics. Our results have been systematically compared to several state-of-the-art DNN and SNMF-based SE systems [2,3,4,9]. Evaluations have been conducted using the framework provided recently by the CHiME-3 challenge [11] on speech separation and recognition on challenging real noisy speech recordings.…”

Section: Introduction (mentioning)

confidence: 99%

Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition

Bigot

Chng

2016

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Sparse Non-negative Matrix Factorization (SNMF) and Deep Neural Networks (DNN) have emerged individually as two efficient machine learning techniques for single-channel speech enhancement. Nevertheless, only a few works have investigated the combination of SNMF and DNN for speech enhancement and robust Automatic Speech Recognition (ASR). In this paper, we present a novel combination of speech enhancement components based on SNMF and DNN into a full-stack system. We refine the cost function of the DNN to back-propagate the reconstruction error of the enhanced speech. Our proposal is compared with several state-of-the-art speech enhancement systems. Evaluations are conducted on the data of the CHiME-3 challenge, which consists of real noisy speech recordings captured under challenging noisy conditions. Our system yields significant improvements on objective speech enhancement quality measures, with a relative gain of 30%, and a 10% relative Word Error Rate reduction for ASR compared to the best baselines.
