“…In order to recover the underlying target speech embedded in noise, most deep neural networks, whether recurrent [4], [5], [10] or feedforward [4], [6], [8], [9], [11], are trained to optimize an objective function such as the mean squared error (MSE) between the true and predicted outputs. The inputs to the DNN are often (hybrid) features such as time-frequency (TF) domain spectral features [4]-[6], [8]-[10] and filterbank features [4], [5], [11], while the outputs can be TF-unit-level features that can be used to recover the speech source, such as ideal binary/ratio masks (IBM/IRM) [4]-[6], [11], direct magnitude spectra [9], [10], or their transforms such as log power (LP) spectra [8].…”
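
The training targets named in the passage can be illustrated with a minimal NumPy sketch. It uses the commonly cited textbook definitions of the IBM and IRM, which are not necessarily the exact formulations of the works cited above. The function names, the 0 dB local SNR threshold `lc_db`, the IRM exponent `beta`, and the toy magnitude values are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR of a TF unit exceeds lc_db (assumed threshold), else 0."""
    snr_db = 20.0 * np.log10((speech_mag + EPS) / (noise_mag + EPS))
    return (snr_db > lc_db).astype(np.float64)


def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """IRM: ratio of speech energy to total energy per TF unit, raised to beta (assumed 0.5)."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + EPS)) ** beta


def log_power(mag):
    """Log power (LP) spectrum of a magnitude spectrogram."""
    return np.log(mag ** 2 + EPS)


def mse(target, predicted):
    """Mean squared error between true and predicted outputs."""
    return float(np.mean((target - predicted) ** 2))


# Toy magnitude spectrograms (frequency bins x frames); values are made up.
speech = np.array([[2.0, 0.5], [1.0, 3.0]])
noise = np.ones((2, 2))

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
lp = log_power(speech)

# A network's predicted mask would be compared with the target like this;
# here a constant guess stands in for the network output.
loss = mse(irm, np.full_like(irm, 0.5))
```

In this sketch the masks are computed from the clean speech and noise magnitudes, which are available only at training time; at test time a trained network would predict the mask (or the spectra) from the noisy input features.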