A Deep Ensemble Learning Method for Monaural Speech Separation

Zhang, Xiaolei; Wang, DeLiang

doi:10.1109/taslp.2016.2536478

Cited by 205 publications

(107 citation statements)

References 35 publications

Supporting

Mentioning: 102

Contrasting


“…Recently deep learning has been employed to address speaker separation. The general idea is to train a deep neural network (DNN) to predict T-F masks or spectra of two speakers in a mixture [7] [16] [42]. There are usually two output layers in such a DNN, one for an individual speaker.…”

Section: Introduction (mentioning)

confidence: 99%

Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation

Liu, Wang (2019). IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

149


We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
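The frame-level, permutation-invariant criterion mentioned in this abstract can be illustrated with a small NumPy sketch. This is not the authors' network or training code; the function name `frame_level_pit_loss`, the squared-error measure, and the toy array shapes are assumptions made for illustration. For each time frame, the sketch compares both output-to-speaker assignments and keeps the cheaper one, which is why a later sequential-grouping stage is needed to track speakers across frames.

```python
import numpy as np

def frame_level_pit_loss(est, ref):
    """Two-speaker frame-level permutation-invariant squared error.

    est, ref: arrays of shape (2, T, F) holding estimated and reference
    magnitude spectra (hypothetical layout: speaker, frame, frequency).
    Returns the mean per-frame loss and a boolean array of length T that
    is True where the swapped assignment was the better one.
    """
    # Error per frame when output 0 -> speaker 0 and output 1 -> speaker 1.
    err_keep = ((est[0] - ref[0]) ** 2).sum(axis=-1) + ((est[1] - ref[1]) ** 2).sum(axis=-1)
    # Error per frame when the two outputs are exchanged.
    err_swap = ((est[0] - ref[1]) ** 2).sum(axis=-1) + ((est[1] - ref[0]) ** 2).sum(axis=-1)
    swapped = err_swap < err_keep
    loss = np.minimum(err_keep, err_swap).mean()
    return loss, swapped

# Toy data: perfect estimates, except that the outputs are exchanged in frame 1.
rng = np.random.default_rng(0)
ref = rng.random((2, 3, 4))
est = ref.copy()
est[:, 1] = ref[::-1, 1]
loss, swapped = frame_level_pit_loss(est, ref)
```

In this toy case the loss is zero even though the outputs change order in one frame, which shows that the criterion ignores the speaker order within each frame.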



“…Lately, there has been increasing interest in nonlinear models, specifically, Deep Neural Networks (DNNs) [21,22,23,24]. In Deep Clustering (DPCL) [25,26], first, the time-frequency bins of the mixtures are mapped into an embedding space; then, a clustering algorithm is performed in the embedding space; finally, a binary mask is generated based on each cluster to reconstruct speech of each speaker.…”

Section: Introduction (mentioning)

confidence: 99%
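The three Deep Clustering steps described in the statement above (embed the time-frequency bins, cluster the embeddings, build one binary mask per cluster) can be sketched as follows. The embeddings here are synthetic stand-ins for the output of a trained network, and the function name `binary_masks_from_embeddings`, the farthest-point initialisation, and the plain k-means loop are illustrative assumptions rather than the method of any cited paper.

```python
import numpy as np

def binary_masks_from_embeddings(emb, n_speakers=2, iters=10):
    """Cluster per-bin embeddings with k-means and return binary masks.

    emb: array of shape (T, F, D), one D-dimensional embedding per
    time-frequency bin (assumed to come from a trained network).
    Returns an array of shape (n_speakers, T, F) with entries 0.0 or 1.0.
    """
    T, F, D = emb.shape
    x = emb.reshape(T * F, D)
    # Deterministic farthest-point initialisation of the centroids.
    centroids = [x[0]]
    for _ in range(n_speakers - 1):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(x[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    # Plain k-means: assign each bin to its nearest centroid, then update.
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(n_speakers):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return np.stack([(labels == j).reshape(T, F) for j in range(n_speakers)]).astype(float)

# Synthetic example: bins alternate between two speakers; embeddings are
# noisy one-hot vectors, so the two clusters are well separated.
rng = np.random.default_rng(0)
T, F = 4, 5
true_labels = np.arange(T * F).reshape(T, F) % 2
emb = np.eye(2)[true_labels] + 0.01 * rng.standard_normal((T, F, 2))
masks = binary_masks_from_embeddings(emb)

# Applying each mask to a (hypothetical) mixture magnitude spectrogram.
mixture = rng.random((T, F))
separated = masks * mixture
```

Because every bin is assigned to exactly one cluster, the masks sum to one at each bin and the masked spectrograms add back up to the mixture.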

Probabilistic Permutation Invariant Training for Speech Separation

2019


Single-microphone, speaker-independent speech separation is normally performed through two steps: (i) separating the specific speech sources, and (ii) determining the best output-label assignment to find the separation error. The second step is the main obstacle in training neural networks for speech separation. Recently proposed Permutation Invariant Training (PIT) addresses this problem by determining the output-label assignment which minimizes the separation error. In this study, we show that a major drawback of this technique is the overconfident choice of the output-label assignment, especially in the initial steps of training when the network generates unreliable outputs. To solve this problem, we propose Probabilistic PIT (Prob-PIT) which considers the output-label permutation as a discrete latent random variable with a uniform prior distribution. Prob-PIT defines a log-likelihood function based on the prior distributions and the separation errors of all permutations; it trains the speech separation networks by maximizing the log-likelihood function. Prob-PIT can be easily implemented by replacing the minimum function of PIT with a soft-minimum function. We evaluate our approach for speech separation on both TIMIT and CHiME datasets. The results show that the proposed method significantly outperforms PIT in terms of Signal to Distortion Ratio and Signal to Interference Ratio.
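The abstract's remark that Prob-PIT replaces the minimum of PIT with a soft-minimum can be made concrete with a short sketch. The exact parameterisation used in the paper is not given here, so the temperature `gamma`, the mean-squared-error measure, and the function names are assumptions. The soft-minimum below is the usual negative log-sum-exp of the per-permutation errors, which is what a uniform prior over permutations yields up to an additive constant.

```python
import itertools
import numpy as np

def permutation_errors(est, ref):
    """Separation error of every output-label permutation.

    est, ref: arrays of shape (S, T, F). Returns one error per
    permutation, in the order produced by itertools.permutations.
    """
    n = est.shape[0]
    return np.array([
        sum(np.mean((est[i] - ref[p[i]]) ** 2) for i in range(n))
        for p in itertools.permutations(range(n))
    ])

def pit_loss(errors):
    # Standard PIT: hard minimum over permutations.
    return errors.min()

def soft_min_loss(errors, gamma=1.0):
    # Soft-minimum: -gamma * log(sum_p exp(-e_p / gamma)), computed stably.
    # gamma is a hypothetical temperature; small values approach the hard min.
    z = -errors / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

# Toy data: the estimates match the references in swapped order, plus noise.
rng = np.random.default_rng(0)
ref = rng.random((2, 3, 4))
est = ref[::-1] + 0.01 * rng.standard_normal((2, 3, 4))
errors = permutation_errors(est, ref)
hard = pit_loss(errors)
soft = soft_min_loss(errors, gamma=1.0)
```

Unlike the hard minimum, the soft-minimum lets every permutation contribute to the gradient, weighted by how small its error is, which is one way to read the abstract's point about overconfident assignments early in training.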


“…To enhance the accuracy of weather prediction, Williams, Neilley, Koval, and McDonald () incorporated spatial‐temporal neighborhood bias information and used it to formulate a constraint‐regularized regression problem. Zhang and Wang () presented a multi‐context network, with one network averaging the output of multiple DNNs and the other stacking them together. Guzman, El‐Haliby, and Bruegge () compared the performance of four machine learning methods and their ensembles in classifying app reviews.…”

Section: Literature Review (mentioning)

confidence: 99%

A spatio‐temporal ensemble method for large‐scale traffic state prediction

Liu et al. (2019). Computer aided Civil Eng


How to effectively ensemble multiple models while leveraging the spatio‐temporal information is a challenging but practical problem. However, there is no existing ensemble method explicitly designed for spatio‐temporal data. In this paper, a fully convolutional model based on semantic segmentation technology is proposed, termed as spatio‐temporal ensemble net. The proposed method is suitable for grid‐based spatio‐temporal prediction in dense urban areas. Experiments demonstrate that through spatio‐temporal ensemble net, multiple traffic state prediction base models can be combined to improve the prediction accuracy.
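The citation statement for this entry contrasts two ways of combining base models: averaging their outputs and stacking them as inputs to a further model. A minimal sketch of the difference is given below. The base predictions are synthetic, and the second-level combiner is a linear least-squares fit used as a simple stand-in; it is not the DNN-based multi-context network of Zhang and Wang nor the convolutional ensemble net of this abstract, and all names are hypothetical.

```python
import numpy as np

def average_ensemble(preds):
    """preds: array of shape (M, N, D), M base models, N samples, D outputs."""
    return preds.mean(axis=0)

def fit_stacker(preds, target):
    """Fit a linear second-level model on the concatenated base outputs."""
    n = preds.shape[1]
    x = np.concatenate(list(preds), axis=1)   # shape (N, M * D)
    x = np.hstack([x, np.ones((n, 1))])       # add a bias column
    weights, *_ = np.linalg.lstsq(x, target, rcond=None)
    return weights

def stack_ensemble(preds, weights):
    n = preds.shape[1]
    x = np.concatenate(list(preds), axis=1)
    x = np.hstack([x, np.ones((n, 1))])
    return x @ weights

# Synthetic base models: the target plus a model-specific bias and noise.
rng = np.random.default_rng(0)
target = rng.random((50, 3))
preds = np.stack([
    target + bias + 0.1 * rng.standard_normal(target.shape)
    for bias in (0.3, -0.1, 0.2)
])

averaged = average_ensemble(preds)
stacked = stack_ensemble(preds, fit_stacker(preds, target))
mse_avg = np.mean((averaged - target) ** 2)
mse_stack = np.mean((stacked - target) ** 2)
```

On the data it was fitted to, the linear stacker cannot do worse than averaging, because averaging is one particular choice of its weights. Whether that advantage holds on held-out data depends on the models and is not shown here.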

