“…This idea is depicted in Figure 1, where we learn a model to recover the filter bank (FBANK) features from the mixed FBANK features and then feed each stream of the recovered FBANK features to a conventional LVCSR system for recognition. In the simplest architecture, denoted Arch#1 and illustrated in Figure 1(a), feature separation can be treated as a multi-class regression problem, similar to many previous works [29], [30], [31], [32], [33], [34]. In this architecture, Y, the features of the mixed speech, are used as the input to a deep learning model, such as a deep neural network (DNN), a convolutional neural network (CNN), or a long short-term memory (LSTM) recurrent neural network (RNN), to estimate the feature representation of each individual talker.…”
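To make the multi-output regression view of Arch#1 concrete, the sketch below trains a tiny one-hidden-layer network with two output heads, one per talker, to map mixed features back to each talker's features. Everything here is an illustrative assumption — the synthetic data, the FBANK dimension, the toy additive mixing, and the minimal NumPy MLP — not the paper's actual DNN/CNN/LSTM models or training setup.

```python
import numpy as np

# Hypothetical sketch of Arch#1: speech separation as multi-output regression
# from mixed FBANK features Y to each talker's FBANK features S1, S2.
rng = np.random.default_rng(0)
D, H, N = 40, 64, 256          # FBANK dim, hidden units, frames (all assumed)

# Synthetic stand-ins for the two talkers and their mixture (toy additive model).
S1 = rng.normal(size=(N, D))
S2 = rng.normal(size=(N, D))
Y = S1 + S2                    # mixed-speech features

# Shared hidden layer, two linear output heads (one per talker).
W1 = rng.normal(scale=0.1, size=(D, H)); b1 = np.zeros(H)
Wa = rng.normal(scale=0.1, size=(H, D)); ba = np.zeros(D)
Wb = rng.normal(scale=0.1, size=(H, D)); bb = np.zeros(D)

lr, losses = 0.01, []
for step in range(200):
    # Forward pass: shared representation, then one prediction per talker.
    h = np.tanh(Y @ W1 + b1)
    P1, P2 = h @ Wa + ba, h @ Wb + bb
    E1, E2 = P1 - S1, P2 - S2
    losses.append(np.mean(E1**2) + np.mean(E2**2))   # summed per-head MSE

    # Backprop through both heads and the shared hidden layer.
    gP1, gP2 = 2 * E1 / E1.size, 2 * E2 / E2.size
    gh = gP1 @ Wa.T + gP2 @ Wb.T
    gz = gh * (1 - h**2)                              # tanh derivative
    Wa -= lr * (h.T @ gP1); ba -= lr * gP1.sum(0)
    Wb -= lr * (h.T @ gP2); bb -= lr * gP2.sum(0)
    W1 -= lr * (Y.T @ gz);  b1 -= lr * gz.sum(0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")   # training loss decreases
```

Note one design choice baked into this sketch: the assignment of output heads to talkers is fixed in advance (head A is always trained against S1), which is exactly the multi-class regression formulation the excerpt describes.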