Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

Wang, Zhongqiu; Erdoğan, Hakan; Wisdom, Scott; Wilson, Kevin; Raj, Desh; Watanabe, Shinji; Chen, Zhuo; Hershey, John R.

doi:10.1109/slt48900.2021.9383522

Cited by 33 publications

(14 citation statements)

References 35 publications

“…We train separation networks using the same architecture as previous works [6,8,9,10], which separates sources by masking in a learned transform domain. The network is composed of a learnable encoder/decoder with 2.5 ms window and 1.25 ms hop, combined with a time-domain convolutional network (TDCN++).…”

Section: Methods (mentioning)

confidence: 99%

Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

Wisdom, Jansen, Weiss, et al. 2021

Preprint

Self Cite

Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. These problems interact: increasing the number of output sources exacerbates over-separation. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To extend MixIT to larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
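
The exponential cost mentioned in the abstract above can be illustrated with a brute-force search. What follows is a minimal sketch, not the authors' implementation: it assumes that each separated output is assigned to exactly one reference mixture, and it swaps in a plain squared-error loss for whatever signal-level loss the paper uses. The function name mixit_loss and its arguments are hypothetical.

```python
import itertools


def mixit_loss(mixtures, sources):
    """Brute-force MixIT-style loss (illustrative sketch only).

    mixtures: list of reference mixtures, each a list of samples.
    sources:  list of separated outputs, each a list of samples.
    Assumption: squared error stands in for the loss used in the paper.
    Returns (best_loss, best_assignment), where best_assignment[i] is the
    index of the mixture that source i is remixed into.
    """
    n_mix = len(mixtures)
    n_samples = len(mixtures[0])
    best_loss, best_assign = float("inf"), None
    # Every source goes to exactly one mixture, so there are
    # n_mix ** len(sources) candidates: exponential in the source count.
    for assign in itertools.product(range(n_mix), repeat=len(sources)):
        remix = [[0.0] * n_samples for _ in range(n_mix)]
        for src, mix_idx in zip(sources, assign):
            for t, value in enumerate(src):
                remix[mix_idx][t] += value
        loss = sum(
            (m - r) ** 2
            for mix, rem in zip(mixtures, remix)
            for m, r in zip(mix, rem)
        )
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign
```

With two reference mixtures and M output sources the loop visits 2^M assignments, which is why doubling the number of outputs quickly becomes impractical without an approximation such as the least-squares one the abstract describes.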

“…Our training configurations are illustrated in Figure 1. For supervised data, we use anechoic and reverberant versions of Libri2Mix [19,20]. The anechoic version is the official clean two-speaker mixtures, and the reverberant version RLibri2Mix [13] uses synthetic impulse responses using a simulator described in previous work [20].…”

Section: Experiments Setup (mentioning)

confidence: 99%

Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

Sivaraman, Wisdom, Erdoğan, et al. 2021

Preprint

Self Cite

The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement, PESQ scores, and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings.

“…The room simulation is based on the image method with frequency-dependent wall filters and is described in [24]. A simulated room with width between 3-7 meters, length between 4-8 meters, and height between 2.13-3.05 meters is sampled for each mixture, with a random microphone location, and the sources in the clip are each convolved with an impulse response from a different randomly sampled location within the simulated room.…”

Section: Data Preparation (mentioning)

confidence: 99%
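
The room-sampling step in the statement quoted above can be sketched as follows. This is only an illustration under stated assumptions: the quote gives the dimension ranges but not the sampling distribution, so uniform sampling is assumed, and positions are drawn independently anywhere inside the box. The function name sample_scene and the returned dictionary layout are hypothetical, and no impulse responses are computed here.

```python
import random


def sample_scene(num_sources, seed=0):
    """Sample one box-shaped room with a microphone and source positions.

    Dimension ranges (in meters) follow the quoted statement; the uniform
    distribution and independent positions are assumptions.
    """
    rng = random.Random(seed)
    width = rng.uniform(3.0, 7.0)
    length = rng.uniform(4.0, 8.0)
    height = rng.uniform(2.13, 3.05)
    dims = (width, length, height)

    def random_point():
        # A point anywhere inside the room, one coordinate per dimension.
        return tuple(rng.uniform(0.0, d) for d in dims)

    return {
        "room": dims,
        "mic": random_point(),
        "sources": [random_point() for _ in range(num_sources)],
    }
```

Each sampled source position would then be paired with the microphone position to generate one impulse response in a room simulator, which is outside the scope of this sketch.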

What's All the FUSS About Free Universal Sound Separation Data?

Wisdom, Erdoğan, Ellis, et al. 2020

Preprint

Self Cite

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.5 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
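
The SI-SNR and SI-SNRi figures quoted in the abstract above can be made concrete with a short sketch. It assumes the commonly used definition, in which the estimate is projected onto the reference before the signal-to-noise ratio is taken; the abstract itself does not spell out the formula, and mean removal is omitted here for brevity. The function names and the eps stabilizer are illustrative choices.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference.

    Assumed definition: project the estimate onto the reference, then
    compare the energy of that projection with the residual energy.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Because the reference is rescaled to best match the estimate, multiplying the estimate by a constant leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.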
