ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414774

What’s all the Fuss about Free Universal Sound Separation Data?

Abstract: We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation…
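The mixture-creation process the abstract describes (draw one to four single-source clips, reverberate each, and sum) can be sketched as below. This is a hedged illustration only, not the FUSS pipeline: `toy_rir` and `make_mixture` are hypothetical names, and the exponentially decaying noise stands in for the paper's image-method room simulator with frequency-dependent walls.

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 16000  # assumed sample rate for this sketch

def toy_rir(length=4000, decay=4.0):
    """Stand-in for a room impulse response: exponentially decaying noise.
    FUSS itself simulates box-shaped rooms with frequency-dependent
    reflective walls; this placeholder only mimics the decay envelope."""
    t = np.linspace(0.0, 1.0, length)
    return rng.standard_normal(length) * np.exp(-decay * t)

def make_mixture(sources):
    """Convolve each dry source with its own RIR and sum into one mixture."""
    n = max(len(s) for s in sources)
    reverberant = []
    for s in sources:
        wet = np.convolve(s, toy_rir())[:n]          # truncate convolution tail
        wet = np.pad(wet, (0, n - len(wet)))         # align lengths
        reverberant.append(wet)
    return np.sum(reverberant, axis=0), reverberant

# FUSS mixtures contain one to four sources; draw a count at random.
k = int(rng.integers(1, 5))
sources = [rng.standard_normal(SR * 10) for _ in range(k)]  # 10-second clips
mix, refs = make_mixture(sources)
```

The reverberant per-source signals (`refs`) are what a supervised separation model would use as training references, while `mix` is its input.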

Cited by 44 publications (41 citation statements)
References 22 publications
“…We train separation networks using the same architecture as previous works [6,8,9,10], which separates sources by masking in a learned transform domain. The network is composed of a learnable encoder/decoder with 2.5 ms window and 1.25 ms hop, combined with a time-domain convolutional network (TDCN++).…”
Section: Methods (mentioning, confidence: 99%)
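The quoted architecture, masking in a learned transform domain with a 2.5 ms window and 1.25 ms hop, can be sketched as follows. This is a minimal illustration under stated assumptions: the encoder/decoder matrices are random stand-ins for bases that would be trained end-to-end, the basis size of 256 is assumed, and a random sigmoid mask replaces the TDCN++ network that would actually predict per-source masks.

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 16000
WIN = int(0.0025 * SR)   # 2.5 ms window -> 40 samples at 16 kHz
HOP = int(0.00125 * SR)  # 1.25 ms hop  -> 20 samples
N_BASIS = 256            # size of the learned transform (assumed)

# Stand-ins for *trained* encoder/decoder bases; in the cited systems
# these are learned jointly with the separation network.
encoder = rng.standard_normal((WIN, N_BASIS)) / np.sqrt(WIN)
decoder = rng.standard_normal((N_BASIS, WIN)) / np.sqrt(N_BASIS)

def frame(x):
    """Slice a signal into overlapping frames of WIN samples, HOP apart."""
    n_frames = 1 + (len(x) - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    return x[idx]  # shape (n_frames, WIN)

def overlap_add(frames, n_samples):
    """Reconstruct a signal by summing overlapping decoded frames."""
    out = np.zeros(n_samples)
    for i, f in enumerate(frames):
        out[i * HOP : i * HOP + WIN] += f
    return out

x = rng.standard_normal(SR)             # 1 s of audio
feats = frame(x) @ encoder              # encode: (n_frames, N_BASIS)
# A TDCN++ would predict one mask per output source from `feats`;
# a random sigmoid-valued mask stands in for the network here.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(feats.shape)))
sep = overlap_add((feats * mask) @ decoder, len(x))  # masked decode
```

The design point the quote makes is that the transform is learned rather than fixed (as an STFT would be), so the encoder, mask network, and decoder can be optimized jointly for separation.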
“…Separation performance is evaluated using several supervised synthetic datasets containing reference sources. Since we train our separation models on a universal sound separation task, we primarily focus on evaluation with the FUSS dataset [9], which contains 10-second mixtures of one to four arbitrary sound sources drawn from […]. Since many practical separation applications focus on speech signals, we additionally evaluate how well unsupervised universal separation models can generalize to two specific task domains: speech separation, using mixtures of two overlapping speakers from Libri2Mix [22], and speech enhancement, using the same dataset as previous work [8] in which speech is drawn from librivox.org, and non-speech from freesound.org.…”
Section: Methods (mentioning, confidence: 99%)
“…These include AEs such as human sounds, object sounds, musical instruments, etc. [10]. We generated two test sets using FSD-Kaggle and a subset of the FSD50K data consisting of the AE sound samples from a single AE class provided in the FUSS dataset [24]. We used FSD50K to generate data with new AE classes unseen in the FSD-Kaggle training set.…”
Section: Dataset (mentioning, confidence: 99%)