ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053396

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis

Abstract: Source separation is the task of separating an audio recording into individual sound sources, and it is fundamental to computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music, and much of it requires mixture and clean source pairs for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data contains only the tags of an audio clip, without …
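The framework outlined in the abstract pairs an audio tagging model with a class-conditional separator: a one-hot class query tells the separation network which source to extract from a mixture of weakly labelled clips. Below is a minimal PyTorch sketch of that idea; the network shape, the name `SeparatorNet`, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 527  # size of the AudioSet ontology (assumption)

class SeparatorNet(nn.Module):
    """Toy conditional separator: masks the mixture spectrogram for one class."""
    def __init__(self, n_bins=513):
        super().__init__()
        self.cond = nn.Linear(NUM_CLASSES, n_bins)  # class query -> per-bin bias
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, mix, query):
        # mix: (batch, 1, frames, n_bins); query: (batch, NUM_CLASSES) one-hot
        bias = self.cond(query)[:, None, None, :]   # broadcast over time frames
        mask = self.net(mix + bias)                 # class-dependent soft mask
        return mask * mix                           # estimated source spectrogram

# One weakly supervised training step: mix two clips that carry different
# tags and regress the spectrogram of the clip matching the query class.
model = SeparatorNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

spec_a = torch.rand(4, 1, 100, 513)  # clip tagged with class 0 (toy data)
spec_b = torch.rand(4, 1, 100, 513)  # clip tagged with class 137 (toy data)
query = torch.zeros(4, NUM_CLASSES)
query[:, 0] = 1.0                    # one-hot query: "extract class 0"

opt.zero_grad()
est = model(spec_a + spec_b, query)  # separate clip A out of the mixture
loss = torch.abs(est - spec_a).mean()
loss.backward()
opt.step()
```

The key point is that no clean mixture/source pairs are needed: supervision comes from artificially summing weakly tagged clips and asking the network to recover the clip whose tag matches the query.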

Cited by 32 publications (61 citation statements)
References 21 publications

“…Tzinis et al. [4] performed separation experiments with a fixed number of sources on the 50-class ESC-50 dataset [5]. Other papers have leveraged information about sound class, either as conditioning information or as a weak supervision signal [6,2,7].…”
Section: Relation To Prior Work
confidence: 99%
“…However, none of these approaches explicitly solved the problem of non-target events. Sound separation can be used for SED by first separating the component sounds in a mixed signal and then applying SED on each of the separated tracks [15,7,16,17,18].…”
Section: Relation To Prior Work
confidence: 99%
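The separate-then-detect pipeline described in the excerpt above is straightforward to sketch in Python; `separate` and `detect_events` below are hypothetical stand-ins for trained models, not functions from any cited work.

```python
from typing import List, Tuple
import numpy as np

def separate(mixture: np.ndarray) -> List[np.ndarray]:
    """Hypothetical stand-in for a trained separator (returns per-source tracks)."""
    return [mixture * 0.5, mixture * 0.5]  # placeholder split

def detect_events(track: np.ndarray) -> List[Tuple[str, float, float]]:
    """Hypothetical stand-in for an SED model: (class, onset_s, offset_s) triples."""
    return [("speech", 0.0, 1.0)]  # placeholder prediction

def sed_via_separation(mixture: np.ndarray) -> List[Tuple[str, float, float]]:
    events = []
    for track in separate(mixture):    # run SED once per separated track
        events.extend(detect_events(track))
    return events

print(sed_via_separation(np.zeros(16000)))
```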
“…Kong et al. [8], who also train a neural network for conditional source separation of single-channel audio. This approach uses a classification model trained on AudioSet [5], which consists of 10s segments with weak class labels.
Section: Related Work
confidence: 99%
“…For each such class, a one-hot vector indicating the selected class is then used to extract the different sources. In short, a key difference between Kong et al. [8] and our approach is that the former requires labeled data to train the classifier model, whereas our SoundFilter operates in a fully unlabeled setup. In addition, the embedding used in [8] is defined in terms of AudioSet's class ontology.…”
Section: Related Work
confidence: 99%
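The one-hot extraction procedure this excerpt describes amounts to running the conditional separator once per predicted class. A short sketch, reusing the illustrative `SeparatorNet` and `NUM_CLASSES` from the sketch after the abstract (class indices here are toy values):

```python
import torch

# Assumes SeparatorNet and NUM_CLASSES from the earlier sketch are in scope.
model = SeparatorNet().eval()
mix_spec = torch.rand(1, 1, 100, 513)  # spectrogram of the input mixture
predicted_classes = [0, 137]           # tags predicted for the clip (toy values)

sources = {}
with torch.no_grad():
    for c in predicted_classes:
        query = torch.zeros(1, NUM_CLASSES)
        query[0, c] = 1.0                    # one-hot vector selecting class c
        sources[c] = model(mix_spec, query)  # one separated spectrogram per class
```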
“…For speech separation, different methods have been designed [9][10][11][12][13][14]. These include approaches such as Computational Auditory Scene Analysis (CASA) [15][16][17][18], Hidden Markov Models (HMMs) [19][20][21], HMMs combined with Mel-Frequency Cepstral Coefficients [22][23][24], Non-negative Matrix Factorization (NMF) [25][26][27][28], and Minimum Mean Square Error (MMSE) estimation [29][30][31][32]. However, these strategies have seen relatively little success.…”
Section: Introduction
confidence: 99%
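Of the classical methods listed in that excerpt, NMF is the easiest to illustrate compactly: it factorizes a magnitude spectrogram V ≈ WH and treats each rank-one term as one source's spectrogram. A minimal NumPy sketch with Lee-Seung multiplicative updates follows; the data is toy and this is not any cited paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((513, 200)))  # toy magnitude spectrogram (bins x frames)
K = 2                                        # number of NMF components / sources
W = rng.random((513, K)) + 1e-3              # spectral templates
H = rng.random((K, 200)) + 1e-3              # temporal activations

for _ in range(200):  # Lee-Seung multiplicative updates for ||V - WH||^2
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Wiener-style soft masks: each component's share of the mixture spectrogram
V_hat = W @ H + 1e-9
sources = [np.outer(W[:, k], H[k]) / V_hat * V for k in range(K)]
```

In practice each component (or a group of components) is assigned to a source, and the masked spectrograms are inverted back to waveforms with the mixture phase.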