2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
DOI: 10.1109/globalsip.2017.8309164
Single channel audio source separation using convolutional denoising autoencoders

Abstract: Deep learning techniques have been used recently to tackle the audio source separation problem. In this work, we propose to use deep fully convolutional denoising autoencoders (CDAEs) for monaural audio source separation. We use as many CDAEs as the number of sources to be separated from the mixed signal. Each CDAE is trained to separate one source and treats the other sources as background noise. The main idea is to allow each CDAE to learn suitable spectral-temporal filters and features to its corresponding …
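As a rough illustration of this setup, here is a minimal sketch assuming a two-source task and arbitrary layer sizes (not the paper's exact configuration): each source gets its own convolutional encoder-decoder that takes the mixture's magnitude spectrogram and is trained to output only that source, treating everything else as noise.

```python
# Sketch: one CDAE per source; each maps the mixture spectrogram to one target source.
# Layer sizes, patch shapes, and the two-source setup are illustrative assumptions.
import torch
import torch.nn as nn

class CDAE(nn.Module):
    """Convolutional encoder-decoder that denoises a mixture into one target source."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1), nn.ReLU(),
        )

    def forward(self, mix_spec):            # mix_spec: (batch, 1, freq, time)
        return self.decoder(self.encoder(mix_spec))

# One CDAE per source to be separated (source names here are placeholders).
cdaes = {name: CDAE() for name in ["vocals", "accompaniment"]}

mixture = torch.rand(8, 1, 128, 128)        # dummy magnitude-spectrogram patches
estimates = {name: net(mixture) for name, net in cdaes.items()}
```

At separation time, the mixture is simply passed through every CDAE, and each network's output is taken as the estimate of its corresponding source.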

Cited by 87 publications (50 citation statements) · References 32 publications
“…A variety of networks have been successfully applied to two-source separation problems, including LSTMs and bidirectional LSTMs (BLSTMs) [2,4], U-Nets [15], Wasserstein GANs [16], and fully convolutional network (FCN) encoder-decoders followed by a BLSTM [17]. For multi-source separation, a variety of architectures have been used that directly generate a mask for each source, including BLSTMs [6,9], CNNs [18], DenseNets followed by an LSTM [19], separate encoder-decoder networks for each source [20], joint one-to-many encoder-decoder networks with one decoder per source [21], and TDCNs with learnable analysis… [Figure 1: Architecture for mask-based separation experiments. We vary the mask network and analysis/synthesis transforms.]”
Section: Prior Work (mentioning)
confidence: 99%
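For contrast with the one-network-per-source approach of the present paper, here is a minimal sketch of the mask-based setup this excerpt describes: a single network predicts one mask per source, and each source estimate is the mixture spectrogram weighted by its mask. The tiny CNN and shapes below are illustrative assumptions, not any specific cited architecture.

```python
# Sketch of mask-based separation: predict one mask per source, apply each to the mixture.
# The small CNN below is a placeholder, not an architecture from the cited works.
import torch
import torch.nn as nn

n_sources = 2
mask_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, n_sources, kernel_size=3, padding=1),
)

mixture = torch.rand(8, 1, 128, 128)                 # dummy magnitude-spectrogram patches
masks = torch.softmax(mask_net(mixture), dim=1)      # masks sum to 1 across sources
estimates = masks * mixture                          # (8, n_sources, 128, 128) source estimates
```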
“…Previous source separation work has focused on speech enhancement and speech separation [6,16,22,23]. Small datasets used for the non-speech multi-source separation setting have included distress sounds from DCASE 2017 [18], and speech and music in SiSEC-2015 [17,20]. Singing voice separation has focused on vocal and music instrument tracks [15,24].…”
Section: Prior Work (mentioning)
confidence: 99%
“…Neural network based regression methods have been used to solve music separation and speech separation in [6,7,8,10,11]. Regression based source separation methods learn a mapping from a mixture of sources to a target source to be separated.…”
Section: Regression Based Source Separation (mentioning)
confidence: 99%
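A minimal sketch of that regression formulation, assuming magnitude-spectrogram inputs and a placeholder network: the model maps the mixture directly to the clean target-source spectrogram and is trained with a mean-squared-error loss.

```python
# Sketch of regression-based separation: learn a mapping mixture -> target source,
# trained with MSE against the clean target spectrogram. Model and data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(                               # stand-in for any separation network
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

mixture = torch.rand(8, 1, 128, 128)                 # dummy mixture spectrograms
target = torch.rand(8, 1, 128, 128)                  # dummy clean target-source spectrograms

prediction = model(mixture)
loss = loss_fn(prediction, target)                   # regression objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```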
“…On the other hand, convolutional layers, as used in convolutional neural networks (CNNs), use a set of small filters and share their weights among all locations in the data (e.g., LeCun et al. 1998b), which allows them to better capture local features. As a result, CNNs generally have two or more orders of magnitude fewer parameters than analogous fully connected neural networks (e.g., Grais & Plumbley 2017) and require far fewer training resources such as memory and time. Furthermore, multiple convolutional layers can easily be stacked to extract sophisticated higher-level features by composing the lower-level ones obtained in previous layers.…”
Section: Convolutional Denoising Autoencoder (mentioning)
confidence: 99%
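A quick back-of-the-envelope check of that parameter argument (sizes are arbitrary assumptions): a convolutional layer's parameter count depends only on its small shared filters, while a fully connected layer between the same input and output sizes scales with the product of their dimensions.

```python
# Rough parameter-count comparison for the weight-sharing argument above.
# Sizes are arbitrary; the point is the orders-of-magnitude gap.
freq, time = 128, 128                     # spectrogram patch size
in_ch, out_ch, k = 1, 16, 3               # channels and filter size

conv_params = out_ch * (in_ch * k * k) + out_ch                            # filters + biases
dense_params = (freq * time * in_ch) * (freq * time * out_ch) + (freq * time * out_ch)

print(f"conv layer:  {conv_params:,} parameters")    # 160
print(f"dense layer: {dense_params:,} parameters")   # roughly 4.3 billion
```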
“…Among various deep learning algorithms, the autoencoder is a common type of neural network that aims at learning useful features from the input data in an unsupervised manner; it is usually used for dimensionality reduction (e.g., Hinton & Salakhutdinov 2006; Wang et al. 2014) and data denoising (e.g., Xie et al. 2012; Bengio et al. 2013; Lu et al. 2013). In particular, the convolutional denoising autoencoder (CDAE) is very flexible and powerful in capturing subtle and complicated features in the data and has been successfully applied to weak gravitational-wave signal denoising (e.g., Shen et al. 2017), monaural audio source separation (e.g., Grais & Plumbley 2017), and so on. These applications have demonstrated the outstanding ability of the CDAE to extract weak signals from highly temporally variable data.…”
Section: Introduction (mentioning)
confidence: 99%