ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054172
Two-Step Sound Source Separation: Training On Learned Latent Targets

Abstract: In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step, we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. In the second step, we train a separation module that operates on the previously learned space. To do so, we also make use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bo…
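The SI-SDR objective named in the abstract can be sketched as follows. This is a minimal NumPy sketch of the standard time-domain SI-SDR, not the paper's latent-space variant; the function name and the `eps` stabilizer are illustrative assumptions:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (sketch).

    Projects the estimate onto the target, then measures the ratio of
    projected-target energy to residual energy. Invariant to the scale
    of the target signal.
    """
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # scaled target component
    residual = estimate - projection     # everything not explained by the target
    return 10 * np.log10(
        np.dot(projection, projection) / (np.dot(residual, residual) + eps)
    )
```

Because the target is rescaled by the optimal `alpha` before comparison, multiplying the target by any constant leaves the score unchanged, which is the property that makes the metric "scale-invariant".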

Cited by 57 publications (52 citation statements); References 24 publications.
“…We trained all models with 4 sec speech segments. For the models trained with SI-SNR, we pre-processed the target signals by variance normalization using the standard deviation of the mixture as in [23]. As a separator for Conv-TasNet models we used the TCN version by Tzinis et al [23].…”
Section: Methods
confidence: 99%
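The target pre-processing described in the excerpt above (variance normalization using the mixture's standard deviation) might look like this. A minimal NumPy sketch; the function name and epsilon are assumptions, not the cited paper's code:

```python
import numpy as np

def normalize_targets(mixture, targets, eps=1e-8):
    """Divide each target source by the standard deviation of the mixture.

    Sketch of the pre-processing step used when training with SI-SNR:
    all sources are scaled by the same mixture-derived factor, so their
    relative levels are preserved.
    """
    std = mixture.std() + eps
    return [t / std for t in targets]
```

Using a single mixture-level statistic (rather than per-source statistics) keeps the relative energies of the sources intact, which matters for a loss like SI-SNR that compares reconstructions against these rescaled references.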
“…For the models trained with SI-SNR, we pre-processed the target signals by variance normalization using the standard deviation of the mixture as in [23]. As a separator for Conv-TasNet models we used the TCN version by Tzinis et al [23]. We used the ADAM optimizer with a learning rate of 1e-3, and divided the learning rate by 2 after 5 consecutive epochs with no reduction in validation loss.…”
Section: Methods
confidence: 99%
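The learning-rate recipe quoted above (halve the rate after 5 consecutive epochs with no reduction in validation loss) can be sketched as a small scheduler. The class name and defaults are illustrative; in practice PyTorch's `ReduceLROnPlateau` with `factor=0.5, patience=5` plays a similar role:

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` consecutive epochs
    with no improvement in validation loss (sketch of the cited recipe)."""

    def __init__(self, lr=1e-3, patience=5):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Improvement: reset the no-progress counter.
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 2
                self.bad_epochs = 0
        return self.lr
```

Called once per epoch with the current validation loss, it returns the learning rate to use next.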
“…Zadeh et al [3] constructed a small (less than 1 hour) dataset with 25 sound classes and proposed a transformer-based model to separate a fixed number of sources. Tzinis et al [4] performed separation experiments with a fixed number of sources on the 50-class ESC-50 dataset [5]. Other papers have leveraged information about sound class, either as conditioning information or as a weak supervision signal [6,2,7].…”
Section: Relation To Prior Work
confidence: 99%
“…Some models work in the time-frequency domain: DPCL++ [7], uPIT-BLSTM-ST [12] and Chimera++ [36]. Some models work in the time-domain: BLSTM-TasNet [21], Conv-TasNet [22], Two-Step TDCN [32], MSGT-TasNet [41], SuDoRM-RF [33], DualPathRNN [20], Sepformer [31] and Gated DualPathRNN [23]. SuDoRM-RF has four variants which are labeled by appending 0.25x, 0.5x, 1.0x and 2.5x to the end of the name, indicating the variants consist of 4, 8, 16 and 40 blocks, respectively.…”
Section: Comparison With Existing Models
confidence: 99%