ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054172
Two-Step Sound Source Separation: Training On Learned Latent Targets

Abstract: In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step, we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. In the second step, we train a separation module that operates on the previously learned space. To do so, we also make use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bo…
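The SI-SDR objective named in the abstract can be sketched as follows. This is a minimal NumPy sketch of the standard time-domain SI-SDR, not the paper's latent-space variant; the function name and the `eps` stabilizer are illustrative assumptions:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (sketch).

    Projects the estimate onto the target, then measures the ratio of
    projected-target energy to residual energy. Invariant to the scale
    of the target signal.
    """
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # scaled target component
    residual = estimate - projection     # everything not explained by the target
    return 10 * np.log10(
        np.dot(projection, projection) / (np.dot(residual, residual) + eps)
    )
```

Because the target is rescaled by the optimal `alpha` before comparison, multiplying the target by any constant leaves the score unchanged, which is the property that makes the metric "scale-invariant".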

Cited by 57 publications (52 citation statements); References 24 publications.
“…We trained all models with 4 sec speech segments. For the models trained with SI-SNR, we pre-processed the target signals by variance normalization using the standard deviation of the mixture as in [23]. As a separator for Conv-TasNet models we used the TCN version by Tzinis et al [23].…”
Section: Methods
confidence: 99%
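The target pre-processing described in the excerpt above (variance normalization using the mixture's standard deviation) might look like this. A minimal NumPy sketch; the function name and epsilon are assumptions, not the cited paper's code:

```python
import numpy as np

def normalize_targets(mixture, targets, eps=1e-8):
    """Divide each target source by the standard deviation of the mixture.

    Sketch of the pre-processing step used when training with SI-SNR:
    all sources are scaled by the same mixture-derived factor, so their
    relative levels are preserved.
    """
    std = mixture.std() + eps
    return [t / std for t in targets]
```

Using a single mixture-level statistic (rather than per-source statistics) keeps the relative energies of the sources intact, which matters for a loss like SI-SNR that compares reconstructions against these rescaled references.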
“…For the models trained with SI-SNR, we pre-processed the target signals by variance normalization using the standard deviation of the mixture as in [23]. As a separator for Conv-TasNet models we used the TCN version by Tzinis et al [23]. We used the ADAM optimizer with a learning rate of 1e-3, and divided the learning rate by 2 after 5 consecutive epochs with no reduction in validation loss.…”
Section: Methods
confidence: 99%
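The learning-rate recipe quoted above (halve the rate after 5 consecutive epochs with no reduction in validation loss) can be sketched as a small scheduler. The class name and defaults are illustrative; in practice PyTorch's `ReduceLROnPlateau` with `factor=0.5, patience=5` plays a similar role:

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` consecutive epochs
    with no improvement in validation loss (sketch of the cited recipe)."""

    def __init__(self, lr=1e-3, patience=5):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Improvement: reset the no-progress counter.
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 2
                self.bad_epochs = 0
        return self.lr
```

Called once per epoch with the current validation loss, it returns the learning rate to use next.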
“…Zadeh et al [3] constructed a small (less than 1 hour) dataset with 25 sound classes and proposed a transformer-based model to separate a fixed number of sources. Tzinis et al [4] performed separation experiments with a fixed number of sources on the 50-class ESC-50 dataset [5]. Other papers have leveraged information about sound class, either as conditioning information or as a weak supervision signal [6,2,7].…”
Section: Relation To Prior Work
confidence: 99%
“…Some models work in the time-frequency domain: DPCL++ [7], uPIT-BLSTM-ST [12] and Chimera++ [36]. Some models work in the time-domain: BLSTM-TasNet [21], Conv-TasNet [22], Two-Step TDCN [32], MSGT-TasNet [41], SuDoRM-RF [33], DualPathRNN [20], Sepformer [31] and Gated DualPathRNN [23]. SuDoRM-RF has four variants which are labeled by appending 0.25x, 0.5x, 1.0x and 2.5x to the end of the name, indicating the variants consist of 4, 8, 16 and 40 blocks, respectively.…”
Section: Comparison With Existing Models
confidence: 99%