ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054266
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation

Abstract: Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches over conventional time-frequency-based methods. Unlike the time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one…

Cited by 509 publications (412 citation statements)
References 42 publications
“…The signal to distortion ratio (SDR) [22] and the scale-invariant signal-to-noise ratio (SISNR) [23] have been steadily increasing on WSJ0-2mix [1], the most widely used speech separation dataset, which indicates the consistent progress of the separation technology. An early system [1] achieved SDR improvement of 6.3 dB while [24] improved the SDR by 19.0 dB. [20] reported that, in WSJ0-2mix, separated speech signals generated by TasNet, one of the state-of-the-art separation methods, were almost indistinguishable from clean utterances.…”
Section: Introduction (mentioning)
confidence: 99%
“…For each block in the neural networks for filter estimation, e.g. each temporal convolution network (TCN) in [18] or each dual-path RNN (DPRNN) block in [19], the TAC architecture proposed in Section 2.1 is added at the output of each block (Figure 1). However, the pre-separation results at the reference microphone still cannot benefit from the TAC operation with the two-stage design.…”
Section: FaSNet Variants with TAC (mentioning)
confidence: 99%
“…For multichannel models, we use the four variants of FaSNet introduced in Section 2.2.2. We use DPRNN blocks [19], as shown in Figure 1, in all models, as it has been shown that DPRNN outperforms the previously proposed temporal convolutional network (TCN) with a significantly smaller model size [19]. All models are trained to minimize negative scale-invariant SNR (SI-SNR) [25] with utterance-level permutation invariant training (uPIT) [26].…”
Section: Model Configurations (mentioning)
confidence: 99%
“…PIT [44], [45] is a different approach to address the global speaker label ambiguity which greatly streamlined the processing scheme and led to many top-performing separation models such as TasNet [52] or DPRNN [53]. The idea is to skip the embedding representation entirely and train a network that directly estimates posterior masks from the observation.…”
Section: PIT: Permutation Invariant Training (mentioning)
confidence: 99%
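The PIT idea described in this excerpt can be sketched in a few lines: evaluate the separation loss under every assignment of estimated outputs to reference speakers and keep the best one, so the network is never penalized for emitting speakers in a different order. A hedged illustration (the function names and the MSE pair loss are my own choices; systems like TasNet and DPRNN typically use negative SI-SNR as the pair loss and batch the permutations on GPU):

```python
from itertools import permutations

import numpy as np

def pit_loss(estimates, targets, pair_loss):
    """Permutation invariant loss: minimum total pair loss over all
    assignments of estimated sources to reference sources."""
    n = len(targets)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(pair_loss(estimates[i], targets[j])
                    for i, j in enumerate(perm))
        best = min(best, total)
    return best

def mse(a, b):
    # Simple per-pair loss for illustration only.
    return float(np.mean((a - b) ** 2))
```

With two speakers whose estimates come out in swapped order, `pit_loss` still finds the zero-loss pairing; this is what resolves the global speaker-label ambiguity the excerpt mentions. The exhaustive search costs n! evaluations, which is acceptable for the two- or three-speaker mixtures these papers consider.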