ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9052981

Demystifying TasNet: A Dissecting Approach

Abstract: In recent years, time-domain speech separation has excelled over frequency-domain separation in single-channel scenarios and noise-free environments. In this paper, we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system in the frequency domain until the TasNet system is reached, thus blending components of frequency-domain approaches with those of the time-domain approach…

Cited by 49 publications (29 citation statements); references 20 publications.
“…The SDR equation is expressed as follows: where ∥·∥ denotes the ℓ2-norm and s_target is the near-end speech at the first microphone used in this study. Second, the logarithmic mean squared error (LMSE) [32] is defined as an additional loss in the latent domain to reduce the error of the echo and noise repeatedly estimated by the TCN blocks in each tower, where d, n, d̂_P, and n̂_P are the latent features of the target echo, the latent features of the target noise, the P-th TCN output of the echo tower, and the P-th TCN output of the noise tower, respectively.…”
Section: Proposed Multi-channel Cross-tower With Attention Mechanism
confidence: 99%
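The scale-invariant SDR criterion that these statements build on can be sketched in a few lines of NumPy. This is a minimal illustrative version, not the cited implementation; the function name, `eps` guard, and signal values are assumptions:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better).

    Projects the estimate onto the target to find the optimal
    scaling, then compares target energy to residual energy.
    """
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target          # scaled target component
    e_noise = estimate - s_target      # everything that is not the target
    return 10 * np.log10(np.dot(s_target, s_target)
                         / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is exactly the scale-invariance property the name refers to.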
“…where ∥·∥ denotes the ℓ2-norm and s_target is the near-end speech at the first microphone used in this study. Second, the logarithmic mean squared error (LMSE) [32] is defined as an additional loss in the latent domain to reduce the error of the echo and noise repeatedly estimated by the TCN blocks in each tower:

LMSE_P = 10 log10(|d − d̂_P|²) + 10 log10(|n − n̂_P|²),…”
Section: Training Objective
confidence: 99%
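The per-stage LMSE above is a sum of two log-energy terms, one for the echo error and one for the noise error at the P-th TCN output. A minimal sketch, assuming the latent features are plain NumPy vectors (the helper names and `eps` guard are illustrative, not from the cited work):

```python
import numpy as np

def log_energy(err, eps=1e-8):
    """10*log10 of the squared-error energy, guarded against log(0)."""
    return 10 * np.log10(np.sum(err ** 2) + eps)

def lmse_p(d, n, d_hat_p, n_hat_p):
    """LMSE at the P-th TCN stage: echo-error term plus noise-error term."""
    return log_energy(d - d_hat_p) + log_energy(n - n_hat_p)
```

Summing `lmse_p` over all P stages of both towers would give the total auxiliary loss described in the statement.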
“…More recently, the convolutional time-domain audio separation network (Conv-TasNet) [4] was proposed and achieved significant separation performance improvements over time-frequency based techniques. Conv-TasNet has attracted widespread attention and has been further improved in many recent works, for both single-channel and multi-channel speech separation tasks [5][6][7].…”
Section: Introduction
confidence: 99%
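The encoder–mask–decoder structure that Conv-TasNet popularized can be illustrated with a toy sketch. Everything here is a stand-in: random matrices replace the learned 1-D convolutional encoder/decoder, a uniform mask replaces the TCN separator, and all dimensions are hypothetical:

```python
import numpy as np

# Toy dimensions: frame length L, 50% hop, N basis filters, C speakers.
L, N, C = 16, 32, 2
rng = np.random.default_rng(0)
encoder = 0.1 * rng.standard_normal((N, L))  # stands in for the learned 1-D conv encoder
decoder = 0.1 * rng.standard_normal((L, N))  # stands in for the learned transposed-conv decoder

def separate(mixture, mask_fn):
    """Encode each frame, mask per speaker in the latent space, decode with overlap-add."""
    hop = L // 2
    out = np.zeros((C, len(mixture)))
    norm = np.zeros(len(mixture))
    for s in range(0, len(mixture) - L + 1, hop):
        rep = np.maximum(encoder @ mixture[s:s + L], 0.0)  # non-negative latent frame (ReLU)
        masks = mask_fn(rep)                               # (C, N); a real model uses a TCN here
        for c in range(C):
            out[c, s:s + L] += decoder @ (masks[c] * rep)
        norm[s:s + L] += 1.0
    return out / np.maximum(norm, 1.0)

# Placeholder separator: split the latent representation evenly between speakers.
uniform_masks = lambda rep: np.full((C, rep.shape[0]), 1.0 / C)
```

The point of the sketch is the data flow, not the quality: a trained model replaces `encoder`, `decoder`, and `mask_fn` with learned components.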
“…While earlier publications, such as permutation invariant training (PIT) [8], deep clustering [9] and variants thereof [10], employed neural network training criteria defined in the Short-Time Fourier Transform (STFT) domain, more recent publications suggest that loss functions defined in the time domain, such as the (scale-invariant) Signal-to-Distortion Ratio (SDR), generally achieve superior separation performance [11,12]. In fact, the investigation in [13] showed that the advantage of time-domain loss functions is maintained even if the mask estimation is actually carried out in the frequency domain. However, the combination of time-domain NN training criteria and source extraction by beamforming at training time is largely unexplored, and will be the focus of this work.…”
Section: Introduction
confidence: 99%
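Utterance-level PIT, mentioned above, resolves the speaker-label ambiguity by minimizing the loss over all output-to-target permutations for the whole utterance. A brute-force sketch, using MSE purely as an example criterion (real systems plug in an SDR-style loss):

```python
import itertools
import numpy as np

def upit_loss(estimates, targets):
    """Utterance-level PIT: minimum mean loss over all speaker permutations.

    estimates, targets: lists of equal-length 1-D signal arrays, one per speaker.
    """
    n_spk = len(targets)
    best = float("inf")
    for perm in itertools.permutations(range(n_spk)):
        loss = sum(np.mean((estimates[i] - targets[p]) ** 2)
                   for i, p in enumerate(perm)) / n_spk
        best = min(best, loss)
    return best
```

Because the minimum is taken over permutations, swapping the order of the network outputs does not change the loss, which is what makes the training criterion permutation invariant.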