2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462116

TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

Abstract: Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging particularly in real-time, short latency applications. Most methods attempt to construct a mask for each source in time-frequency representation of the mixture signal which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition results in inherent prob…
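The abstract's core idea, operating directly on the waveform with a learned basis and one mask per source instead of masking a time-frequency representation, can be sketched roughly as below. All weights are random stand-ins for the trained encoder, separation network, and decoder; this illustrates the structure only, not the paper's model.

```python
import numpy as np

def frame(x, win, hop):
    # Cut the waveform into overlapping frames of length `win`.
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def overlap_add(frames, hop):
    # Reassemble overlapping frames into a waveform.
    win = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + win] += f
    return out

rng = np.random.default_rng(0)
win, hop, n_basis, n_src = 40, 20, 64, 2

mixture = rng.standard_normal(8000)        # stand-in mixture waveform
B = rng.standard_normal((win, n_basis))    # encoder basis (learned in practice)
B_inv = np.linalg.pinv(B)                  # decoder basis (learned in practice)

X = frame(mixture, win, hop)               # (frames, win)
W = np.maximum(X @ B, 0.0)                 # nonnegative mixture weights
# a trained separation network would predict these masks; random logits here
logits = rng.standard_normal((n_src, *W.shape))
masks = np.exp(logits) / np.exp(logits).sum(axis=0)  # softmax over sources

sources = [overlap_add((m * W) @ B_inv, hop) for m in masks]
```

Because the masks sum to one across sources, the separated waveforms partition the (decoded) mixture energy; in the real system the encoder/decoder and masks are trained end-to-end on a separation objective.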

Cited by 485 publications (374 citation statements)
References 24 publications (45 reference statements)
“…We consider the update that does not modify the phase of s j (i.e., with a '+' sign in (12)), as it corresponds to the equality case of Eq. (7). Under such an update, ψ(s) = ψ + (s, y), which shows that ψ + is an auxiliary function for ψ.…”
Section: B. Auxiliary Function
confidence: 85%
“…The dataset is the same as the two-talker mixed dataset in [3], [4], [6], [7], [8], except that the sample rate is 16 kHz. It is generated by mixing the utterances in WSJ0 corpus at various signal-to-noise ratios uniformly chosen between 0 dB and 5 dB, and has 20k, 5k and 3k mixtures for training, validation and testing respectively.…”
Section: A. Experimental Setup
confidence: 99%
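The mixing procedure this snippet describes, summing WSJ0 utterances at a signal-to-noise ratio drawn uniformly between 0 and 5 dB, can be sketched as follows. `mix_at_snr` is a hypothetical helper, not code from the cited work, and the random signals stand in for real utterances.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that s1 is snr_db louder in power, then sum."""
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    s2_scaled = s2 * np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + s2_scaled, s2_scaled

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)   # stand-ins for two WSJ0 utterances,
s2 = rng.standard_normal(16000)   # 1 second each at 16 kHz
snr_db = rng.uniform(0.0, 5.0)    # uniform SNR, as in the described setup
mixture, s2_scaled = mix_at_snr(s1, s2, snr_db)
```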
“…Index Terms: multi-talker speech separation, permutation invariant training, latency-controlled BLSTM, speaker tracing I. INTRODUCTION Many advancements have been observed for monaural multi-talker speech separation [1], [2], [3], [4], [5], [6], [7], [8], [9], known as cocktail party problem [10], which is meaningful to many practical applications, such as human-machine interaction, automatic meeting transcription, etc. With the development of deep learning [11], a lot of innovations have been proposed, such as deep clustering [3], [4], deep attractor network [5], time-domain audio separation network [6], [9] and permutation invariant training (PIT) [7], [8].…”
confidence: 99%
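Permutation invariant training, mentioned in this snippet, resolves the ambiguity of which network output corresponds to which reference speaker by taking the minimum loss over all output-to-speaker assignments. A minimal sketch, using MSE as a stand-in for the training loss:

```python
import itertools
import numpy as np

def pit_mse(est, ref):
    # est, ref: arrays of shape (n_sources, T). Try every ordering of the
    # estimated sources and keep the one with the lowest MSE vs. references.
    n = ref.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((est[list(perm)] - ref) ** 2)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Evaluating all permutations costs n! loss computations, which is cheap for the two- or three-speaker settings these papers consider.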
“…The topic of single-channel source separation has been examined extensively over the last few years, trying to solve the cocktail party problem with techniques such as Deep Clustering (DPCL) [3], Permutation Invariant Training (PIT) [4] and TasNet [5,6]. In DPCL, a neural network is trained to map each time-frequency bin to an embedding vector in a way that embedding vectors of the same speaker form a cluster in the embedding space.…”
Section: Introduction
confidence: 99%
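The DPCL recipe this snippet describes, embedding each time-frequency bin and grouping embeddings of the same speaker, can be sketched by clustering the embedding vectors per speaker. The embeddings below are synthetic stand-ins for a trained network's output, and the k-means routine is a minimal illustration, not the cited implementation.

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Farthest-point initialization, then standard Lloyd iterations.
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

rng = np.random.default_rng(0)
# 400 "T-F bin" embeddings: first 200 from speaker A, last 200 from speaker B
emb = np.concatenate([rng.normal(0.0, 0.1, (200, 20)),
                      rng.normal(1.0, 0.1, (200, 20))])
labels = kmeans(emb, 2)
masks = [labels == j for j in range(2)]   # one binary T-F mask per speaker
```

The cluster assignments play the role of binary masks: bins labeled with the same cluster are attributed to the same speaker.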