TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

Luo, Yi; Mesgarani, Nima

doi:10.1109/icassp.2018.8462116

Cited by 485 publications

(374 citation statements)

References 24 publications

(45 reference statements)

Supporting

Mentioning

374

Contrasting


“…We consider the update that does not modify the phase of $s_j$ (i.e., with a '+' sign in (12)), as it corresponds to the equality case of Eq. (7). Under such an update, $\psi(s) = \psi^{+}(s, y)$, which shows that $\psi^{+}$ is an auxiliary function for $\psi$.…”

Section: B Auxiliary Function

mentioning

confidence: 85%

Online Spectrogram Inversion for Low-Latency Audio Source Separation

Magron

Virtanen

2020

IEEE Signal Process. Lett.


Audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a spectrogram inversion algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has been exploited successfully in several recent works. However, this algorithm suffers from two drawbacks, which we address in this paper. First, it has originally been introduced in a heuristic fashion: we propose here a rigorous optimization framework in which MISI is derived, thus proving the convergence of this algorithm. Besides, while MISI operates offline, we propose here an online version of MISI called oMISI, which is suitable for low-latency source separation, an important requirement for e.g., hearing aids applications. oMISI also allows one to use alternative phase initialization schemes exploiting the temporal structure of audio signals. Experiments conducted on a speech separation task show that oMISI performs as well as its offline counterpart, thus demonstrating its potential for real-time source separation.



“…The dataset is the same as the two-talker mixed dataset in [3], [4], [6], [7], [8], except that the sample rate is 16 kHz. It is generated by mixing the utterances in WSJ0 corpus at various signal-to-noise ratios uniformly chosen between 0 dB and 5 dB, and has 20k, 5k and 3k mixtures for training, validation and testing respectively.…”

Section: A Experimental Setup

mentioning

confidence: 99%
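The mixing procedure described in the statement above can be illustrated with a minimal sketch. It assumes the signal-to-noise ratio is defined as the power ratio between the two utterances and substitutes random noise for the WSJ0 utterances; the function name `mix_at_snr` and all variable names are hypothetical and not taken from the cited paper.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio is `snr_db`, then add.

    Assumption: SNR is the ratio of mean signal powers over the overlapping segment.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_target = np.mean(target ** 2)
    p_interferer = np.mean(interferer ** 2)
    # Gain that brings the interferer to the requested relative level.
    gain = np.sqrt(p_target / (p_interferer * 10 ** (snr_db / 10)))
    scaled = gain * interferer
    return target + scaled, scaled

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)        # placeholder for 1 s of speech at 16 kHz
s2 = rng.standard_normal(16000)        # placeholder for a second speaker
snr_db = rng.uniform(0.0, 5.0)         # drawn uniformly between 0 dB and 5 dB
mixture, scaled = mix_at_snr(s1, s2, snr_db)

# Check the level actually achieved in the mixture.
achieved = 10 * np.log10(np.mean(s1 ** 2) / np.mean(scaled ** 2))
```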

“…Index Terms: multi-talker speech separation, permutation invariant training, latency-controlled BLSTM, speaker tracing I. INTRODUCTION Many advancements have been observed for monaural multi-talker speech separation [1], [2], [3], [4], [5], [6], [7], [8], [9], known as cocktail party problem [10], which is meaningful to many practical applications, such as human-machine interaction, automatic meeting transcription etc. With the development of deep learning [11], a lot of innovations have been proposed, such as deep clustering [3], [4], deep attractor network [5], time-domain audio separation network [6], [9] and permutation invariant training (PIT) [7], [8].…”

mentioning

confidence: 99%

Utterance-level Permutation Invariant Training with Latency-controlled BLSTM for Single-channel Multi-talker Speech Separation

Huang

Cheng

Zhang

et al. 2019

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)


Utterance-level permutation invariant training (uPIT) has achieved promising progress on single-channel multitalker speech separation task. Long short-term memory (LSTM) and bidirectional LSTM (BLSTM) are widely used as the separation networks of uPIT, i.e. uPIT-LSTM and uPIT-BLSTM. uPIT-LSTM has lower latency but worse performance, while uPIT-BLSTM has better performance but higher latency. In this paper, we propose using latency-controlled BLSTM (LC-BLSTM) during inference to fulfill low-latency and good-performance speech separation. To find a better training strategy for BLSTM-based separation network, chunk-level PIT (cPIT) and uPIT are compared. The experimental results show that uPIT outperforms cPIT when LC-BLSTM is used during inference. It is also found that the inter-chunk speaker tracing (ST) can further improve the separation performance of uPIT-LC-BLSTM. Evaluated on the WSJ0 two-talker mixed-speech separation task, the absolute gap of signal-to-distortion ratio (SDR) between uPIT-BLSTM and uPIT-LC-BLSTM is reduced to within 0.7 dB.
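As a rough illustration of the utterance-level permutation invariant criterion mentioned in this abstract, the sketch below picks a single speaker assignment for the whole utterance by minimizing the error over all permutations. It assumes a plain mean-squared error on time-domain signals and uses synthetic data; the cited work's actual loss and features may differ, and the name `upit_mse` is hypothetical.

```python
from itertools import permutations

import numpy as np

def upit_mse(estimates, references):
    """Return (lowest loss, best permutation) over whole-utterance assignments.

    estimates, references: arrays of shape (num_speakers, num_samples).
    Assumption: mean-squared error stands in for the training loss.
    """
    num_speakers = references.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # Reorder the estimates and compare against the fixed reference order.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 8000))
# Estimates arrive in swapped speaker order with a little noise added.
ests = refs[::-1] + 0.01 * rng.standard_normal((2, 8000))
loss, perm = upit_mse(ests, refs)
```

Here `perm` records which estimate is matched to each reference; because the search enumerates all permutations, its cost grows factorially with the number of speakers, which is acceptable for the two-talker case discussed here.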


“…The topic of single-channel source separation has been examined extensively over the last few years, trying to solve the cocktail party problem with techniques such as Deep Clustering (DPCL) [3], Permutation Invariant Training (PIT) [4] and TasNet [5,6]. In DPCL, a neural network is trained to map each time-frequency bin to an embedding vector in a way that embedding vectors of the same speaker form a cluster in the embedding space.…”

Section: Introduction

mentioning

confidence: 99%

End-to-End Training of Time Domain Audio Separation and Recognition

Neumann

Kinoshita

Drude

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multispeaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0 % on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

