Deep CASA for Talker-independent Monaural Speech Separation

Liu, Yuzhou; Delfarah, Masood; Wang, DeLiang

doi:10.1109/icassp40776.2020.9054572

Cited by 17 publications

(15 citation statements)

References 24 publications

“…In addition, a speech enhancement network is used on top of the separation model to further reduce WER. Our focus in this study is on the separation model, and we can expect further improvement by introducing speech enhancement in future work (see [25]).…”

Section: Evaluation Results (mentioning)

confidence: 99%

Time-Domain Loss Modulation Based on Overlap Ratio for Monaural Conversational Speaker Separation

Taherian

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Existing speaker separation methods deliver excellent performance on fully overlapped signal mixtures. To apply these methods in daily conversations that include occasional concurrent speakers, recent studies incorporate both overlapped and non-overlapped segments in the training data. However, such training data can degrade the separation performance due to triviality of non-overlapped segments where the model reflects the input to the output. We propose a new loss function for speaker separation based on permutation invariant training that dynamically reweighs losses using the segment overlap ratio. The new loss function emphasizes overlapped regions while deemphasizing the segments with single speakers. We demonstrate the effectiveness of the proposed loss function on an automatic speech recognition (ASR) task. Experiments on the recently introduced LibriCSS corpus show that our proposed single-channel method produces consistent improvements compared to baseline methods.
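The abstract above describes a permutation invariant training (PIT) loss that is reweighted by the segment overlap ratio. The following is a minimal sketch of that idea only, not the authors' loss: the abstract does not give the modulation formula, so the linear weighting, the `floor` parameter, the mean-squared-error pair loss, and all function names are assumptions made for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, targets, pair_loss=mse):
    """Utterance-level PIT: try every assignment of estimated sources to
    target speakers and return (lowest mean loss, best permutation)."""
    best_loss, best_perm = None, None
    for perm in permutations(range(len(targets))):
        loss = sum(
            pair_loss(estimates[i], targets[p]) for i, p in enumerate(perm)
        ) / len(targets)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def overlap_weighted_pit_loss(estimates, targets, overlap_ratio, floor=0.1):
    """Scale the PIT loss by a weight that grows with the overlap ratio.

    ASSUMPTION: the weight is linear in overlap_ratio (0.0 to 1.0) with a
    hypothetical lower bound `floor`, so single-speaker segments still
    contribute a little. The cited paper's actual weighting may differ.
    """
    loss, perm = pit_loss(estimates, targets)
    weight = floor + (1.0 - floor) * overlap_ratio
    return weight * loss, perm
```

With two sources, `pit_loss` compares both possible speaker assignments, and a segment with `overlap_ratio=0.0` contributes only `floor` times its PIT loss under this assumed weighting.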

“…Although there are considerably low overall performances throughout the evaluation, it is important to remember that these techniques were evaluated when running in an online manner. Many of the state-of-the-art source separation techniques (mainly based on deep learning) do not run in such a way, opting to be fed full audio recordings [33, 35]. This lends to higher performances since information in the future of the current window can also be utilized to obtain a good separation performance.…”

Section: Discussion (mentioning)

confidence: 99%

A Corpus-Based Evaluation of Beamforming Techniques and Phase-Based Frequency Masking

Rascón

2021

Sensors

Beamforming is a type of audio array processing technique used for interference reduction, sound source localization, and as a pre-processing stage for audio event classification and speaker identification. The auditory scene analysis community can benefit from a systemic evaluation and comparison between different beamforming techniques. In this paper, five popular beamforming techniques are evaluated in two different acoustic environments, while varying the number of microphones, the number of interferences, and the direction-of-arrival error, by using the Acoustic Interactions for Robot Audition (AIRA) corpus and a common software framework. Additionally, a highly efficient phase-based frequency masking beamformer is also evaluated, which is shown to outperform all five techniques. Both the evaluation corpus and the beamforming implementations are freely available and provided for experiment repeatability and transparency. Raw results are also provided to the reader as a complement to this work, to facilitate an informed decision of which technique to use. Finally, the insights and tendencies observed from the evaluation results are presented.

“…This required a long STFT time window. This requirement increased the minimum delay of the system, which limited its applicability in real-time and low-latency applications, therefore, more and more research has begun to turn to time-domain methods [12, 23, 24, 26, 27, 29].…”

Section: Methods (mentioning)

confidence: 99%

“…With the development of big data and the improvement of computing power, deep learning achieves great success in time series signal processing such as speech recognition, speech separation [12, 15, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], and communication signal modulation recognition [39]. These tasks demonstrate the powerful feature extraction and timing signal processing capabilities of deep learning.…”

Section: Related Work (mentioning)

confidence: 99%

Single-Channel Blind Source Separation of Spatial Aliasing Signal Based on Stacked-LSTM

Zhao

Xiujuan

Wang

et al. 2021

Sensors

Aiming at the problem of insufficient separation accuracy of aliased signals in space Internet satellite-ground communication scenarios, a stacked long short-term memory network (Stacked-LSTM) separation method based on deep learning is proposed. First, the coding feature representation of the mixed signal is extracted. Then, the long sequence input is divided into smaller blocks through the Stacked-LSTM network with the attention mechanism of the SE module, and the deep feature mask of the source signal is trained to obtain the Hadamard product of the mask of each source and the coding feature of the mixed signal, which is the encoding feature representation of the source signal. Finally, the characteristics of the source signal are decoded by 1-D convolution to obtain the original waveform. The negative scale-invariant source-to-noise ratio (SISNR) is used as the loss function of network training, that is, the evaluation index of single-channel blind source separation performance. The results show that in the single-channel separation of spatially aliased signals, the Stacked-LSTM method improves SISNR by 10.09∼38.17 dB compared with the two classic separation algorithms of ICA and NMF and the three deep learning separation methods of TasNet, Conv-TasNet and Wave-U-Net. The Stacked-LSTM method has better separation accuracy and noise robustness.
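The abstract above names negative SISNR as the training loss. As a rough illustration, the sketch below computes the commonly used definition: project the zero-mean estimate onto the zero-mean target, then take the ratio of projected energy to residual energy in dB. The function names, the `eps` stabilizer, and the mean removal step are assumptions for this example; the cited paper's exact formulation is not given in the abstract.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sequences.

    ASSUMPTION: both signals are made zero-mean first, a common
    convention that the abstract does not confirm.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target to isolate the target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]

    # Everything left over after the projection is treated as noise.
    noise = [e - p for e, p in zip(est, projection)]
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((proj_energy + eps) / (noise_energy + eps))


def neg_si_snr_loss(estimate, target):
    """Training loss: lower is better, so negate the SI-SNR."""
    return -si_snr(estimate, target)
```

Because the estimate is projected onto the target, rescaling the estimate by a constant leaves the value essentially unchanged, which is the property that makes the measure scale-invariant.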
