Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training

Taherian, Hassan; Tan, Kok Choon; Wang, De Liang

doi:10.1109/taslp.2022.3202129

Cited by 15 publications

(7 citation statements)

References 46 publications

“…Other comparison methods, i.e. MISO1-BF-MISO3 [17], Convolutional Prediction [60], MC-CSM with LBT [61] and TFGridNet [30], all perform neural beamforming plus neural post-processing, and achieve much better ASR performance than the time-domain end-to-end networks. This demonstrates the advantage of combining beamforming and deep learning techniques.…”

Section: Results On SMS-WSJ (mentioning)

confidence: 99%

Multichannel Speech Separation with Narrow-band Conformer

Quan, Li

2022

Interspeech 2022

This work proposes a neural network, named SpatialNet, to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation. The proposed network performs end-to-end speech enhancement in the short-time Fourier transform (STFT) domain. It is mainly composed of interleaved narrow-band and cross-band blocks, which exploit narrow-band and cross-band spatial information respectively. The narrow-band blocks process frequencies independently, using a self-attention mechanism for spatial-feature-based speaker clustering and temporal convolutional layers for temporal smoothing/filtering. The cross-band blocks process frames independently, using a full-band linear layer to learn the correlation between all frequencies and frequency convolutional layers to learn the correlation between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network is indeed performing speaker clustering (demonstrated by attention maps).
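The split between the two block types described in this abstract can be illustrated with a small NumPy sketch. This is not SpatialNet itself: the smoothing kernel and the matrix `W` are hypothetical stand-ins for the temporal convolutional layers and the full-band linear layer, and the `(frequency, frame, hidden)` tensor layout is an assumption. The sketch only shows which axis each step treats as independent.

```python
import numpy as np

def narrow_band_step(x, kernel):
    """Filter along time, handling each frequency independently.

    x: array of shape (freqs, frames, hidden); kernel: 1-D filter taps.
    Hypothetical stand-in for a narrow-band block.
    """
    n_freq, _, n_hidden = x.shape
    out = np.empty_like(x)
    for f in range(n_freq):
        for h in range(n_hidden):
            # Each (frequency, hidden) pair is its own time sequence.
            out[f, :, h] = np.convolve(x[f, :, h], kernel, mode="same")
    return out

def cross_band_step(x, W):
    """Mix all frequencies, handling each frame independently.

    x: array of shape (freqs, frames, hidden); W: (freqs, freqs).
    Hypothetical stand-in for a full-band linear layer.
    """
    return np.einsum("gf,fth->gth", W, x)

# Assumed toy sizes: 4 frequencies, 6 frames, 2 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 2))
y = cross_band_step(narrow_band_step(x, np.array([0.25, 0.5, 0.25])),
                    rng.standard_normal((4, 4)))
print(y.shape)
```

Interleaving the two steps, as in the last lines, lets information propagate along both the time axis and the frequency axis even though each step on its own touches only one of them.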

“…In dynamically changing scenes, models trained with permutation-invariant training in static scenarios could mix up signals from speakers, i.e., the output of the speaker could switch. To avoid such switching, our approach could be combined with location-based training as in [36] or online clustering of frame-wise speaker embeddings as proposed in [37].…”

Section: Effect Of Model Size and Grouping (mentioning)

confidence: 99%

Binaural Multichannel Blind Speaker Separation With a Causal Low-Latency and Low-Complexity Approach

Westhausen, Meyer

2024

IEEE Open J. Signal Process.

In this article, we introduce a causal low-latency low-complexity approach for binaural multichannel blind speaker separation in noisy reverberant conditions. The model, referred to as Group Communication Binaural Filter and Sum Network (GCBFSnet), predicts complex filters for filter-and-sum beamforming in the time-frequency domain. We apply Group Communication (GC), i.e., latent model variables are split into groups and processed with a shared sequence model with the aim of reducing the complexity of a simple model only containing one convolutional and one recurrent module. With GC we are able to reduce the size of the model by up to 83% and the complexity up to 73% compared to the model without GC, while mostly retaining performance. Even for the smallest model configuration, GCBFSnet matches the performance of a low-complexity TasNet baseline in most metrics despite the larger size and higher number of required operations of the baseline.

Index Terms: Binaural, low-latency, multi-channel, real-time, speaker-separation.
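The two ideas named in this abstract, filter-and-sum beamforming with complex filters and Group Communication, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not GCBFSnet: it assumes one complex tap per channel and time-frequency bin with no conjugation, and it uses a plain shared matrix `W_shared` (a hypothetical name) in place of the shared sequence model.

```python
import numpy as np

def filter_and_sum(X, W):
    """Apply per-bin complex weights to each channel and sum over channels.

    X, W: complex arrays of shape (channels, frames, freqs).
    Assumption: single-tap filters, applied without conjugation.
    Returns an array of shape (frames, freqs).
    """
    return np.sum(W * X, axis=0)

def grouped_shared_linear(h, W_shared, groups):
    """Split the hidden dimension into groups and reuse one small matrix.

    h: (frames, hidden); W_shared: (hidden // groups, hidden // groups).
    Hypothetical stand-in for a shared sequence model applied per group.
    """
    n_frames, n_hidden = h.shape
    if n_hidden % groups != 0:
        raise ValueError("hidden size must be divisible by groups")
    hg = h.reshape(n_frames, groups, n_hidden // groups)
    # The same weights are applied to every group.
    return (hg @ W_shared).reshape(n_frames, n_hidden)

# Assumed toy sizes: 2 channels, 5 frames, 4 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5, 4)) + 1j * rng.standard_normal((2, 5, 4))
W = np.full((2, 5, 4), 0.5 + 0j)   # averaging the two channels
Y = filter_and_sum(X, W)
print(Y.shape)
```

In this sketch, a hidden size of 8 split into 4 groups needs a 2 x 2 shared matrix (4 weights) where an ungrouped layer would need 8 x 8 (64 weights), which illustrates where the size reduction comes from. The actual savings reported in the abstract depend on the full model.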

“…Recently LBT was proposed to resolve the permutation ambiguity problem in multi-channel talker-independent speaker separation [11]. LBT leverages distinct spatial locations of multiple speakers in physical space and produces superior separation performance compared to PIT.…”

Section: Location-based Training For CSS (mentioning)

confidence: 99%

“…In our previous study, we introduced a new training criterion, named location-based training (LBT), to assign DNN outputs according to speaker locations in physical space [11]. We showed that LBT performs better than PIT for fully overlapped utterances in simulated and matched reverberant conditions.…”

Section: Introduction (mentioning)

confidence: 99%

Location-Based Training for Multi-Channel Talker-Independent Speaker Separation

Taherian, Tan, Wang

2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The performance of automatic speech recognition (ASR) systems severely degrades when multi-talker speech overlap occurs. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming fixed array geometry, LBT outperforms widely used permutation-invariant training in fully overlapped utterances and matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.
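The contrast between permutation-invariant training (PIT) and a location-based assignment can be sketched as two loss functions. This is a minimal illustration, not the authors' implementation: it assumes a mean-squared-error objective and assumes that speakers are ordered by azimuth angle, whereas the text above only states that outputs are assigned according to speaker locations. The names `pit_loss`, `lbt_loss`, and `azimuths` are hypothetical.

```python
import itertools
import numpy as np

def mse(a, b):
    """Mean squared error between two signals."""
    return float(np.mean(np.abs(a - b) ** 2))

def pit_loss(est, ref):
    """Search all output-to-speaker permutations and keep the best one."""
    n = len(ref)
    return min(
        float(np.mean([mse(est[i], ref[p[i]]) for i in range(n)]))
        for p in itertools.permutations(range(n))
    )

def lbt_loss(est, ref, azimuths):
    """Fix the assignment by location: output i targets the speaker with
    the i-th smallest azimuth (assumed ordering criterion)."""
    order = np.argsort(azimuths)
    return float(np.mean([mse(est[i], ref[j]) for i, j in enumerate(order)]))

# Assumed toy example: two speakers at 90 and 30 degrees.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 100))
azimuths = np.array([90.0, 30.0])
est = ref[[1, 0]]          # output 0 carries the 30-degree speaker
print(pit_loss(est, ref), lbt_loss(est, ref, azimuths))
```

With a fixed location-based assignment there is no search over permutations, so the cost of the loss grows linearly with the number of speakers rather than factorially, and each output is tied to a consistent spatial position throughout training.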
