Location-Based Training for Multi-Channel Talker-Independent Speaker Separation

Taherian, Hassan; Tan, Kok Choon; Wang, DeLiang

doi:10.1109/icassp43922.2022.9747141

Cited by 7 publications

(2 citation statements)

References 32 publications

“…Recently, there has been a lot of exploration in the field of multi-party meetings scenarios [1,2,3,4,5]. Progress has also been advanced with several challenges [6,7,8,9,10,11] and datasets [12,13,14,15,16] specifically focusing on this field.…”

Section: Introduction (mentioning)

confidence: 99%

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

Liang, Chen, et al. 2022

2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)

The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.

“…The prior studies have laid the foundation for recent progress. With the advent of deep learning, speech separation has seen major progress [18]-[28], even with different speaker overlapping ratios [29]-[37]. In a neural architecture, multiple speaker streams compete and segregate either with a masking or regression mechanism.…”

mentioning

confidence: 99%

USEV: Universal Speaker Extraction With Visual Cue

Pan 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies focus mostly on speaker extraction from a highly overlapped multi-talker speech mixture. However, the target-interference speaker overlapping ratio can vary over a wide range, from 0% to 100%, in natural speech communication. Furthermore, the target speaker may be absent from the speech mixture. Speech mixtures in such universal multi-talker scenarios are described as general speech mixtures. The speaker extraction algorithm requires an auxiliary reference, such as a video recording or pre-recorded speech, to form top-down auditory attention on the target speaker. We advocate that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, as the auxiliary reference for disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker extraction network with a visual cue that works for all multi-talker scenarios. In addition, we propose a scenario-aware differentiated loss function for network training, to balance the network performance over different target-interference speaker pairing scenarios. The experimental results show that our proposed method outperforms various competitive baselines for general speech mixtures in terms of signal fidelity.