Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Yu, Fan; Zhang, Shiliang; Guo, Pengcheng; Fu, Yihui; Du, Zhihao; Zheng, Siqi; Huang, Weilong; Xie, Lei; Wang, DeLiang; Ye, Qian; Lee, Kong Aik; Yan, Zhijie; Ma, Bin; Xu, Xin; Bu, Hui

doi:10.1109/icassp43922.2022.9746270

Cited by 16 publications

(2 citation statements)

References 41 publications


“…Another well-known framework is target-speaker voice activity detection (TS-VAD) [11], which estimates the voice activity of all speakers at the same time with the help of their speaker embeddings. TS-VAD has shown promising performance in many tasks, such as CHiME-6 [2], DIHARD-III [4], and AliMeeting [12].…”

Section: Introduction (mentioning)

confidence: 99%

DiaCorrect: End-to-end error correction for speaker diarization

Han, Cao, Lu et al. 2022

Preprint


In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods may bring additional benefits, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper we propose an end-to-end error correction framework, termed DiaCorrect, to refine the initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between the input mixture and its corresponding speaker activity, DiaCorrect can automatically adapt the initial speaker activity to minimize diarization errors. Without bells and whistles, experiments on LibriSpeech-based 2-speaker meeting-like data show that the self-attentive end-to-end neural diarization (SA-EEND) baseline with DiaCorrect reduces its diarization error rate (DER) by about 62.4% relative, from 12.31% to 4.63%. Our source code is available online at https://github.com/jyhan03/diacorrect.
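As a quick check of the figures quoted in this abstract, the relative DER reduction can be recomputed from the two absolute DER values. The snippet below is only an illustrative sketch; the helper name `relative_reduction` is hypothetical and not taken from the DiaCorrect code base.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from `baseline` to `improved`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# DER values quoted in the abstract: 12.31% (SA-EEND) and 4.63% (with DiaCorrect).
der_reduction = relative_reduction(12.31, 4.63)
print(f"Relative DER reduction: {der_reduction:.1f}%")
```

The result rounds to 62.4%, which matches the abstract when the figure is read as a relative (not absolute) reduction.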


“…Recently, there has been much exploration of multi-party meeting scenarios [1,2,3,4,5]. Progress has also been driven by several challenges [6,7,8,9,10,11] and datasets [12,13,14,15,16] that focus specifically on this field. One major problem in this scenario is speech overlap.…”

Section: Introduction (mentioning)

confidence: 99%

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

Liang, Chen et al. 2022

2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)

Self Cite


The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
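To make the idea of "speaker transcriptions separated by a special token" concrete, the sketch below builds an SOT-style training target from overlapping utterances. It is a minimal illustration under stated assumptions, not the authors' implementation: the token string `<sc>`, the `Utterance` type, and the helper names are hypothetical, and ordering utterances by start time is assumed here as the serialization rule.

```python
from dataclasses import dataclass
from typing import List

SC_TOKEN = "<sc>"  # hypothetical speaker-change token; the actual symbol is an assumption


@dataclass
class Utterance:
    speaker: str
    start: float  # start time in seconds
    text: str


def serialize_sot(utterances: List[Utterance], sc_token: str = SC_TOKEN) -> str:
    """Join transcriptions in start-time order, separated by the speaker-change token."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sc_token} ".join(u.text for u in ordered)


def split_sot(serialized: str, sc_token: str = SC_TOKEN) -> List[str]:
    """Recover the per-utterance transcriptions from a serialized target."""
    return [segment.strip() for segment in serialized.split(sc_token)]


example = [
    Utterance(speaker="B", start=1.2, text="i agree"),
    Utterance(speaker="A", start=0.0, text="let us begin"),
]
target = serialize_sot(example)
print(target)
print(split_sot(target))
```

Each `<sc>` in the target marks a speaker change, which is the position BA-SOT is described as modeling explicitly through its speaker change detection task.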
