ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413544

A Real-Time Speaker Diarization System Based on Spatial Spectrum

Abstract: In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting. We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks: (1) to segment and separate overlapping speech from two speakers; (2) to estimate the number of speakers when participants may enter or leave the conversation at any time; (3) to provide accurate speaker identification on short text-independent utter…


Cited by 17 publications (5 citation statements)
References 17 publications (14 reference statements)
“…The DOA of the sound source is proved to be helpful [8]. We train a neural-net-based DOA estimator to obtain a 36-dim probability vector representing the azimuth angles that divide the space with ten-degree intervals.…”
Section: Front-end Processing (mentioning)
confidence: 99%
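The quoted statement describes a DOA estimator that outputs a 36-dimensional probability vector, one entry per ten-degree azimuth bin. A minimal sketch of how such an output could be mapped back to an azimuth angle (the bin-center convention and the function name here are assumptions, not taken from the paper):

```python
import numpy as np

def azimuth_from_probs(probs, bin_width_deg=10):
    """Map a 36-dim DOA probability vector to an azimuth estimate in degrees.

    Assumes bin i covers [i*10, (i+1)*10) degrees and takes the most
    probable bin's lower edge as the estimate; the cited paper's exact
    bin convention may differ.
    """
    probs = np.asarray(probs, dtype=float)
    assert probs.shape == (36,), "expected one probability per 10-degree bin"
    return int(np.argmax(probs)) * bin_width_deg

# A peak in bin 9 corresponds to an azimuth around 90 degrees.
p = np.zeros(36)
p[9] = 1.0
print(azimuth_from_probs(p))  # 90
```

A soft-decision variant (e.g. a probability-weighted circular mean over bin centers) would be smoother, but the hard argmax above matches the per-bin classification framing of the quote.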
“…We release baseline systems along with the Train and Eval data for quick start and reproducible research. For the 8-channel data of AliMeeting recorded by microphone array, we select the first channel to obtain Ali-far, and adopt the CDDMA beamformer [41,42] on the 8-channel data to generate Ali-far-bf. We use prefix Train-*, Eval-* and Test-* to denote generated data associated with Train, Eval and Test sets.…”
Section: Datasets, Tracks and Baselines (mentioning)
confidence: 99%
“…The AliMeeting corpus contains far-field overlapped audio (Ali-far), as well as the corresponding near-field audio (Ali-near), which only records and transcribes the speech of a single speaker. The CDDMA Beamformer [34,35] is applied to Ali-far to produce Ali-far-bf. To evaluate the performance in a single-talker scenario, Test_Net and Test_Meeting are adopted.…”
Section: Dataset (mentioning)
confidence: 99%