ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747381
The Volcspeech System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

Cited by 4 publications (2 citation statements)
References 17 publications
“…Secondly, by executing the VarArray processing on the microphone array device, we can limit the device-to-ASR-server data transmission needs to only two audio signals, which provides a large practical benefit. Thirdly, training our ASR model is much simpler than other multi-channel multi-talker ASR models as the latter ones require simulating multi-channel training data with realistic phase information [19,26,27] while our model does not. Finally, our ASR model can be easily fine-tuned by using the VarArray outputs for real multi-channel recordings and the corresponding time-annotated reference transcriptions.…”
Section: T-SOT
confidence: 99%
“…These methods can produce highly accurate transcriptions by modeling the multi-talker multi-turn speech signals effectively in terms of both the acoustic and linguistic aspects [25]. However, most studies were conducted with monaural audio, and the existing multi-channel-based studies employed modules that are only applicable for offline inference [19,26,27]. Also, less considerate multi-channel extensions of the ASR models could suffer from the high data transmission cost from the microphone array device to the ASR server [28].…”
Section: Introduction
confidence: 99%