ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747381
The Volcspeech System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

Cited by 4 publications (2 citation statements)
References 17 publications
“…Secondly, by executing the VarArray processing on the microphone array device, we can limit the device-to-ASR-server data transmission needs to only two audio signals, which provides a large practical benefit. Thirdly, training our ASR model is much simpler than other multi-channel multi-talker ASR models as the latter ones require simulating multi-channel training data with realistic phase information [19,26,27] while our model does not. Finally, our ASR model can be easily fine-tuned by using the VarArray outputs for real multi-channel recordings and the corresponding time-annotated reference transcriptions.…”
Section: T-SOT
confidence: 99%
“…These methods can produce highly accurate transcriptions by modeling the multi-talker multi-turn speech signals effectively in terms of both the acoustic and linguistic aspects [25]. However, most studies were conducted with monaural audio, and the existing multi-channel-based studies employed modules that are only applicable for offline inference [19,26,27]. Also, less considerate multi-channel extensions of the ASR models could suffer from the high data transmission cost from the microphone array device to the ASR server [28].…”
Section: Introduction
confidence: 99%