ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746074
Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech


Cited by 11 publications (8 citation statements). References 27 publications.
“…There is a large amount of recently published work investigating the important issue of generalizing audio-only M-T approaches to scenarios with a larger number of speakers and more arbitrary turn-taking [23,9,24,25]. However, we also found it important to maintain accuracy for the multi-talker models on both overlapping speech as well as single-speaker utterances.…”
Section: Simulated Audio-visual Overlapping Speech Corpora (mentioning)
confidence: 79%
“…The ORC WER [2] is a special case of the MIMO WER which additionally keeps the temporal order across speakers intact. It can be computed with Eq.…”
Section: Optimal Reference Combination WER (ORC WER) (mentioning)
confidence: 99%
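The excerpt truncates before the equation it cites, so the following is a minimal brute-force sketch of how the ORC WER of [2] can be computed, assuming its usual definition: each reference utterance is assigned to one system output channel, references assigned to the same channel are concatenated in their original temporal order, and the assignment minimising the total word-level edit distance is chosen. The names `edit_distance` and `orc_wer` are illustrative, not MeetEval's API.

```python
from itertools import product

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def orc_wer(ref_utterances, hyp_channels):
    """Brute-force ORC WER: try every assignment of reference
    utterances to output channels; within a channel, the assigned
    references are concatenated in their original temporal order."""
    n_words = sum(len(u) for u in ref_utterances)
    best = float("inf")
    # One channel index per reference utterance.
    for assignment in product(range(len(hyp_channels)),
                              repeat=len(ref_utterances)):
        total = 0
        for c, hyp in enumerate(hyp_channels):
            concat = [w for u, a in zip(ref_utterances, assignment)
                      if a == c for w in u]
            total += edit_distance(concat, hyp)
        best = min(best, total)
    return best / n_words
```

For example, `orc_wer([["a", "b"], ["c"]], [["a", "b"], ["c"]])` returns 0.0. The exhaustive search is exponential in the number of utterances; a practical implementation would compute the same quantity with dynamic programming.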
“…The cpWER is available in the Kaldi speech recognition toolkit [1], but not easily accessible. WER metrics that emerged recently, such as the ORC WER [2] or MIMO WER [3], have no published implementation outside of MeetEval.…”
Section: Introduction (mentioning)
confidence: 99%
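For context, here is a similarly hedged sketch of cpWER as commonly defined: concatenate each speaker's words in time order, then minimise the summed edit distance over permutations of the hypothesis speaker labels. It reuses `edit_distance` from the sketch above and is an illustration, not the Kaldi or MeetEval implementation.

```python
from itertools import permutations

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Brute-force cpWER. Each dict value is one speaker's utterances,
    already concatenated in time order into a single list of words."""
    refs = list(ref_by_speaker.values())
    hyps = list(hyp_by_speaker.values())
    # Pad the smaller side with empty word streams so the permutation
    # matches every reference stream to some hypothesis stream.
    while len(hyps) < len(refs):
        hyps.append([])
    while len(refs) < len(hyps):
        refs.append([])
    n_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / n_words
```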
“…A separately trained ASR system can then be used to transcribe each segment found by speaker diarisation, and obtain speaker-attributed ASR output over long audio streams [2,3]. Recently, end-to-end methods have been proposed for jointly modelling some modules in a speaker diarisation pipeline with an ASR system [4-12].…”
Section: Introduction (mentioning)
confidence: 99%
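The modular pipeline this excerpt describes can be summarised in a short sketch; `diarize` and `transcribe` below are hypothetical stand-ins for a trained diarisation model and a separately trained ASR system, not components from the cited works.

```python
def speaker_attributed_asr(samples, diarize, transcribe):
    """Modular diarise-then-transcribe pipeline: diarisation yields
    speaker-labelled segments (here as sample indices); a separately
    trained ASR system then transcribes each segment independently."""
    output = []
    for start, end, speaker in diarize(samples):
        text = transcribe(samples[start:end])
        output.append({"speaker": speaker, "start": start,
                       "end": end, "text": text})
    return output
```

The end-to-end methods cited in [4-12] replace parts of this two-stage structure with jointly trained models, which is the gap the multi-turn RNN-T approach targets.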