ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746074
Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech


Cited by 11 publications (8 citation statements). References 27 publications.
“…There is a large amount of recently published work investigating the important issue of generalizing audio-only M-T approaches to scenarios with a larger number of speakers and more arbitrary turn-taking [23,9,24,25]. However, we also found it important to maintain accuracy for the multi-talker models on both overlapping speech as well as single-speaker utterances.…”
Section: Simulated Audio-visual Overlapping Speech Corpora (mentioning)
confidence: 79%
“…The ORC WER [2] is a special case of the MIMO WER which additionally keeps the temporal order across speakers intact. It can be computed with Eq.…”
Section: Optimal Reference Combination WER (ORC WER) (mentioning)
confidence: 99%
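The excerpt truncates before the equation it cites, so the following is a minimal brute-force sketch of how the ORC WER of [2] can be computed, assuming its usual definition: each reference utterance is assigned to one system output channel, references assigned to the same channel are concatenated in their original temporal order, and the assignment minimising the total word-level edit distance is chosen. The names `edit_distance` and `orc_wer` are illustrative, not MeetEval's API.

```python
from itertools import product

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution
        prev = cur
    return prev[-1]

def orc_wer(ref_utterances, hyp_channels):
    """Brute-force ORC WER: try every assignment of reference
    utterances to output channels; within a channel, the assigned
    references are concatenated in their original temporal order."""
    n_words = sum(len(u) for u in ref_utterances)
    best = float("inf")
    # One channel index per reference utterance.
    for assignment in product(range(len(hyp_channels)),
                              repeat=len(ref_utterances)):
        total = 0
        for c, hyp in enumerate(hyp_channels):
            concat = [w for u, a in zip(ref_utterances, assignment)
                      if a == c for w in u]
            total += edit_distance(concat, hyp)
        best = min(best, total)
    return best / n_words
```

For example, `orc_wer([["a", "b"], ["c"]], [["a", "b"], ["c"]])` returns 0.0. The exhaustive search is exponential in the number of utterances; a practical implementation would compute the same quantity with dynamic programming.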
“…The cpWER is available in the Kaldi speech recognition toolkit [1], but not easily accessible. WER metrics that emerged recently, such as the ORC WER [2] or MIMO WER [3], have no published implementation outside of MeetEval.…”
Section: Introduction (mentioning)
confidence: 99%
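For context, here is a similarly hedged sketch of cpWER as commonly defined: concatenate each speaker's words in time order, then minimise the summed edit distance over permutations of the hypothesis speaker labels. It reuses `edit_distance` from the sketch above and is an illustration, not the Kaldi or MeetEval implementation.

```python
from itertools import permutations

def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Brute-force cpWER. Each dict value is one speaker's utterances,
    already concatenated in time order into a single list of words."""
    refs = list(ref_by_speaker.values())
    hyps = list(hyp_by_speaker.values())
    # Pad the smaller side with empty word streams so the permutation
    # matches every reference stream to some hypothesis stream.
    while len(hyps) < len(refs):
        hyps.append([])
    while len(refs) < len(hyps):
        refs.append([])
    n_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / n_words
```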
“…A separately trained ASR system can then be used to transcribe each segment found by speaker diarisation, and obtain speaker-attributed ASR output over long audio streams [2,3]. Recently, end-to-end methods have been proposed for jointly modelling some modules in a speaker diarisation pipeline with an ASR system [4-12].…”
Section: Introduction (mentioning)
confidence: 99%
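The modular pipeline this excerpt describes can be summarised in a short sketch; `diarize` and `transcribe` below are hypothetical stand-ins for a trained diarisation model and a separately trained ASR system, not components from the cited works.

```python
def speaker_attributed_asr(samples, diarize, transcribe):
    """Modular diarise-then-transcribe pipeline: diarisation yields
    speaker-labelled segments (here as sample indices); a separately
    trained ASR system then transcribes each segment independently."""
    output = []
    for start, end, speaker in diarize(samples):
        text = transcribe(samples[start:end])
        output.append({"speaker": speaker, "start": start,
                       "end": end, "text": text})
    return output
```

The end-to-end methods cited in [4-12] replace parts of this two-stage structure with jointly trained models, which is the gap the multi-turn RNN-T approach targets.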