2020
DOI: 10.48550/arxiv.2004.09249
Preprint
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Cited by 50 publications (39 citation statements)
References 35 publications
“…However, speech data is notoriously difficult to work with for machine learning practitioners. Recordings of speech come in many flavors: as isolated utterances in separate files (e.g., LibriSpeech [13]); long, continuous recordings of podcasts and conversations (e.g., GigaSpeech [7]); or even multi-channel recordings from multiple microphone arrays (e.g., AMI [10], CHiME-6 [18]). Audio is encoded with a variety of codecs, both common (e.g., PCM, FLAC, OPUS) and obscure (e.g., sphere, shorten).…”
Section: Introduction
Mentioning confidence: 99%
“…Multi-talker speech recognition is focused on recognizing individual speech sources from overlap speech, and is one main challenge for current ASR systems [1,2,3,4,5,6,7,8]. Current solutions for multi-speaker speech recognition can be categorized into two main approaches: (i) performing frontend speech processing based on separation on the overlap speech, then applying ASR to the separated speech signals [9,10,11,12,13,14,15]; or (ii) skipping the explicit separation step and developing a multi-speaker speech recognition system directly using either hybrid [16, 17, ?…”
Section: Introduction
Mentioning confidence: 99%
“…However, current end-to-end approaches have been reported to be strongly overfitted to the environments that they are trained for, not generalising to diverse real-world conditions. Therefore, the winning entries to recent diarisation challenges [9][10][11] are based on the former method, and this will also be the focus of this paper.…”
Section: Introduction
Mentioning confidence: 99%