Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR

Kanda, Naoyuki; Boeddeker, Christoph; Heitkaemper, Jens; Fujita, Yusuke; Horiguchi, Shota; Nagamatsu, Kenji; Haeb‐Umbach, Reinhold

doi:10.48550/arxiv.1905.12230

Cited by 7 publications

(3 citation statements)

References 27 publications

“…Multi-array GSS [13,30] was applied to enhance target speaker speech signals. For track 1, we used oracle speech segmentations and speaker labels, while for track 2, we used the segmentation estimated by the speaker diarization module described in Section 3.…”

Section: Guided Source Separation (GSS) (mentioning)

confidence: 99%

The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

Arora, Raj, Subramanian, et al. 2020

6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020)

This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for automatic speech recognition (ASR). We also report results with different acoustic model architectures, and integrate other techniques such as online multi-channel weighted prediction error (WPE) dereverberation and variational Bayes-hidden Markov model (VB-HMM) based overlap assignment to deal with reverberation and overlapping speakers, respectively. As a result of these efforts, our ASR systems achieve a word error rate of 40.5% and 67.5% on tracks 1 and 2, respectively, on the evaluation set. This is an improvement of 10.8% and 10.4% absolute, over the challenge baselines for the respective tracks.

“…Due to its importance in the front end of speech signal processing, speech separation has been an important research direction in academic and industry fields. It has derived a series of cutting-edge applications in ASR (Automatic Speech Recognition) [28,29], SED (Sound Event Detection) [30,31,32] and other areas, such as call customer service channels [33], multi-speaker meeting minutes [34] and target instruction extraction of smart speakers in domestic scene [35]. Speech enhancement and speech separation methods have undergone a long development.…”

Section: Introduction (mentioning)

confidence: 99%

Using iterative adaptation and dynamic mask for child speech extraction under real-world multilingual conditions

Cheng, Du, Niu, et al. 2023

Speech Communication

“…Recently, the multi-channel speech separation achieves good performance [13,14] and has been successfully integrated into conversation transcription systems [15]. However, the improvement has still been limited with single channel input for the conversational tasks [16,17,18].…”

Section: Introduction (mentioning)

confidence: 99%

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

Wu, Chen, Chen, et al. 2021

Preprint

Self Cite

Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a sub-optimum word error rate (WER). In this paper, we describe our efforts to improve the performance of a single channel speech separation system. Specifically, we investigate a two-stage training scheme that first applies a feature level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model. Meanwhile, to keep the model light-weight, we introduce a modified teacher-student learning technique for model compression. By combining those approaches, we achieve an absolute average WER improvement of 2.70% and 0.77% using models with less than 10M parameters compared with the previous state-of-the-art results on the LibriCSS dataset for utterance-wise evaluation and continuous evaluation, respectively.
