The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines

Barker, Jon; Watanabe, Shinji; Trmal, Jan

doi:10.21437/interspeech.2018-1768

Cited by 246 publications

(188 citation statements)

References 26 publications

Supporting

Mentioning

184

Contrasting

Unclassified


“…Depending on how many arrays were available during test time, the challenge had a single (reference) array and a multiple array track. For more details about the corpus, the reader is referred to [11].…”

Section: Chime-5 Corpus Description (mentioning)

confidence: 99%


An Investigation into the Effectiveness of Enhancement in ASR Training and Test for Chime-5 Dinner Party Transcription

Zorilă

Boeddeker

Doddipatla

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available. However, there has been a longstanding debate whether enhancement should also be carried out on the ASR training data. In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data we show that: (i) cleaning up the training data can lead to substantial error rate reductions, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training. This approach stands in contrast to, and delivers larger gains than, the common strategy reported in the literature to augment the training database with additional artificially degraded speech. Together with an acoustic model topology consisting of initial CNN layers followed by factorized TDNN layers we achieve with 41.6 % and 43.2 % WER on the DEV and EVAL test sets, respectively, a new single-system state-of-the-art result on the CHiME-5 data. This is an 8 % relative improvement compared to the best word error rate published so far for a speech recognizer without system combination.



“…We perform experiments using data from the CHiME-5 challenge which focuses on distant multi-microphone conversational ASR in real home environments [11]. The CHiME-5 data is heavily degraded by reverberation and overlapped speech.…”

Section: Introduction (mentioning)

confidence: 99%

An Investigation into the Effectiveness of Enhancement in ASR Training and Test for Chime-5 Dinner Party Transcription

Zorilă

Boeddeker

Doddipatla

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


“…The CHiME challenge was launched in 2011 to address the problem of recognizing speech recorded with multiple microphones in real, noisy environments, such as a family's living room, a cafe, a busy intersection, on public transport and in pedestrian areas. To develop noise‐robust ASR systems, various approaches based not only on speech processing but also on sound‐source separation and machine learning have been widely investigated, and ASR performance has improved significantly as a result of past challenges.…”

Section: Recent Research Trends In Environmental Sound Processing (mentioning)

confidence: 99%

Environmental sound processing and its applications

Miyazaki

Toda

Hayashi

et al. 2019

IEEJ Transactions Elec Engng


As part of the effort to develop techniques for understanding environments using sound, many studies in the field of computational auditory scene analysis have focused on using computers to perform functions carried out naturally by the human auditory system. Thanks to recent progress in machine‐learning techniques, these environmental sound‐processing techniques have significantly improved and a widening variety of applications has resulted in considerable interest in this field. In this review, we introduce the fundamental techniques of environmental sound processing, as well as recent advances in front‐end and back‐end processing and potential applications for these techniques. Prospects for further progress in the field of environmental sound processing and the challenges still to be overcome are also discussed. © 2019 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.


“…In addition, these methods were tested almost exclusively on small-scale segmented synthetic data and have not been applied to continuous conversational speech audio. Although the recently held CHiME-5 challenge helped the community make a step forward to a realistic setting, it still allowed the use of ground-truth speaker segments [22,23].…”

Section: Introduction (mentioning)

confidence: 99%

Advances in Online Audio-Visual Meeting Transcription

Yoshioka

Huang

Hurvitz

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


This paper describes a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera. The hallmark of the system is its ability to handle overlapped speech, which has been an unsolved problem in realistic settings for over a decade. We show that this problem can be addressed by using a continuous speech separation approach. In addition, we describe an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification, and, if available, prior speaker information for robustness to various real world challenges. All components are integrated in a meeting transcription framework called SRD, which stands for "separate, recognize, and diarize". Experimental results using recordings of natural meetings involving up to 11 attendees are reported. The continuous speech separation improves a word error rate (WER) by 16.1% compared with a highly tuned beamformer. When a complete list of meeting attendees is available, the discrepancy between WER and speaker-attributed WER is only 1.0%, indicating accurate word-to-speaker association. This increases marginally to 1.6% when 50% of the attendees are unknown to the system.
