New Era for Robust Speech Recognition (2017)
DOI: 10.1007/978-3-319-64680-0_14
The CHiME Challenges: Robust Speech Recognition in Everyday Environments

Abstract: The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the CHiME series including a description of the datasets that have been collected and the tasks that have been defined for each edition. In particular the chapter descri…

Cited by 18 publications (7 citation statements)
References 24 publications (38 reference statements)
“…This model (referred to as "d-vector V2" in [13]) has a 3.06% equal error rate (EER) on our internal en-US phone audio test dataset, compared to the 3.55% EER of the one reported in [10]. VoiceFilter: We cannot use a "standard" benchmark corpus for speech separation, such as one of the CHiME challenges [19], because we need a clean reference utterance of each target speaker in order to compute speaker embeddings. Instead, we train and evaluate the VoiceFilter system on our own generated data, derived either from the VCTK dataset [20] or from the LibriSpeech dataset [16].…”
Section: Datasets (mentioning confidence: 99%)
“…However, when operating on ASR transcripts (including recognition errors), the speech-based models were competitive in performance with the text-based models. In particular, prior work has found that WER of ≈ 30% is typical for modern ASR in many real-world settings or without good-quality microphones (Lasecki et al, 2012;Barker et al, 2017). When operating on such ASR output, the RMS error of the speech-based model and the text-based model were comparable.…”
Section: Model (mentioning confidence: 91%)
“…Therefore, several speech corpora were recorded. For instance, the CHiME corpora [11] consist of English speech recordings in different noise conditions. In particular, the CHiME-5 data set is composed of recordings of 4-person dinner parties (host couple and guests).…”
Section: State of the Art: Available Corpora (mentioning confidence: 99%)