Voices Obscured in Complex Environmental Settings (VOICES) corpus

Richey, Colleen; Barrios, María Auxiliadora; Armstrong, Zeb; Bartels, Chris; Franco, Horacio; Graciarena, Martin; Lawson, Aaron; Nandwana, Mahesh Kumar; Stauffer, Allen; Hout, Julien van; Gamble, Paul; Hetherly, Jeff; Stephenson, Cory; Ni, Karl

doi:10.48550/arxiv.1804.05053

Cited by 26 publications

(24 citation statements)

References 3 publications

“…Thus we experimented with VOiCES challenge dataset. VOiCES corpus [15] is released as part of "The voices from a distance challenge 2019" [16] of Interspeech 2019. For the ASR fixed conditions track, the training set consists of 80-hours subset of LibriSpeech corpus…”

Section: Voices Corpus ASR (mentioning)

confidence: 99%

“…In these experiments, we show that the proposed approach improves over the state-of-the-art ASR systems based on log-mel features as well as other past approaches proposed for speech dereverberation and denoising based on deep learning. In addition, we also extend the approach to large vocabulary speech recognition on VOiCES dataset [15,16].…”

Section: Introduction (mentioning)

confidence: 99%

Dereverberation of Autoregressive Envelopes for Far-field Speech Recognition

Purushothaman, Sreeram, Kumar et al. 2021

Preprint

The task of speech recognition in far-field environments is adversely affected by the reverberant artifacts that manifest as temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP) which performs an autoregressive estimation of the Hilbert envelopes. The neural dereverberation model estimates the envelope gain which, when applied to reverberant signals, suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition. Further, the sequence of steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can be implemented as a single neural processing pipeline which allows the joint learning of the dereverberation network and the acoustic model. Several experiments are performed on the REVERB challenge dataset, CHiME-3 dataset and VOiCES dataset. In these experiments, the joint learning of envelope dereverberation and acoustic model yields significant performance improvements over the baseline ASR system based on log-mel spectrogram as well as other past approaches for dereverberation (average relative improvements of 10-24% over the baseline system). A detailed analysis on the choice of hyper-parameters and the cost function involved in envelope dereverberation is also provided.

“…The training set of the VOiCES corpus [23,40] consists of 80-hour subset of the clean LibriSpeech corpus [44]. The training set has close talking microphone recordings from 427 speakers recorded in clean environments.…”

Section: Data (mentioning)

confidence: 99%

“…We also explore regularization of the model based on boundary equilibrium generative adversarial networks (BEGAN) [21]. In various E2E ASR experiments performed on the REVERB challenge dataset [22] as well as the VOiCES dataset [23], we show that the proposed approach improves over the state-of-art E2E ASR systems based on log-mel features with generalized (GEV) beamforming and weighted prediction error (WPE) based enhancement.…”

Section: Introduction (mentioning)

confidence: 95%

End-to-End Speech Recognition With Joint Dereverberation Of Sub-Band Autoregressive Envelopes

Kumar, Purushothaman, Sreeram et al. 2021

Preprint

End-to-end (E2E) automatic speech recognition (ASR) offers several advantages over previous efforts for recognizing speech. However, in reverberant conditions, E2E ASR is a challenging task as the long-term sub-band envelopes of the reverberant speech are temporally smeared. In this paper, we develop a feature enhancement approach using a neural model operating on sub-band temporal envelopes. The temporal envelopes are modeled using the framework of frequency domain linear prediction (FDLP). The neural enhancement model proposed in this paper performs an envelope gain based enhancement of temporal envelopes. The model architecture consists of a combination of convolutional and long short term memory (LSTM) neural network layers. Further, the envelope dereverberation, feature extraction and acoustic modeling using transformer based E2E ASR can all be jointly optimized for the speech recognition task. The joint optimization ensures that the dereverberation model targets the ASR cost function. We perform E2E speech recognition experiments on the REVERB challenge dataset as well as on the VOiCES dataset. In these experiments, the proposed joint modeling approach yields significant improvements compared to the baseline E2E ASR system (average relative improvements of 21% on the REVERB challenge dataset and about 10% on the VOiCES dataset).

“…Recently, far-field speaker recognition attracts more and more attention from the research community. The Voices Obscured in Complex Environmental Settings (VOiCES) Challenge launched in 2019 aims to benchmark state-of-the-art speech processing methods in far-field and noisy conditions [24]. The wake-up word dataset Hi Mia has also been released to facilitate the studies in far-field speaker recognition [25].…”

Section: Introduction (mentioning)

confidence: 99%

The INTERSPEECH 2020 Far-Field Speaker Verification Challenge

Qin et al. 2020

Interspeech 2020

The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent speaker verification from single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three tasks pose a cross-channel challenge to the participants. To simulate the real-life scenario, the enrollment utterances are recorded from close-talk cellphone, while the test utterances are recorded from the far-field microphone arrays. In this paper, we describe the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring. For a given utterance, the speaker embeddings of different channels are equally averaged as the final embedding. The baseline system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.
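The scoring step this abstract describes (equal averaging of per-channel speaker embeddings, then cosine similarity) can be illustrated with a minimal sketch. This is not the challenge's baseline code: the function names `average_embeddings` and `cosine_score` are hypothetical, the embeddings are toy 2-D vectors rather than ResNet outputs, and the sketch assumes non-zero vectors of equal length.

```python
import math


def average_embeddings(channel_embeddings):
    """Equal-weight mean of per-channel embeddings (hypothetical helper).

    channel_embeddings: non-empty list of equal-length lists of floats,
    one embedding per microphone channel.
    """
    n = len(channel_embeddings)
    dim = len(channel_embeddings[0])
    return [sum(e[i] for e in channel_embeddings) / n for i in range(dim)]


def cosine_score(a, b):
    """Cosine similarity between two non-zero embeddings (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: one enrollment embedding and a test utterance seen by two channels.
enrollment = [1.0, 0.0]
test_channels = [[1.0, 0.0], [0.0, 1.0]]

test_embedding = average_embeddings(test_channels)
score = cosine_score(enrollment, test_embedding)
```

In a verification setting a score like this would be compared against a threshold to accept or reject the claimed identity; how that threshold is chosen is outside what the abstract states.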
