Realistic Multi-Microphone Data Simulation for Distant Speech Recognition

Ravanelli, Mirco; Svaizer, Piergiorgio; Omologo, Maurizio

doi:10.21437/interspeech.2016-731

Cited by 32 publications

(32 citation statements)

References 30 publications


“…To validate our model in a more challenging scenario, experiments were also conducted in distant-talking conditions with the DIRHA-English dataset [36,37]. Training was based on the original WSJ-5k corpus (consisting of 7,138 sentences uttered by 83 speakers) that was contaminated with a set of impulse responses measured in a domestic environment [37]. The test phase was carried out with the real part of the dataset, consisting of 409 WSJ sentences uttered in the aforementioned environment by six native American speakers.…”

Section: Corpora and Tasks

Classification: mentioning

confidence: 99%

The PyTorch-Kaldi Speech Recognition Toolkit

Ravanelli, Parcollet, Bengio

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits; it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed so that user-defined acoustic models can be plugged in naturally. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly locally or on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.
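The citation statement for this entry describes training data obtained by contaminating clean WSJ speech with measured impulse responses. A minimal sketch of that contamination step follows, assuming it amounts to convolving each clean utterance with one impulse response per microphone. The function names and the toy sample values are illustrative and not taken from the cited work, which may also add noise and apply level scaling.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two 1-D sequences (pure Python)."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def contaminate(clean, impulse_response):
    """Filter close-talk speech with a room impulse response.

    The reverberation tail is trimmed so the output keeps the input
    length (an assumption made here to keep labels aligned).
    """
    return convolve(clean, impulse_response)[:len(clean)]


def contaminate_multichannel(clean, impulse_responses):
    """Apply one impulse response per microphone to the same source."""
    return [contaminate(clean, rir) for rir in impulse_responses]


# Toy data (hypothetical): a 4-sample "utterance" and two short responses.
clean = [1.0, 0.0, 0.0, 0.5]
rirs = [
    [1.0, 0.0, 0.5],  # microphone 1: direct path plus one reflection
    [0.5, 0.25],      # microphone 2: attenuated direct path plus reflection
]
channels = contaminate_multichannel(clean, rirs)
print(channels)
```

With real recordings, an FFT-based convolution would be far faster than the nested loops used here; the loops are kept only so the sketch has no dependencies.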


“…The assumption is that much like blind or non-intrusive acoustic parameter estimation can be used as a proxy for estimating ASR performance [16], a neural network model can be trained to extract features from reverberant speech that are correlated with WER. The proposed method assumes reverberant speech samples transcribed by an ASR engine and the corresponding WER per utterance calculated by (7). The same data split as described in Section 2 is used.…”

Section: Predicting WER Blindly From Reverberant Speech Using a Cnn-l

Classification: mentioning

confidence: 99%

Predicting Word Error Rate for Reverberant Speech

Gamper, Emmanouilidou, Braun, et al.

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Reverberation negatively impacts the performance of automatic speech recognition (ASR). Prior work on quantifying the effect of reverberation has shown that clarity (C50), a parameter that can be estimated from the acoustic impulse response, is correlated with ASR performance. In this paper we propose predicting ASR performance in terms of the word error rate (WER) directly from acoustic parameters via a polynomial, sigmoidal, or neural network fit, as well as blindly from reverberant speech samples using a convolutional neural network (CNN). We carry out experiments on two state-of-the-art ASR models and a large set of acoustic impulse responses (AIRs). The results confirm C50 and C80 to be highly correlated with WER, allowing WER to be predicted with the proposed fitting approaches. The proposed non-intrusive CNN model outperforms C50-based WER prediction, indicating that WER can be estimated blindly, i.e., directly from the reverberant speech samples without knowledge of the acoustic parameters.
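The abstract above relies on clarity values (C50, C80) estimated from an acoustic impulse response. A minimal sketch of that computation follows, using the standard definition of clarity as the ratio in decibels of the energy arriving before a time boundary to the energy arriving after it. The function name `clarity_db` is hypothetical, and the sketch assumes the impulse response starts at the direct-path arrival; it is not the cited paper's implementation.

```python
import math


def clarity_db(impulse_response, sample_rate, boundary_ms=50.0):
    """Early-to-late energy ratio in dB (C50 for 50 ms, C80 for 80 ms).

    Assumes index 0 of `impulse_response` is the direct-path arrival.
    """
    split = int(round(sample_rate * boundary_ms / 1000.0))
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    if early == 0.0 or late == 0.0:
        raise ValueError("need non-zero energy on both sides of the boundary")
    return 10.0 * math.log10(early / late)


# Toy response (hypothetical) sampled at 1 kHz: a unit direct path at 0 ms
# and a single late reflection of amplitude 0.5 at 60 ms.
air = [0.0] * 100
air[0] = 1.0
air[60] = 0.5
c50 = clarity_db(air, sample_rate=1000)                    # reflection counts as late
c80 = clarity_db(air + [0.1], 1000, boundary_ms=80.0)      # reflection counts as early
print(c50, c80)
```

A higher value means more of the energy arrives early, which the abstract reports as correlating with lower WER.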


“…To evaluate the impact of room acoustics on the accuracy of speaker verification, a proper dataset of reverberant audio is needed. An alternative that fills a qualitative gap between unsatisfying simulation (despite the improvement of realism reported in Ravanelli et al., 2016) and costly and demanding real speaker recording, is retransmission. To our advantage, we can also use the fact that a known dataset can be retransmitted so that the performances are readily comparable with known benchmarks.…”

Section: NIST Retransmitted Set (BUT-RET)

Classification: mentioning

confidence: 99%

Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition

Novotny, Plchot, Glembek, et al.

2019

Computer Speech & Language


In this work, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker verification (SV) system. We start our approach by carefully designing a data augmentation process to cover a wide range of acoustic conditions and obtain rich training data for various components of our SV system. We augment several well-known databases used in SV with artificially noised and reverberated data and we use them to train a denoising autoencoder (mapping noisy and reverberated speech to its clean version) as well as an x-vector extractor, which is currently considered state-of-the-art in SV. Later, we use the autoencoder as a preprocessing step for a text-independent SV system. We compare results achieved with autoencoder enhancement, multi-condition PLDA training and their simultaneous use. We present a detailed analysis with various conditions of NIST SRE 2010, 2016, PRISM and with re-transmitted data. We conclude that the proposed preprocessing can significantly improve both i-vector and x-vector baselines and that this technique can be used to build a robust SV system for various target domains.
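The abstract above mentions augmenting clean databases with artificially noised data. One common way to do this is to add a noise recording scaled to a chosen signal-to-noise ratio; the sketch below illustrates that idea only. The function name `mix_at_snr`, the toy samples, and the choice of average power as the level measure are assumptions, and the cited work's actual augmentation recipe is not described in this listing.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR.

    The noise is cropped to the speech length; it must be at least as long.
    """
    if not speech or len(noise) < len(speech):
        raise ValueError("noise must cover the whole (non-empty) utterance")
    noise = noise[:len(speech)]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy data (hypothetical): unit-power "speech" and quarter-power "noise".
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=0.0)
print(noisy)
```

Pairs of `noisy` and `speech` built this way are the kind of input and target a denoising autoencoder is trained on.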
