Previous studies support the idea of merging auditory-based Gabor features with deep learning architectures to achieve robust automatic speech recognition; however, the cause of the gains from such a combination is still unknown. We believe these representations provide the deep learning decoder with more discriminable cues. In this paper, we aim to validate this hypothesis by performing experiments on three different recognition tasks (Aurora 4, CHiME 2 and CHiME 3) and assessing the discriminability of the information encoded by Gabor filterbank features. Additionally, to identify the contributions of low, medium and high temporal modulation frequencies, subsets of the Gabor filterbank were used as features (dubbed LTM, MTM and HTM, respectively). With temporal modulation frequencies between 16 and 25 Hz, HTM consistently outperformed the other subsets in every condition, highlighting the robustness of these representations against channel distortions, low signal-to-noise ratios and acoustically challenging real-life scenarios, with relative improvements of 11 to 56% over a Mel-filterbank-DNN baseline. To explain the results, a measure of similarity between phoneme classes derived from DNN activations is proposed and linked to their acoustic properties. We find this measure to be consistent with the observed error rates and highlight specific differences at the phoneme level to pinpoint the benefit of the proposed features.
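For illustration, a minimal numpy sketch of how such temporal-modulation subsets can be obtained: purely temporal 2-D Gabor filters applied to a log-Mel spectrogram, with the 16-25 Hz set playing the role of the HTM subset. The filter size, the Hann envelope and the 100 Hz frame rate are assumptions made for this sketch, not the exact filterbank configuration used in the paper.

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_filter(omega_t, omega_k, size_t=11, size_k=11):
        # Real part of a 2-D spectro-temporal Gabor filter: a cosine
        # carrier under a Hann envelope, mean-removed so the filter has
        # zero DC response.
        t = np.arange(size_t) - size_t // 2
        k = np.arange(size_k) - size_k // 2
        T, K = np.meshgrid(t, k, indexing="ij")
        envelope = np.outer(np.hanning(size_t), np.hanning(size_k))
        g = envelope * np.cos(omega_t * T + omega_k * K)
        return g - g.mean()

    def gabor_features(log_mel, temp_mods_hz, frame_rate=100.0):
        # Filter a log-Mel spectrogram (frames x channels) with purely
        # temporal Gabor filters (spectral modulation = 0) and stack the
        # filtered outputs along the channel axis.
        feats = []
        for f in temp_mods_hz:
            omega_t = 2 * np.pi * f / frame_rate  # Hz -> radians/frame
            feats.append(convolve2d(log_mel, gabor_filter(omega_t, 0.0),
                                    mode="same"))
        return np.concatenate(feats, axis=1)

    # An HTM-like subset: temporal modulations between 16 and 25 Hz.
    log_mel = np.random.randn(300, 40)  # placeholder spectrogram
    htm = gabor_features(log_mel, [16.0, 20.0, 25.0])  # shape (300, 120)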
The advantages and limitations of utilizing automatic speech recognition (ASR) techniques for modelling human speech recognition are investigated for a set of "critical" speech maskers for which many standard models of human speech recognition fail. A deep neural net (DNN)-based ASR system utilizing a closed-set sentence recognition test is used to model the speech recognition threshold (SRT) of normal-hearing listeners for a variety of noise types. The benchmark data from Schubotz et al. (2016) include SRTs measured in conditions of increasing complexity in terms of spectro-temporal modulation (from stationary speech-shaped noise to a single interfering talker). The DNN-based model proposed in Spille et al. (2018) produces higher prediction accuracy than baseline models (i.e., SII, ESII, STOI and mr-sEPSM), even though it does not require a clean speech reference signal (as is the case for most auditory-model-based SRT predictions). The most accurate predictions are obtained with multi-condition training on known noise types and with ASR features that explicitly account for temporal modulations in noisy sentences. Another advantage of the approach is that the DNN can serve as a valuable analysis tool to uncover signal recognition strategies: for instance, by identifying the most relevant cues for correct classification in modulated noise, it is shown that the DNN is listening in the dips. Finally, we present preliminary data indicating that the measured word error rate (WER) of the model can be replaced with an estimated WER that does not require the transcripts of the utterances at test time, thereby eliminating an important limitation that prevented the previous model from being used in real-world scenarios.
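How a recognition-rate curve is turned into an SRT can be sketched as follows: run the ASR model at several SNRs, fit a psychometric function to the scores, and read off the SNR at 50% words correct. The logistic form and the numbers below are illustrative assumptions; the exact fitting procedure in Spille et al. (2018) may differ.

    import numpy as np
    from scipy.optimize import curve_fit

    def psychometric(snr, srt, slope):
        # Logistic psychometric function: fraction of words correct vs SNR.
        return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

    def estimate_srt(snrs, frac_correct):
        # Fit the logistic to the ASR scores; the SRT is the SNR at which
        # the fitted curve crosses 50% words correct.
        (srt, slope), _ = curve_fit(psychometric, snrs, frac_correct,
                                    p0=[np.median(snrs), 0.5])
        return srt

    # Hypothetical ASR word scores for one masker at several SNRs.
    snrs = np.array([-15.0, -10.0, -5.0, 0.0, 5.0])
    frac = np.array([0.05, 0.20, 0.55, 0.85, 0.97])
    print(f"predicted SRT: {estimate_srt(snrs, frac):.1f} dB SNR")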
To what extent can neural nets learn the traditional signal processing stages of current robust ASR front-ends? Will neural nets replace classical, often auditory-inspired feature extraction in the near future? To answer these questions, a DNN-based ASR system was trained and tested on the Aurora4 robust ASR task using various (intermediate) processing stages as input. Additionally, the training set was divided into several fractions to reveal the amount of data needed to compensate for a missing processing step on the input signal or for missing prior knowledge about the auditory system. The DNN system was able to learn from plain spectrogram representations, outperforming MFCCs with 75% of the training set and performing almost as well as log-Mel spectrograms with the full set; on the other hand, it was unable to match the robustness of auditory-based Gabor features, which outperformed every other representation even with only 40% of the training data. The study concludes that, even with deep learning approaches, current ASR systems still benefit from a suitable feature extraction.
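The chain of intermediate processing stages compared in such experiments can be sketched with librosa; the 25 ms window, 10 ms hop and 40 Mel channels below are assumed values for the sketch, not necessarily those of the Aurora4 setup.

    import numpy as np
    import librosa

    def processing_stages(wav, sr=16000, n_mels=40, n_mfcc=13):
        # Intermediate representations that can each be fed to the DNN:
        # power spectrogram -> log-Mel spectrogram -> MFCC.
        spec = np.abs(librosa.stft(wav, n_fft=400, hop_length=160)) ** 2
        mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)
        mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)
        return spec, log_mel, mfcc

    # Placeholder signal: 2 s of noise at 16 kHz.
    wav = np.random.default_rng(0).standard_normal(32000).astype(np.float32)
    spec, log_mel, mfcc = processing_stages(wav)
    print(spec.shape, log_mel.shape, mfcc.shape)  # 201, 40 and 13 rows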
Features, patterns and classifiers: the goal of a classifier is to partition feature space into class-labeled decision regions; borders between decision regions are called decision boundaries. (Figure: feature-space examples of highly correlated features, non-linear separability, linear separability and multi-modal class distributions.)
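A small scikit-learn sketch of this idea on assumed toy data: evaluating a trained classifier on a dense grid makes the class-labeled decision regions, and hence the decision boundary, explicit.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Two toy 2-D classes: offset Gaussians, (almost) linearly separable.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)),
                   rng.normal(2.0, 1.0, (100, 2))])
    y = np.repeat([0, 1], 100)

    # Predicting on a grid partitions feature space into decision regions;
    # the decision boundary is where the predicted label changes.
    clf = LogisticRegression().fit(X, y)
    xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
    regions = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)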