2016 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2016.7846244
Performance monitoring for automatic speech recognition in noisy multi-channel environments

Cited by 11 publications (9 citation statements)
References 15 publications
“…It appears that methods borrowed from ASR become more and more useful in HSR research now that the overall performance gap between humans and machines gets smaller (or vanishes for single, well-studied databases [9]). In our experiments, the best correlations are obtained with the M-Measure, which was shown earlier to be clearly related to parameters that influence speech intelligibility in hearing aids, e.g., the optimal direction of a beamformer when spatial filtering is performed in multi-channel hearing aids [19]. This is one example of a strategy developed for ASR (specifically, stream weighting in multi-stream ASR) that has a meaningful application in human speech perception (specifically, hearing research), as advertised in [20].…”
Section: Discussion (supporting)
confidence: 60%
“…This would require running a DNN classifier on hearing aid hardware in real time. As estimated in [19], a forward run of a standard DNN as used in our experiments is not possible on current hearing aid hardware due to limitations in power consumption. However, when the model complexity is reduced by a factor of 10, such real-time processing becomes feasible.…”
Section: Discussion (mentioning)
confidence: 99%
“…Before calculating the MTD, the context-dependent triphones from the DNN are grouped into approximately 40 monophones. This allows the output to be visualized (Figure 1), is computationally cheaper, and produces results similar to using the triphone activations directly [15]. Note that a forward run of the model does not require a decoding step with the HMM or a word transcript, since it relies on the DNN output alone.…”
Section: Speech Quality Prediction System (mentioning)
confidence: 99%
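The two steps quoted above — collapsing triphone posteriors onto their center monophones, then computing a mean temporal distance (MTD/M-Measure) over the posteriorgram — can be sketched in a few lines. This is a minimal illustration, not the cited implementation: the function names, the symmetric-KL distance, and the lag range 1–10 frames are assumptions made here; the original M-Measure work averages divergences over longer, specific time spans.

```python
import numpy as np

def group_to_monophones(triphone_post, center_phone):
    """Sum triphone posteriors that share the same center monophone.

    triphone_post: (frames, n_triphones) array of DNN posteriors.
    center_phone:  sequence mapping each triphone index to a monophone index.
    """
    n_mono = max(center_phone) + 1
    mono = np.zeros((triphone_post.shape[0], n_mono))
    for tri, mono_idx in enumerate(center_phone):
        mono[:, mono_idx] += triphone_post[:, tri]
    return mono

def sym_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two posterior vectors."""
    p, q = p + eps, q + eps
    # sum((p - q) * log(p / q)) equals KL(p||q) + KL(q||p)
    return float(np.sum((p - q) * np.log(p / q)))

def m_measure(posteriors, lags=range(1, 11)):
    """Mean temporal distance: average divergence between posterior
    vectors that are dt frames apart, averaged over a set of lags."""
    per_lag = []
    for dt in lags:
        dists = [sym_kl(posteriors[t], posteriors[t + dt])
                 for t in range(len(posteriors) - dt)]
        per_lag.append(np.mean(dists))
    return float(np.mean(per_lag))
```

Intuitively, a flat or slowly drifting posteriorgram (the classifier is unsure, as in heavy noise) yields a low M value, while confident, rapidly switching phone posteriors on clean speech yield a high one — which is what makes the measure usable for performance monitoring without a transcript.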
“…This process of enhancement is of great importance for many applications, such as mobile phone communications, VoIP, teleconferencing systems, hearing aids, and automatic speech recognition (ASR) systems. For example, several authors have recently reported a decrease in ASR performance in the presence of noise [2][3][4], and there is concern about the performance of hearing aid devices as well [5,6].…”
Section: Introduction (mentioning)
confidence: 99%