Speech enhancement for robust automatic speech recognition: Evaluation using a baseline system and instrumental measures

Moore, Alastair H.; Parada, Pablo Peso; Naylor, Patrick A.

doi:10.1016/j.csl.2016.11.003

Cited by 29 publications

(18 citation statements)

References 11 publications

“…Please note that the correlation score between the CEG and WER for multi-condition training tends to be smaller than that for clean-condition training because the number of the same scores of WER corresponding to different scores of CEG for multi-condition training is larger, which leads to worse correlation statistics. Unlike the conclusions in [10,11], we observe that the correlation scores between the STOI and WER tend to be larger than those between the PESQ and WER for multi-condition training and the contrary conclusion could be drawn for clean-condition training, which may indicate that the recognition performance depends more largely on speech quality for clean-condition training and on speech intelligibility for multi-condition training. Besides, it is noted that the acoustic confidence measure and the proposed CEG have positive correlations with WER, on the contrary, PESQ and STOI have negative correlations with WER.…”

Section: Experimental Setting (contrasting)

confidence: 99%
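The comparison in this statement rests on correlation scores between an instrumental measure and WER. A minimal sketch of such a score follows, assuming Pearson's coefficient (the excerpt does not say which coefficient was used). The function name `pearson` and all numeric values are hypothetical, chosen only to show why a "higher is better" measure such as STOI correlates negatively with WER.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-condition scores (not taken from any of the cited papers).
stoi_scores = [0.60, 0.70, 0.80, 0.90]   # higher means more intelligible
wer_percent = [40.0, 30.0, 22.0, 12.0]   # lower means better recognition

r = pearson(stoi_scores, wer_percent)
print(f"correlation between STOI and WER: {r:.3f}")  # negative by construction
```

Many tied WER values paired with different measure scores shrink the covariance term relative to the variances, which is one way to read the statement's remark about weaker correlations under multi-condition training.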

“…For example, resulting speech enhanced by OM-LSA could improve recognition accuracy for clean-condition training regardless of its worse STOI shown in Table 2. Accordingly, the conclusion in [10,11] that the correlation coefficient between the WER and STOI is higher than other distortion measures (e.g., PESQ) is not accurate and reliable enough. Some researches [22,32] suggested by the conclusion in [10,11] designed a speech enhancement frontend to especially improve STOI and thus achieve better ASR performance.…”

Section: Comparison Of Evaluation Accuracy (mentioning)

confidence: 99%

“…In [4][5][6][7], a very good correlation between WER and the perceptual evaluation of speech quality (PESQ) [8] for measuring speech quality has been verified. Recently, it has been shown that the correlation coefficient between WER and the short-time objective intelligibility (STOI) [9] for measuring speech intelligibility is higher than other distortion measures [10,11]. Besides, in [12,13], an acoustic confidence measure usually defined as the entropy of the posterior distribution from the output of the artificial neural network (ANN) for the acoustic model was proposed and shown a high degree of correlation with WER, where the discriminatory power of the ANN decreases and the posterior probabilities tend to become more uniform with a higher entropy.…”

Section: Introduction (mentioning)

confidence: 99%
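The acoustic confidence measure mentioned in this statement is described as the entropy of the acoustic model's posterior distribution. The sketch below illustrates that idea under stated assumptions: posteriors are plain Python lists that sum to one per frame, the natural logarithm is used, and frames are averaged with equal weight. The function names are hypothetical and the details in [12,13] may differ.

```python
import math


def frame_entropy(posterior):
    """Shannon entropy (in nats) of one frame's posterior distribution."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in posterior if p > 0.0)


def mean_posterior_entropy(posteriors):
    """Average per-frame entropy over an utterance (assumed aggregation)."""
    if not posteriors:
        raise ValueError("need at least one frame")
    return sum(frame_entropy(p) for p in posteriors) / len(posteriors)


# Hypothetical 4-class posteriors for two frames.
confident = [0.97, 0.01, 0.01, 0.01]   # peaked: the model is sure
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform: the model cannot discriminate

print(frame_entropy(confident))   # small value
print(frame_entropy(uncertain))   # equals log(4), the maximum for 4 classes
print(mean_posterior_entropy([confident, uncertain]))
```

A uniform posterior gives the largest possible entropy, which matches the statement's point that flatter posteriors signal reduced discriminatory power.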

A Cross-Entropy-Guided (CEG) Measure for Speech Enhancement Front-End Assessing Performances of Back-End Automatic Speech Recognition

Chai

Lee

2019

Interspeech 2019

One challenging problem of robust automatic speech recognition (ASR) is how to measure the goodness of a speech enhancement algorithm without calculating word error rate (WER) due to the high costs of manual transcriptions, language modeling and decoding process. In this study, a novel cross-entropy-guided (CEG) measure is proposed for assessing if enhanced speech predicted by a speech enhancement algorithm would produce a good performance for robust ASR. CEG consists of three consecutive steps, namely the low-level representations via the feature extraction, high-level representations via the nonlinear mapping with the acoustic model, and the final CEG calculation between the high-level representations of clean and enhanced speech. Specifically, state posterior probabilities from the output of the neural network for the acoustic model are adopted as the high-level representations and a cross-entropy criterion is used to calculate CEG. Experimental results show that CEG could consistently yield the highest correlations with WER and achieve the most accurate assessment of the ASR performance when compared to distortion measures based on human auditory perception and an acoustic confidence measure. Potentially, CEG could be adopted to guide the parameter optimization of deep learning based speech enhancement algorithms to further improve the ASR performance.
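The final step described in the abstract, a cross-entropy between the state posteriors of clean and enhanced speech, can be sketched as follows. This is an illustration only, not the authors' exact definition: the abstract gives no formula, so the direction of the cross-entropy (clean posteriors as the reference), the flooring constant `eps`, the equal-weight frame averaging, and the function names are all assumptions.

```python
import math


def cross_entropy(reference, estimate, eps=1e-12):
    """Cross-entropy H(reference, estimate) in nats for one frame."""
    if len(reference) != len(estimate):
        raise ValueError("posterior vectors must have the same length")
    # Floor the estimate to avoid log(0); eps is an assumed safeguard.
    return -sum(p * math.log(max(q, eps)) for p, q in zip(reference, estimate))


def mean_cross_entropy(clean_posteriors, enhanced_posteriors):
    """Average frame-level cross-entropy between two posterior sequences."""
    if len(clean_posteriors) != len(enhanced_posteriors) or not clean_posteriors:
        raise ValueError("need two non-empty sequences with matching frames")
    total = sum(
        cross_entropy(p, q)
        for p, q in zip(clean_posteriors, enhanced_posteriors)
    )
    return total / len(clean_posteriors)


# Hypothetical 3-state posteriors for two frames.
clean = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
good_enhancement = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
poor_enhancement = [[0.1, 0.2, 0.7], [0.6, 0.2, 0.2]]

print(mean_cross_entropy(clean, good_enhancement))
print(mean_cross_entropy(clean, poor_enhancement))  # larger: posteriors diverge
```

Under these assumptions a lower value means the acoustic model sees the enhanced speech much as it sees the clean speech, which fits the positive correlation with WER reported in the citing statements.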

“…Recent studies have reported a positive correlation between objective intelligibility scores and ASR performance [27,32]. In Table 2, we show the STOI and PESQ scores of enhanced speech processed by RLSE1 and RLSE2 at SNR levels of 0 and 5 dB.…”

Section: Results (mentioning)

confidence: 97%

“…It has been reported that when the goal is to improve the ASR performance, ideal binary mask (IBM) is more suitable than ideal ratio mask (IRM) or directly mapping [27] to be used to design the SE system. Therefore, we implement an IBM-based SE system in this study.…”

Section: Ideal Binary Mask-based SE System (mentioning)

confidence: 99%
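The two masks contrasted in this statement can be illustrated with their commonly used textbook definitions, which the excerpt itself does not spell out. In the sketch below the inputs are per-bin speech and noise powers, the IBM keeps a bin when its local SNR exceeds a threshold `lc_db` (0 dB assumed), and the IRM is the square root of the speech-to-mixture power ratio. Function names, the threshold, and the sample values are assumptions.

```python
import math


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """1.0 where the local SNR exceeds lc_db, else 0.0 (one value per bin)."""
    threshold = 10.0 ** (lc_db / 10.0)  # dB criterion as a linear power ratio
    return [
        1.0 if s > n * threshold else 0.0
        for s, n in zip(speech_power, noise_power)
    ]


def ideal_ratio_mask(speech_power, noise_power):
    """Soft mask sqrt(S / (S + N)) per bin; 0.0 where both powers are zero."""
    return [
        math.sqrt(s / (s + n)) if (s + n) > 0.0 else 0.0
        for s, n in zip(speech_power, noise_power)
    ]


# Hypothetical powers for four time-frequency bins.
speech = [4.0, 1.0, 0.5, 9.0]
noise = [1.0, 4.0, 0.5, 1.0]

print(ideal_binary_mask(speech, noise))  # hard keep-or-discard decision per bin
print(ideal_ratio_mask(speech, noise))   # soft gain between 0 and 1 per bin
```

The binary mask makes an all-or-nothing decision per bin, while the ratio mask attenuates gradually.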

Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition

Shen

Huang

Wang

et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Conventional deep neural network (DNN)-based speech enhancement (SE) approaches aim to minimize the mean square error (MSE) between enhanced speech and clean reference. The MSE-optimized model may not directly improve the performance of an automatic speech recognition (ASR) system. If the target is to minimize the recognition error, the recognition results should be used to design the objective function for optimizing the SE model. However, the structure of an ASR system, which consists of multiple units, such as acoustic and language models, is usually complex and not differentiable. In this study, we proposed to adopt the reinforcement learning algorithm to optimize the SE model based on the recognition results. We evaluated the proposed SE system on the Mandarin Chinese broadcast news corpus (MATBN). Experimental results demonstrate that the proposed method can effectively improve the ASR results with notable 12.40% and 19.23% error rate reductions for signal-to-noise ratios of 0 dB and 5 dB, respectively. Index Terms: reinforcement learning, automatic speech recognition, speech enhancement, deep neural network, character error rate

Comparative Analysis of Intelligent Personal Agent Performance

Herbert

Kang

2019

Knowledge Management and Acquisition for Intelligent Systems
