Interspeech 2018
DOI: 10.21437/interspeech.2018-1374
Prediction of Perceived Speech Quality Using Deep Machine Listening

Abstract: Subjective ratings of speech quality (SQ) are essential for evaluating algorithms for speech transmission and enhancement. In this paper we explore a non-intrusive model for SQ prediction based on the output of the deep neural net (DNN) of a regular automatic speech recognizer. The degradation of the phoneme probabilities obtained from the net is quantified with the mean temporal distance proposed earlier for multi-stream ASR. The SQ predicted with this method is compared with average subject ratings from the TCD-…
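As a rough illustration of the approach described in the abstract, the snippet below computes a mean-temporal-distance (MTD) style measure over phoneme posteriors. It is a minimal sketch, not the authors' implementation: it assumes the MTD is the symmetric KL divergence between posterior vectors separated by a frame lag, averaged over time and over a range of lags, and the posteriors used here are random stand-ins for the softmax outputs of an ASR acoustic model.

```python
# Illustrative sketch only; the distance measure and lag range are assumptions.
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two posterior vectors."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def mean_temporal_distance(posteriors, lags=range(1, 51)):
    """
    posteriors: (T, n_phonemes) array of per-frame softmax outputs
                from an ASR acoustic model.
    Returns the distance averaged over frames and over the given lags.
    """
    T = posteriors.shape[0]
    per_lag = []
    for dt in lags:
        dists = [symmetric_kl(posteriors[t], posteriors[t + dt])
                 for t in range(T - dt)]
        per_lag.append(np.mean(dists))
    return float(np.mean(per_lag))

# Toy usage with random "posteriors" (rows sum to 1); in practice these
# would come from the DNN of a regular speech recognizer, and the MTD
# would then be mapped to a quality score.
rng = np.random.default_rng(0)
logits = rng.normal(size=(300, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print("MTD:", mean_temporal_distance(post))
```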

Cited by 18 publications (12 citation statements)
References 17 publications
“…Furthermore, it was challenging to find suitable features that rely on the degraded output signal only to estimate the dimension Discontinuity. Recently, the use of deep learning for audio and speech classification and recognition tasks has become increasingly popular [12,13,14,15,16,17,18]. In [19], we also showed that convolutional neural networks (CNN) can be used to detect packet-loss concealment in speech signals, which indicates that they are suitable to predict the perceived Discontinuity as well.…”
Section: Introduction (mentioning)
Confidence: 90%
“…More recently, [8] uses a CNN to estimate per-frame quality and adopts an RNN to aggregate the per-frame values over time in order to estimate the overall speech quality. [9] predicts speech quality with a model based on the outputs of an automatic speech recognizer, and in [10], a model based on a BiLSTM network is shown to assess speech quality.…”
Section: Introduction (mentioning)
Confidence: 99%
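The statement above mentions a per-frame CNN whose outputs are aggregated over time by an RNN. The sketch below is one possible arrangement of that idea, not the model from [8]: the layer sizes, the LSTM aggregator, and the log-mel input features are all assumptions made only for illustration.

```python
# Illustrative sketch of a CNN-plus-RNN quality estimator; all details assumed.
import torch
import torch.nn as nn

class FrameQualityCnnRnn(nn.Module):
    def __init__(self, n_mels=64, hidden=32):
        super().__init__()
        # CNN over (freq, time); padding preserves the time resolution.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # pool away the frequency axis
        )
        self.frame_score = nn.Linear(16, 1)    # per-frame quality score
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.overall = nn.Linear(hidden, 1)    # utterance-level quality estimate

    def forward(self, mel):                    # mel: (batch, n_mels, frames)
        x = self.cnn(mel.unsqueeze(1))         # (batch, 16, 1, frames)
        x = x.squeeze(2).transpose(1, 2)       # (batch, frames, 16)
        per_frame = self.frame_score(x)        # (batch, frames, 1)
        _, (h, _) = self.rnn(per_frame)        # aggregate per-frame scores over time
        return self.overall(h[-1]).squeeze(-1), per_frame

# Toy usage on a random 3-second "log-mel spectrogram".
model = FrameQualityCnnRnn()
overall, frame_scores = model(torch.randn(2, 64, 300))
print(overall.shape, frame_scores.shape)
```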
“…Automatic speech recognition (ASR) systems that incorporate deep neural networks (DNNs) can perform similarly to human speech recognition (HSR) in certain situations [11,12]. Motivated by this, DNN-based ASR systems are used to predict the SI obtained in human listening experiments [13,14,15,16,17,18,19,20]. A further advantage is that ASR does not need reference signals for comparison.…”
Section: Introduction (mentioning)
Confidence: 99%
“…They also reported that the prediction performance depends on whether or not the masker types of the test signals are included in the training datasets. In [15,16,17,19,20], the subjective SI of speech masked by various types of noise, and of noisy speech processed by noise-reduction algorithms in hearing aids and microphones, was predicted using DNN-based ASR. The authors used the mean temporal distance (MTD) of phone posteriors, that is, the softmax outputs of a DNN.…”
Section: Introduction (mentioning)
Confidence: 99%