An Attention Enhanced Multi-Task Model for Objective Speech Assessment in Real-World Environments

Dong, Xuan; Williamson, Donald S.

doi:10.1109/icassp40776.2020.9053366

Cited by 34 publications

(29 citation statements)

References 25 publications

“…However, it was developed for narrow-band applications and works well on limited impairment types. Recently, Deep Neural Networks (DNN) based approaches have been proposed to estimate the speech quality scores [6,7,8]. Some of these learning-based approaches use other objective metrics as the ground truth to train their speech quality predictor.…”

Section: Introduction (mentioning)

confidence: 99%

Dnsmos: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors

Reddy, Gopal, Cutler, 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

124

Human subjective evaluation is the "gold standard" to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. The conventional and widely used metrics require a reference clean speech signal, which is unavailable in real recordings. Previous no-reference approaches correlate poorly with human ratings and are not widely adopted in the research community. One of the biggest use cases of these perceptual objective metrics is to evaluate noise suppression algorithms. This paper introduces a multi-stage self-teaching based perceptual objective metric that is designed to evaluate noise suppressors. The proposed method generalizes well in challenging test conditions with a high correlation to human ratings.

“…Note that the comparison approaches: DNN, Quality-Net, NISQA and pBi-LSTM+Att are all trained and evaluated for a single target each time, while our approach assesses the speech from different perspectives at the same time. Therefore, we further compare our approach with our prior work (i.e., AMSA [23]) that is capable of estimating multiple objective targets. Results still demonstrate the superiority of our system in all these objective targets, where joint subjective and objective assessment improves performance.…”

Section: Experimental Results and Analysis (mentioning)

confidence: 99%

“…The models are trained with 100 epochs using Adam optimizer and all models are trained and evaluated separately on COSINE and VOiCES datasets. We include 5 non-intrusive data-driven models as comparison approaches, including a multi-task model for objective score estimation (AMSA) [23], a deep neural network (DNN) model [18], Quality-Net [14], NISQA [13] and pBi-LSTM+Att [22]. Note that all these data-driven models except AMSA are separately trained for each target since they are not designed for multi-task estimation.…”

Section: Network Architecture (mentioning)

confidence: 99%

An End-To-End Non-Intrusive Model for Subjective and Objective Real-World Speech Assessment Using a Multi-Task Framework

Zhang, Vyas, Dong, et al., 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Speech assessment is crucial for many applications, but current intrusive methods cannot be used in real environments. Data-driven approaches have been proposed, but they use simulated speech materials or only estimate objective scores. In this paper, we propose a novel multi-task non-intrusive approach that is capable of simultaneously estimating both subjective and objective scores of real-world speech, to help facilitate learning. This approach enhances our prior work, which estimated subjective mean-opinion scores, where our approach now operates directly on the time-domain signal in an end-to-end fashion. The proposed system is compared against several state-of-the-art systems. The experimental results show that our multi-task and end-to-end framework leads to higher correlation performance and lower prediction errors, according to multiple evaluation measures.
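
As a rough illustration of the kind of evaluation the abstract above refers to, the sketch below compares estimated scores against reference scores using one correlation measure and one error measure. The abstract does not name its measures, so the choice of Pearson correlation and mean absolute error, the function names, and the sample values are all assumptions made only for illustration.

```python
import math


def pearson(estimates, references):
    """Pearson correlation between two equal-length score sequences."""
    n = len(estimates)
    if n != len(references) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_e = sum(estimates) / n
    mean_r = sum(references) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimates, references))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_r = sum((r - mean_r) ** 2 for r in references)
    if var_e == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_e * var_r)


def mean_absolute_error(estimates, references):
    """Average absolute difference between estimates and references."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


# Hypothetical scores on a 1-5 scale, invented for this example.
estimated = [3.1, 2.4, 4.0, 1.8]
reference = [3.0, 2.0, 4.2, 2.0]
correlation = pearson(estimated, reference)
error = mean_absolute_error(estimated, reference)
```

A higher correlation and a lower error would both indicate closer agreement with the reference scores, which is the direction of improvement the abstract describes.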

“…Multi-task learning (MTL) [20] is an approach in deep learning when the model performs at least two tasks. MTL has been successfully applied in various fields [20] including speech (e.g., speech recognition [21], speech enhancement [22], or objective speech assessment in real-world environments by generating several objective intelligibility and quality scores [23]).…”

Section: Non-intrusive Multi-task Transfer Learning-based Speech Intelligibility Model (mentioning)

confidence: 99%

N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification

Marcinek, Stone, Millman, et al., 2021

Interspeech 2021

The application of speech enhancement algorithms for hearing aids may not always be beneficial to increasing speech intelligibility. Therefore, a prior environment classification could be important. However, previous speech intelligibility models do not provide any additional information regarding the reason for a decrease in speech intelligibility. We propose a unique non-intrusive multi-task transfer learning-based speech intelligibility prediction model with scenery classification (N-MTTL SI model). The solution combines a Mel-spectrogram analysis of the degraded speech signal with transfer learning and multi-task learning to provide simultaneous speech intelligibility prediction (task 1) and scenery classification of ten real-world noise conditions (task 2). The model utilises a pre-trained ResNet architecture as an encoder for feature extraction. The prediction accuracy of the N-MTTL SI model for both tasks is high. Specifically, RMSE of speech intelligibility predictions for seen and unseen conditions is 3.76% and 4.06%. The classification accuracy is 98%. In addition, the proposed solution demonstrates the potential of using pre-trained deep learning models in the domain of speech intelligibility prediction.
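
The two figures reported above are of different kinds: an RMSE for the intelligibility regression task and an accuracy for the scenery classification task. A minimal sketch of how such figures might be computed follows; the function names, the sample intelligibility percentages, and the scene labels are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(predicted_labels, true_labels):
    """Fraction of labels that match exactly."""
    if len(predicted_labels) != len(true_labels) or not true_labels:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return hits / len(true_labels)


# Task 1 (regression): hypothetical intelligibility scores in percent.
si_rmse = rmse([80.0, 60.0], [77.0, 64.0])

# Task 2 (classification): hypothetical scene labels.
scene_acc = accuracy(
    ["cafe", "street", "car", "cafe"],
    ["cafe", "street", "car", "bus"],
)
```

Reporting the two numbers separately, as the abstract does, keeps the regression error in the units of the intelligibility score while the classification result stays a plain proportion.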
