N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification

Marcinek, Ľuboš; Stone, Michael; Millman, Rebecca E.; Gaydecki, Patrick

doi:10.21437/interspeech.2021-1878

Cited by 5 publications

(5 citation statements)

References 29 publications


“…Our work also shares connections with the literature on intelligibility prediction based on DNN representations [27,7,28,6,32]. More relevantly, in [6], SSL representations were used and optimized to predict multiple speech intelligibility indices.…”

Section: Related Work (mentioning)

confidence: 59%

“…The practical implications for intelligibility prediction research are evident from the study. Firstly, the results suggest that SSL representations should be chosen over supervised-learned ones, contrary to what has been done in [7,27] for instance. Secondly, cat(z, z_ref) consistently and significantly outperforming sim(z, z_ref) as a feature for intelligibility prediction, indicates that learned non-linear functions over raw features should be preferred over linear similarity measures.…”

Section: On the Meaning of Our Results (mentioning)

confidence: 79%
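
The contrast between cat(z, z_ref) and sim(z, z_ref) in the statement above can be illustrated with a small sketch. The excerpt does not define either feature, so this assumes that `cat` is plain vector concatenation and `sim` is cosine similarity; the function names and the toy vectors are hypothetical and are not taken from the cited paper.

```python
import numpy as np


def cat_feature(z: np.ndarray, z_ref: np.ndarray) -> np.ndarray:
    """Assumed 'cat' feature: concatenate both representations.

    A downstream learned model sees every raw dimension of z and z_ref.
    """
    return np.concatenate([z, z_ref])


def sim_feature(z: np.ndarray, z_ref: np.ndarray) -> float:
    """Assumed 'sim' feature: cosine similarity, a single scalar.

    Everything except the angle between the two vectors is discarded.
    """
    denom = np.linalg.norm(z) * np.linalg.norm(z_ref)
    return float(np.dot(z, z_ref) / denom)


# Toy stand-ins for a noisy-speech embedding and its clean reference.
z = np.array([1.0, 0.0, 2.0])
z_ref = np.array([2.0, 0.0, 4.0])

features_cat = cat_feature(z, z_ref)  # vector of length len(z) + len(z_ref)
features_sim = sim_feature(z, z_ref)  # one number in [-1, 1]
print(features_cat.shape, features_sim)
```

Under these assumptions, the similarity feature reduces each pair to one number, whereas the concatenated feature leaves it to the predictor to learn which dimensions matter, which is the distinction the quoted statement draws.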

“…We would have liked to perform evaluations on more data, but we are limited by the scarcity of corpora for speech perception [27]. We consider extending our analyses to the Clarity Challenge corpus [5] of responses from hearing impaired listeners.…”

Section: On Limitations and Future Work (mentioning)

confidence: 99%


On the Benefits of Self-supervised Learned Speech Representations for Predicting Human Phonetic Misperceptions

Cuervo, Marxer

2023

Interspeech 2023


Deep neural networks (DNNs) trained by self-supervised learning (SSL) have recently been shown to produce representations similar to brain activations for the same speech input. Can SSL representations help to explain human speech perception errors? Aiming to shed light on this question, we study their use for phonetic misperception prediction. We extract representations from wav2vec 2.0, a recent SSL architecture for speech, and use them to compute features for a model predicting the presence of phonetic perception errors in speech-in-noise signals. We perform our experiments on a corpus of over 3000 consistent word-in-noise confusions in English. We consider multiple SSL-based features and compare them against conventional acoustic baselines and features obtained from DNNs fine-tuned through supervised learning for ASR. Our results show the superiority of SSL representations when extracted from the proper layer, further suggesting their potential to model human speech perception.


“…Other NR tools produce estimates of objective values including FR speech quality values [23], [30], [32], [38], [44], [51], [54], [56], [57], FR speech intelligibility values [30], [32], [38], [44], [52], [54], [56], [57], speech transmission index [22], codec bit-rate [46], and detection of specific impairments, artifacts, or noise types [34], [39], [41], [52]. Some of these tools perform a single task and others perform multiple tasks.…”

Section: A. Existing Machine Learning Approaches (mentioning)

confidence: 99%

Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran

2023

IEEE Access


Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality and intelligibility. Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require “reference” (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values with per-segment correlations in the range of 0.92 to 0.96. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves a per-segment correlation of 0.97. The performance of our WAWEnet architecture compares favorably to models with orders-of-magnitude more parameters and computational complexity. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
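
The abstract's remark that ReLUs move spectral information from non-DC components into the DC component can be seen on a toy signal. The sketch below is not the WAWEnet architecture; it only applies a ReLU to a pure tone, and the sample rate, frequency, and amplitude are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Assumed toy parameters (not from the paper).
sample_rate = 16000          # samples per second
frequency = 440.0            # tone frequency in Hz
amplitude = 0.5

# One second of a zero-mean sine: all of its energy sits at a non-DC frequency.
t = np.arange(sample_rate) / sample_rate
x = amplitude * np.sin(2 * np.pi * frequency * t)

# DC component = mean of the signal (the 0 Hz DFT bin divided by the length).
dc_before = x.mean()

# ReLU (half-wave rectification) discards the negative half-cycles,
# so the output mean now depends on the amplitude of the tone.
dc_after = np.maximum(x, 0.0).mean()

# For an ideal half-wave rectified sine the mean is amplitude / pi.
print(dc_before, dc_after, amplitude / np.pi)
```

Before the ReLU the mean carries no information about the tone; afterwards it scales with the tone's amplitude, so a simple average over time can read out something that was previously encoded only at a non-zero frequency.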


“…The non-intrusive speech quality assessment model called NISQA [50] produces estimates of subjective speech quality as well as four constituent dimensions: noisiness, coloration, discontinuity, and loudness. Other NR tools produce estimates of objective values including FR speech quality values [22], [29], [31], [42], [48], FR speech intelligibility values [29], [31], [42], [49], speech transmission index [21], codec bit-rate [43], and detection of specific impairments, artifacts, or noise types [33], [37], [39], [49]. Some of these tools perform a single task and others perform multiple tasks.…”

Section: A. Existing Machine Learning Approaches (mentioning)

confidence: 99%

Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran

2022

Preprint


Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require “reference” (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves very high levels of agreement. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
