Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Andrew; Voran, S.

doi:10.36227/techrxiv.20154785

Cited by 1 publication

(2 citation statements)

References 0 publications

“…Other NR tools produce estimates of objective values including FR speech quality values [23], [30], [32], [38], [44], [51], [54], [56], [57], FR speech intelligibility values [30], [32], [38], [44], [52], [54], [56], [57], speech transmission index [22], codec bit-rate [46], and detection of specific impairments, artifacts, or noise types [34], [39], [41], [52]. Some of these tools perform a single task and others perform multiple tasks.…”

Section: A Existing Machine Learning Approaches (mentioning)

confidence: 99%

“…We then trained WAWEnets to estimate these FR values using only the impaired segments. As in [30] and [32], we performed inverse phase augmentation (IPA) to augment all datasets in order to train WAWEnet to learn invariance to waveform phase inversion. This augmentation increased the amount of data available to just over 500 hours of total speech data.…”

Section: Hours (mentioning)

confidence: 99%
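
The inverse phase augmentation mentioned in the statement above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that phase inversion means negating every sample of a waveform, and that the inverted copy keeps the same target value as the original. The function name and the toy data are hypothetical.

```python
def inverse_phase_augment(waveforms, targets):
    """Append a sign-flipped copy of each waveform, reusing its target.

    Assumption: "phase inversion" is taken to mean multiplying every
    sample by -1, so the target value is unchanged for the copy.
    """
    inverted = [[-sample for sample in waveform] for waveform in waveforms]
    augmented_waveforms = list(waveforms) + inverted
    augmented_targets = list(targets) + list(targets)
    return augmented_waveforms, augmented_targets


# Toy data (hypothetical): three short segments with one target each.
segments = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5], [-0.4, 0.4, 0.2]]
scores = [3.1, 4.2, 2.7]

aug_segments, aug_scores = inverse_phase_augment(segments, scores)
```

Training on both the original and the inverted copies gives the network examples where the target is the same for either polarity, which is one way to encourage the invariance the statement describes.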

Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran

2023

IEEE Access

Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality and intelligibility. Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require "reference" (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values with per-segment correlations in the range of 0.92 to 0.96. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves a per-segment correlation of 0.97. The performance of our WAWEnet architecture compares favorably to models with orders-of-magnitude more parameters and computational complexity. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
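
The ReLU observation in the abstract can be illustrated with a small numerical sketch. This is not the WAWEnet architecture; it assumes a single hypothetical zero-mean sine tone and takes the DC component to be the sample mean. Before the ReLU the tone has essentially no DC content, and after the ReLU (half-wave rectification) part of its energy appears at DC, with a value close to amplitude/π.

```python
import math


def dc_component(signal):
    """DC component of a signal, taken here as the mean of its samples."""
    return sum(signal) / len(signal)


def relu(signal):
    """Element-wise ReLU: negative samples are clipped to zero."""
    return [max(0.0, sample) for sample in signal]


# Hypothetical test tone: 10 full cycles over 1000 samples, zero mean.
num_samples = 1000
cycles = 10
amplitude = 1.0
tone = [
    amplitude * math.sin(2 * math.pi * cycles * k / num_samples)
    for k in range(num_samples)
]

dc_before = dc_component(tone)       # essentially zero
dc_after = dc_component(relu(tone))  # close to amplitude / pi
```

In a network with 96 output signals, one such DC value per signal would form the 96-D vector the abstract refers to; how that vector is mapped to a quality or intelligibility value is not specified here, so it is left out of the sketch.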
