Gabriel Mittag scite author profile

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets

In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end, and the time-dependency modelling and time-pooling are achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and was shown to provide reliable predictions even for unknown speech samples. The code, model weights, and datasets are open-sourced.

Non-intrusive Speech Quality Assessment for Super-wideband Speech Communication Networks

Mittag, Möller, 2019

Deep Learning Based Assessment of Synthetic Speech Naturalness

Mittag, Möller, 2020

In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works independently of language. The model is trained end-to-end and is based on a CNN-LSTM network that was previously shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including datasets from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.

Neural correlates of speech quality dimensions analyzed using electroencephalography (EEG)

et al. 2019

Objective. By means of subjective psychophysical methods, the quality of transmitted speech has been decomposed into three perceptual dimensions named 'discontinuity' (F), 'noisiness' (N) and 'coloration' (C). Previous studies using electroencephalography (EEG) have already reported effects of the perceived intensity of single quality dimensions on electrical brain activity. However, it has not yet been investigated whether the dimensions themselves are dissociable on a neurophysiological level of analysis. Approach. Pursuing this goal in the present study, a high-quality (HQ) recording of a spoken word was degraded on one dimension at a time, resulting in three quality-impaired stimuli (F, N, C) which were on average described as being equal in perceived degradation intensity. Participants performed a three-stimulus oddball task, involving the serial presentation of different stimulus types: (1) HQ or degraded 'standard' stimuli to establish sensory/perceptual quality references, and (2) degraded 'oddball' stimuli to cause random, infrequent deviations from those references. EEG was employed to examine the neuro-electrical correlates of speech quality perception. Main results. Emphasis was placed on modulations in temporal and morphological characteristics of the P300 component of the event-related brain potential (ERP), whose subcomponents P3a and P3b are commonly linked to attentional orienting and task relevance categorization, respectively. Electrophysiological data analysis (N = 28) revealed significant modulations of P300 amplitude and latency by the perceptual dimensions underlying both quality references and oddball stimuli. Significance. The present study exemplifies the utility of physiological methods like EEG for dissociating speech degradations based not only on perceived intensity level, but also on their distinctive quality dimension.

P300 indicates context-dependent change in speech quality beyond phonological change

et al. 2019

Objective. Non-invasive physiological methods like electroencephalography (EEG) are increasingly employed to assess human information processing during exposure to multimedia signals. In the quality engineering field, previous research has promoted the utility of the P300 event-related brain potential (ERP) component for indicating variation in quality perception. The present study provides a starting point to test whether the P300 and its two subcomponents, P3a and P3b, are truly reflective of changes in the perceived quality of transmitted speech signals given the presence of other, quality-unrelated changes in acoustic stimulation. Approach. High-quality and degraded variants of spoken words were presented in a two-feature oddball task, which required participants to actively respond to rarely occurring ‘target’ stimuli within a series of frequent ‘standard’ stimuli, thereby eliciting ERP waveforms. Target presentations involved either single quality changes or concurrent double changes in quality and the initial phoneme. Main results. When an additional phonological change was present, only varying the quality of standard stimuli caused significant modulations in P3a and P3b characteristics (N = 32). Thus, the formation of different short-term quality references exerted a persisting influence on the auditory processing of transmitted speech. Significance. The obtained results elucidate the importance of contextual and content-related influencing factors for proving the validity of the P300 as a psychophysiological indicator of speech quality change. Associated questions regarding the transfer of ERP-based quality assessment into more practically relevant measurement contexts are discussed.
