In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end, and the time-dependency modelling and time-pooling are achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and was shown to provide reliable predictions also for unknown speech samples. The code, model weights, and datasets are open-sourced.
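As a rough illustration of what attention-based time-pooling means, the sketch below collapses a sequence of per-frame feature vectors into one vector by weighting each frame with a softmax-normalised score. It is a simplified, assumption-laden example and not the NISQA architecture: the function names `softmax` and `attention_pool`, the scoring vector `w`, and the toy feature values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Pool T frame vectors (each of length D) into a single length-D vector.

    frames: list of T feature vectors, one per time step
    w:      hypothetical scoring vector of length D (learned in a real model)
    Returns the pooled vector and the per-frame attention weights.
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(f * wi for f, wi in zip(frame, w)) for frame in frames]
    # Normalise the scores so the weights sum to 1 over time.
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted average of the frames along the time axis.
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three frames with two features each (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(frames, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(frames, [10.0, 0.0])
```

With an all-zero scoring vector every frame receives the same weight, so the result equals plain average pooling; a non-zero `w` shifts the weight towards frames that score highly, which is the property that lets a model emphasise the time segments most relevant to perceived quality.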
In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works independently of language. The model is trained end-to-end and based on a CNN-LSTM network that was previously shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, such as those from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
Objective. By means of subjective psychophysical methods, quality of transmitted speech has been decomposed into three perceptual dimensions named 'discontinuity' (F), 'noisiness' (N) and 'coloration' (C). Previous studies using electroencephalography (EEG) already reported effects of perceived intensity of single quality dimensions on electrical brain activity. However, it has not been investigated so far whether the dimensions themselves are dissociable on a neurophysiological level of analysis. Approach. Pursuing this goal in the present study, a high-quality (HQ) recording of a spoken word was degraded on one dimension at a time, resulting in three quality-impaired stimuli (F, N, C) which were on average described as being equal in perceived degradation intensity. Participants performed a three-stimulus oddball task, involving the serial presentation of different stimulus types: (1) HQ or degraded 'standard' stimuli to establish sensory/perceptual quality references, and (2) degraded 'oddball' stimuli to cause random, infrequent deviations from those references. EEG was employed to examine the neuro-electrical correlates of speech quality perception. Main results. Emphasis was placed on modulations in temporal and morphological characteristics of the P300 component of the event-related brain potential (ERP), whose subcomponents P3a and P3b are commonly linked to attentional orienting and task relevance categorization, respectively. Electrophysiological data analysis (N = 28) revealed significant modulations of P300 amplitude and latency by the perceptual dimensions underlying both quality references and oddball stimuli. Significance. The present study exemplifies the utility of physiological methods like EEG for dissociating speech degradations not only based on perceived intensity level, but also their distinctive quality dimension.
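The serial presentation in an oddball task can be pictured as a stream of frequent 'standard' trials interrupted by rare, randomly placed 'oddball' trials. The following sketch generates such a stream. It is only an illustration under stated assumptions: the oddball probability, the minimum number of standards between oddballs, the seed, and the name `oddball_sequence` are hypothetical and are not taken from the study.

```python
import random


def oddball_sequence(n_trials, oddball_prob=0.2, min_gap=2, seed=0):
    """Return a list of 'standard' / 'oddball' trial labels.

    oddball_prob: chance that an eligible trial becomes an oddball (assumed value)
    min_gap:      minimum number of standards between two oddballs (assumed value)
    seed:         fixed seed so the sequence is reproducible
    """
    rng = random.Random(seed)
    sequence = []
    # Start at min_gap so the very first trial is already eligible.
    standards_since_oddball = min_gap
    for _ in range(n_trials):
        eligible = standards_since_oddball >= min_gap
        if eligible and rng.random() < oddball_prob:
            sequence.append("oddball")
            standards_since_oddball = 0
        else:
            sequence.append("standard")
            standards_since_oddball += 1
    return sequence


trials = oddball_sequence(200)
oddball_positions = [i for i, label in enumerate(trials) if label == "oddball"]
```

Keeping oddballs infrequent and separated by several standards lets the standards establish the quality reference that the oddball then deviates from; in an actual experiment each label would be mapped to a concrete stimulus (for example HQ, F, N, or C).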
Objective. Non-invasive physiological methods like electroencephalography (EEG) are increasingly employed to assess human information processing during exposure to multimedia signals. In the quality engineering field, previous research has promoted the utility of the P300 event-related brain potential (ERP) component for indicating variation in quality perception. The present study provides a starting point to test whether the P300 and its two subcomponents, P3a and P3b, are truly reflective of changes in the perceived quality of transmitted speech signals given the presence of other, quality-unrelated changes in acoustic stimulation. Approach. High-quality and degraded variants of spoken words were presented in a two-feature oddball task, which required participants to actively respond to rarely occurring ‘target’ stimuli within a series of frequent ‘standard’ stimuli, thereby eliciting ERP waveforms. Target presentations involved either single quality changes or concurrent double changes in quality and the initial phoneme. Main results. When an additional phonological change was present, only varying the quality of standard stimuli caused significant modulations in P3a and P3b characteristics (N = 32). Thus, the formation of different short-term quality references exerted a persisting influence on the auditory processing of transmitted speech. Significance. The obtained results elucidate the importance of contextual and content-related influencing factors for proving the validity of the P300 as a psychophysiological indicator of speech quality change. Associated questions regarding the transfer of ERP-based quality assessment into more practically relevant measurement contexts are discussed.