In this paper, we present an update to the NISQA speech quality prediction model that is focused on distortions that occur in communication networks. In contrast to the previous version, the model is trained end-to-end and the time-dependency modelling and time-pooling is achieved through a Self-Attention mechanism. Besides overall speech quality, the model also predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness, and in this way gives more insight into the cause of a quality degradation. Furthermore, new datasets with over 13,000 speech files were created for training and validation of the model. The model was finally tested on a new, live-talking test dataset that contains recordings of real telephone calls. Overall, NISQA was trained and evaluated on 81 datasets from different sources and showed to provide reliable predictions also for unknown speech samples. The code, model weights, and datasets are open-sourced.
In this paper, we present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems and works language independently. The model is trained end-to-end and based on a CNN-LSTM network that previously showed to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, such as from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models that are trained on objective POLQA scores. The proposed model is made publicly available and can, for example, be used to evaluate different TTS system configurations.
Objective. By means of subjective psychophysical methods, quality of transmitted speech has been decomposed into three perceptual dimensions named 'discontinuity' (F), 'noisiness' (N) and 'coloration' (C). Previous studies using electroencephalography (EEG) already reported effects of perceived intensity of single quality dimensions on electrical brain activity. However, it has not been investigated so far, whether the dimensions themselves are dissociable on a neurophysiological level of analysis. Approach. Pursuing this goal in the present study, a high-quality (HQ) recording of a spoken word was degraded on each dimension at a time, resulting in three quality-impaired stimuli (F, N, C) which were on average described as being equal in perceived degradation intensity. Participants performed a three-stimulus oddball task, involving the serial presentation of different stimulus types:(1) HQ or degraded 'standard' stimuli to establish sensory/perceptual quality references. (2) Degraded 'oddball' stimuli to cause random, infrequent deviations from those references. EEG was employed to examine the neuro-electrical correlates of speech quality perception. Main results. Emphasis was placed on modulations in temporal and morphological characteristics of the P300 component of the event-related brain potential (ERP), whose subcomponents P3a and P3b are commonly linked to attentional orienting and task relevance categorization, respectively. Electrophysiological data analysis ( N = 28) revealed significant modulations of P300 amplitude and latency by the perceptual dimensions underlying both quality references and oddball stimuli. Significance. The present study exemplifies the utility of physiological methods like EEG for dissociating speech degradations not only based on perceived intensity level, but also their distinctive quality dimension.
Objective. Non-invasive physiological methods like electroencephalography (EEG) are increasingly employed to assess human information processing during exposure to multimedia signals. In the quality engineering field, previous research has promoted the utility of the P300 event-related brain potential (ERP) component for indicating variation in quality perception. The present study provides a starting point to test whether the P300 and its two subcomponents, P3a and P3b, are truly reflective of changes in the perceived quality of transmitted speech signals given the presence of other, quality-unrelated changes in acoustic stimulation. Approach. High-quality and degraded variants of spoken words were presented in a two-feature oddball task, which required participants to actively respond to rarely occurring ‘target’ stimuli within a series of frequent ‘standard’ stimuli, thereby eliciting ERP waveforms. Target presentations involved either single quality changes or concurrent double changes in quality and the initial phoneme. Main results. In case additional phonological change was present, only varying quality of standard stimuli caused significant modulations in P3a and P3b characteristics (N = 32). Thus, the formation of different short-term quality references exerted a persisting influence on the auditory processing of transmitted speech. Significance. The obtained results elucidate the importance of contextual and content-related influencing factors for proving the validity of the P300 as a psychophysiological indicator of speech quality change. Associated questions regarding the transfer of ERP-based quality assessment into more practically relevant measurement contexts are discussed.
Communication service providers use specialized solutions to evaluate the quality of their services. Also, different mechanisms that increase network robustness are incorporated in current communication systems. One of the most accepted techniques to improve transmission performance is the MIMO system. In communication services, voice quality is important to determine the user's quality of experience. Nowadays, different speech quality assessment methods are used, one of them is the parametric method that is used for network planning purposes. The ITU-T Rec. G107.1 is the most accepted model for wide-band communication systems. However, it does not consider the degradations occurring in a wireless network nor the quality improvement, caused by the MIMO systems. Thus, we propose a speech quality model, based on wireless parameters, such as signal-to-noise ratio, Doppler shift, MIMO configurations, and different modulation schemes. Also, real speech signals encoded by 3 operation modes of the AMR-WB codec are used in test scenarios. The resulting speech samples were evaluated by the algorithm described in ITU-T Rec. P.862.2, which scores are used as a reference. With these results, a wireless function, named I W −M that relates the wireless network parameters with speech quality is proposed and inserted into the wide-band E-model algorithm. It is worth noting that the main novelty of the proposed I W −M is the consideration of the quality improvement incorporated by MIMO systems with different antenna array configurations. The performance validation results demonstrated that the inclusion of I W −M values into the R global score determined a reliable model, reaching a Pearson correlation coefficient and a normalized RMSE of 0.976 and 0.144, respectively. INDEX TERMS Speech quality, wireless communications, MIMO systems, antenna arrays, fading channels, parametric models, E-model.
In this paper, we present a full-reference speech quality prediction model with a deep learning approach. The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network that shares the weights for both signals as input. The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality. The proposed network architecture represents a simple solution for the time-alignment problem that occurs for speech signals transmitted through Voice-Over-IP networks and shows how the clean reference signal can be incorporated into speech quality models that are based on end-to-end trained neural networks.
Classic public switched telephone networks (PSTN) are often a black box for VoIP network providers, as they have no access to performance indicators, such as delay or packet loss. Only the degraded output speech signal can be used to monitor the speech quality of these networks. However, the current state-of-the-art speech quality models are not reliable enough to be used for live monitoring. One of the reasons for this is that PSTN distortions can be unique depending on the provider and country, which makes it difficult to train a model that generalizes well for different PSTN networks. In this paper, we present a new open-source PSTN speech quality test set with over 1000 crowdsourced real phone calls. Our proposed noreference model outperforms the full-reference POLQA and noreference P.563 on the validation and test set. Further, we analyzed the influence of file cropping on the perceived speech quality and the influence of the number of ratings and training size on the model accuracy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.