Abstract. Speech quality, as perceived by the users of Voice over Internet Protocol (VoIP) telephony, is critically important to the uptake of this service. VoIP quality can be degraded by network layer problems (delay, jitter, packet loss). This paper presents a method for real-time, non-intrusive speech quality estimation for VoIP that emulates the subjective listening quality measures based on Mean Opinion Scores (MOS). MOS provide the numerical indication of perceived quality of speech. We employ a Genetic Programming based symbolic regression approach to derive a speech quality estimation model. Our results compare favorably with the International Telecommunications Union-Telecommunication Standardization (ITU-T) PESQ algorithm which is the most widely accepted standard for speech quality estimation. Moreover, our model is suitable for real-time speech quality estimation of VoIP while PESQ is not. The performance of the proposed model was also compared to the new ITU-T recommendation P.563 for non-intrusive speech quality estimation and an improved performance was observed.
Voice over IP (VoIP) speech quality estimation is crucial to providing optimal Quality of Service (QoS). This paper seeks to provide improved speech quality estimation models with better prediction accuracy by considering a richer set of input features than the current International Telecommunications Union-Telecommunication (ITU-T) recommendations. It addresses a transitional phase, where wideband (WB) networks are becoming available. However, they have to co-exist with the existing narrowband (NB) setups for the time being. Quality estimation becomes a challenge in such a mixed context. The ITU-T recommendation (termed EModel) has recently been extended to deal with the mixed context. However, it evaluates the speech degradation in the WB scenario based solely on codec related distortions (only a subset of factors affecting the speech quality on a VoIP network). The extension is derived out of speech signals evaluated by human subjects: an expensive and difficult to reproduce exercise. This paper innovates by considering a number of other network distortion types as well to produce generalised models that predict the quality degradation to a higher accuracy. To this end, an extensive set of speech samples is subjected to a wide variety of distortions. The degraded signals are evaluated by the currently best available algorithmic approximation of human evaluation of speech to produce quality scores. Using the distortions as the input features and targeting the quality scores, we employ Genetic Programming to produce parsimonious models that show considerable prediction gain compared to the E-Model. As against some existing approaches, where the models are tailored to various telephony codecs, the evolved models generalise across a variety of modern codecs.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.