2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX)
DOI: 10.1109/qomex48832.2020.9123150

ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

Abstract: Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to …

Cited by 68 publications (38 citation statements). References 17 publications.
“…These are compared via an adaptation of the structural similarity index, originally developed for evaluating the quality of compressed images and then adapted to predict intelligibility [47]. Version 3 was recently released [48], [49] and it is here referred to as ViSQOLAudioV3. The declared aim for this new version is to "fill the blind spots in the training/validation datasets" so as to have a more general system that would perform better "in the wild".…”
Section: G. ViSQOLAudio
confidence: 99%
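The SSIM-style comparison mentioned in this statement combines an intensity term and a structure term computed over corresponding spectrogram regions. As a rough illustration only (the per-frame simplification and the constants below are assumptions, not ViSQOL's actual implementation), the shape of such a similarity could be sketched in Python as:

import numpy as np

def nsim_frame(ref_frame, deg_frame, c1=0.01, c2=0.03):
    """SSIM-style similarity between one reference and one degraded
    spectrogram frame: an intensity (luminance) term times a structure term.
    c1 and c2 are illustrative stabilisation constants, not ViSQOL's values."""
    mu_r, mu_d = ref_frame.mean(), deg_frame.mean()
    sigma_r, sigma_d = ref_frame.std(), deg_frame.std()
    sigma_rd = ((ref_frame - mu_r) * (deg_frame - mu_d)).mean()
    intensity = (2 * mu_r * mu_d + c1) / (mu_r**2 + mu_d**2 + c1)
    structure = (sigma_rd + c2) / (sigma_r * sigma_d + c2)
    return intensity * structure

A score near 1 indicates that the degraded frame preserves both the energy and the spectral structure of the reference frame.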
“…For the purposes of this study, this 'language' feature will be a laboratory identifier where the native language is used to test, and also encompasses other factors in the entire test environment such as the culture of the laboratory, the listening equipment, and so on. Each rater and language identifier is used as an index variable with normal priors that linearly influence the φ offset for the ordered logit model, and an exponential model for NSIM, as was found to be useful in [7]. The prior for φ_i can be described for individual observation i, rater j, and language k, and NSIM observation x_i as…”
Section: E. Features and Parameters
confidence: 99%
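The excerpt above is truncated before the cited paper's actual equation, so the following is only a hypothetical reading of the structure it describes: additive per-rater and per-language offsets with normal priors, plus an exponential term in the NSIM observation, all entering the φ offset of the ordered logit model. The symbols a, b, and σ are illustrative assumptions:

\[
\varphi_i = \beta^{\text{rater}}_{j[i]} + \beta^{\text{lang}}_{k[i]} + a\,e^{b x_i},
\qquad
\beta^{\text{rater}}_{j},\ \beta^{\text{lang}}_{k} \sim \mathcal{N}(0,\sigma^2)
\]

This should not be read as the equation from the cited work, only as a sketch of the dependency structure the sentence describes.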
“…Different kinds of objective models exist depending on the speech applications and services. Models such as POLQA [2], PESQ [3], and ViSQOL [4,5] have been shown to work well for a wide variety of coding, channel and environmental degradations to the speech signal. They are full-reference (FR) metrics that compare a clean reference to a test signal that has been degraded.…”
Section: Introduction
confidence: 99%
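As a concrete illustration of full-reference usage, the open-source ViSQOL binary is invoked with both the clean and the degraded signal; to the best of my knowledge the flag names below match the project's documented command line, but they should be treated as assumptions rather than quoted verbatim:

./visqol --reference_file ref.wav --degraded_file deg.wav --use_speech_mode

The --use_speech_mode flag selects the speech model (ViSQOL); without it the tool scores general audio (ViSQOLAudio).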
“…They pre-align the signals in order to account for quality issues resulting from delay and signal corruption. For example, the ViSQOL metric [4,5] uses the neurogram similarity index measure (NSIM) to estimate the similarity between a pre-aligned reference patch and a degraded spectrogram patch frame by frame.…”
Section: Introduction
confidence: 99%
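The pre-alignment step mentioned here can be pictured as a search for the offset in the degraded spectrogram that best matches each reference patch, before any frame-by-frame similarity is computed. A minimal sketch of that idea in Python (the function name, the energy-correlation criterion, and search_range are illustrative assumptions, not ViSQOL's actual alignment algorithm):

import numpy as np

def align_patch(ref_patch, deg_spectrogram, search_range=30):
    """Return the column offset in deg_spectrogram whose window best matches
    ref_patch, judged by correlation of per-frame energies. Sketch only."""
    n_frames = ref_patch.shape[1]
    ref_energy = ref_patch.sum(axis=0)
    best_offset, best_corr = 0, -np.inf
    max_offset = min(search_range, deg_spectrogram.shape[1] - n_frames + 1)
    for offset in range(max_offset):
        deg_energy = deg_spectrogram[:, offset:offset + n_frames].sum(axis=0)
        corr = np.corrcoef(ref_energy, deg_energy)[0, 1]
        if corr > best_corr:
            best_offset, best_corr = offset, corr
    return best_offset

Once aligned, the per-frame NSIM scores over the matched window are averaged into a patch-level similarity, which ViSQOL then maps to a quality score.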