Describing Subjective Experiment Consistency by p-Value P--P Plot

Nawała, Jakub; Janowski, Lucjan; Ćmiel, Bogdan; Rusek, Krzysztof

doi:10.1145/3394171.3413749

Cited by 10 publications

(15 citation statements)

References 36 publications

“…It is worth mentioning that we already proposed in the past a tool based on the GSD class. The tool extends possible ways to validate data consistency of responses obtained during a MQA subjective experiment [8]. Importantly, our work was noticed by practitioners in the MQA field and referred to in [9], [10], and [11].…”

Section: Introduction (mentioning)

confidence: 78%

“…The black line is the upper bound of 95% right-sided confidence interval for the CDF of p-values under the null hypothesis. Specifically, under the null hypothesis, the CDF of p-values is not greater than the uniform distribution function (for more details see [8]). As one can see, there is no evidence that the GSD is not the correct way of modelling subjective responses from MQA experiments.…”

Section: A Comparing Goodness-of-fit Of Ordered Probit and Gsd For Mu... (mentioning)

confidence: 99%
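The citation statement above describes comparing the empirical CDF of p-values against an upper bound that holds under the null hypothesis. A minimal sketch of that comparison follows. It assumes a pointwise one-sided 95% bound from the normal approximation to a binomial proportion; the exact bound used in the cited work is not given in this excerpt, and the names `pp_plot_points` and `is_consistent` are hypothetical.

```python
import math


def pp_plot_points(p_values, alphas=None):
    """Return (alpha, observed_fraction, threshold) triples for a p-value P-P plot.

    Assumption: the threshold is a pointwise one-sided 95% upper bound based on
    the normal approximation to a binomial proportion. It illustrates the idea
    and is not confirmed to be the formula used in the cited paper.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if alphas is None:
        alphas = [i / 100 for i in range(1, 21)]  # significance levels 0.01 .. 0.20
    z = 1.6449  # one-sided 95% quantile of the standard normal distribution
    points = []
    for alpha in alphas:
        # Empirical CDF of the p-values evaluated at alpha.
        observed = sum(p <= alpha for p in p_values) / n
        # Under the null hypothesis the fraction should stay near alpha.
        threshold = alpha + z * math.sqrt(alpha * (1 - alpha) / n)
        points.append((alpha, observed, threshold))
    return points


def is_consistent(p_values, alphas=None):
    """True if the empirical CDF never rises above the threshold line."""
    return all(obs <= thr for _, obs, thr in pp_plot_points(p_values, alphas))
```

Plotting `observed` against `alpha` together with the `threshold` curve gives the P-P plot; points above the curve would suggest more small p-values than the null hypothesis allows.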

Generalised Score Distribution: Underdispersed Continuation of the Beta-Binomial Distribution

Ćmiel, Nawała, Janowski, et al. 2022. Preprint. Self cite.

A class of discrete probability distributions contains distributions with limited support. A typical example is some variant of a Likert scale, with response mapped to either the {1, 2, . . . , 5} or {−3, −2, . . . , 2, 3} set. An interesting subclass of discrete distributions with finite support are distributions limited to two parameters and having no more than one change in probability monotonicity. The main contribution of this paper is to propose a family of distributions fitting the above description, which we call the Generalised Score Distribution (GSD) class. The proposed GSD class covers the whole set of possible mean and variances, for any fixed and finite support. Furthermore, the GSD class can be treated as an underdispersed continuation of a reparametrized beta-binomial distribution. The GSD class parameters are intuitive and can be easily estimated by the method of moments. We also offer a Maximum Likelihood Estimation (MLE) algorithm for the GSD class and evidence that the class properly describes response distributions coming from 24 Multimedia Quality Assessment experiments. At last, we show that the GSD class can be represented as a sum of dichotomous zero-one random variables, which points to an interesting interpretation of the class.

“…Fig. 2 from [1]). Effectively, recreating these results is the most significant part of the reproducibility efforts.…”

Section: $ python3 reproduce.py -h (mentioning)

confidence: 96%

“…Another two important files in the repo are: (i) subjective_quality_datasets.csv and (ii) G_test_results.csv. The former one includes raw subjective data that is processed in the original paper [1]. The most important output of this processing is the G_test_results.csv file.…”

Section: $ python3 reproduce.py -h (mentioning)

confidence: 99%

Describing Subjective Experiment Consistency by p-Value P--P Plot

Nawała, Janowski, Ćmiel, et al. 2020. Proceedings of the 28th ACM International Conference on Multimedia. Self cite.

There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify the results of a subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground truth data. Knowing whether subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify a subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli with an irregular score distribution. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, makes it possible to construct a p-value P-P plot that visualizes an experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure…
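The abstract pairs a fitted score distribution with a bootstrapped G-test of goodness-of-fit to obtain one p-value per stimulus. The sketch below illustrates that pattern with a parametric bootstrap. Because the GSD probability mass function is not given in this excerpt, a shifted binomial fitted by the method of moments stands in for it; `fit_shifted_binomial`, `g_statistic`, and `bootstrap_g_test` are hypothetical names, and the authors' actual procedure may differ.

```python
import math
import random


def g_statistic(observed, expected_probs):
    """G = 2 * sum(O * ln(O / E)) over categories with O > 0."""
    n = sum(observed)
    total = 0.0
    for o, p in zip(observed, expected_probs):
        if o > 0:
            total += o * math.log(o / (n * p))
    return 2.0 * total


def fit_shifted_binomial(counts):
    """Stand-in model (NOT the GSD): binomial on categories 0..m, mean matched."""
    n = sum(counts)
    m = len(counts) - 1
    mean = sum(i * c for i, c in enumerate(counts)) / n
    p = mean / m
    return [math.comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(m + 1)]


def bootstrap_g_test(observed, fit=fit_shifted_binomial, n_boot=1000, seed=0):
    """Parametric-bootstrap p-value for the G-test of goodness-of-fit."""
    rng = random.Random(seed)
    n, k = sum(observed), len(observed)
    probs = fit(observed)
    g_obs = g_statistic(observed, probs)
    exceed = 0
    for _ in range(n_boot):
        # Simulate a sample of the same size from the fitted model.
        counts = [0] * k
        for s in rng.choices(range(k), weights=probs, k=n):
            counts[s] += 1
        # Refit on the simulated sample so estimation error is accounted for.
        if g_statistic(counts, fit(counts)) >= g_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Calling `bootstrap_g_test` on the score counts of each stimulus (for example, counts of ratings 1 to 5) yields the p-values that a P-P plot would then summarize.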

“…Modelling the individual listener score will allow for the model to be able to take into account this rater data, accounting for the rater bias. Additionally, many researchers have pointed out issues with using MOS as the primary quality metric and have proposed alternatives [20]-[22]. Modeling the individual rater score allows using the model for other metrics that are alternatives or complements to MOS.…”

Section: Introduction (mentioning)

confidence: 99%

Marginal Effects of Language and Individual Raters on Speech Quality Models

Chinen. 2021. IEEE Access.

Speech quality is often measured via subjective testing, or with objective estimators of mean opinion score (MOS) such as ViSQOL or POLQA. Typical MOS-estimation frameworks use signal level features but do not use language features that have been shown to have an effect on opinion scores. If there is a conditional dependence between score and language given these signal features, introducing language and rater predictors should provide a marginal improvement in predictions. The proposed method uses Bayesian models that predict the individual opinion score instead of MOS. Several models that test various combinations of predictors were used, including predictors that capture signal features, such as frequency band similarity, as well as features that are related to the listener, such as a language and rater index. The models are fit to the ITU-T P. Supplement 23 dataset, and posterior samples are drawn from distributions of both the model parameters and the resulting opinion score outcomes. These models are compared to MOS models by integrating over posterior samples per utterance. An experiment was conducted by ablating different predictors for several types of Bayesian hierarchical models (including ordered logistic and truncated normal individual score distributions, as well as MOS distributions) to find the marginal improvement of language and rater. The models that included language and/or rater obtained significantly lower errors (0.601 versus 0.684 root-mean-square error (RMSE)) and higher correlation. Additionally, individual rater models matched or exceeded the performance of MOS models.
