2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX)
DOI: 10.1109/qomex48832.2020.9123115
Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach

Cited by 8 publications (4 citation statements)
References 6 publications
“…The larger number of simulation runs leads to a smoother scatter plot and a more accurate fit. In our previous paper [40] we used 200 runs; although similar results were observed there, the fitted functions showed only minor changes. Increasing the number of runs also leads to smaller confidence interval widths.…”
Section: Discussion and Future Work (supporting)
confidence: 60%
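The relationship this citing passage describes — more simulation runs yielding narrower confidence intervals — follows the usual 1/√n scaling of the standard error. A minimal sketch, assuming normally distributed scores with illustrative parameters (mean 3.5, standard deviation 0.8, not taken from the paper):

```python
import random
import statistics

def ci_width(num_runs, seed=0):
    """Width of the 95% confidence interval of the mean of simulated
    quality scores (normal approximation). Parameters are illustrative."""
    rng = random.Random(seed)
    scores = [rng.gauss(3.5, 0.8) for _ in range(num_runs)]
    sem = statistics.stdev(scores) / num_runs ** 0.5  # standard error
    return 2 * 1.96 * sem

w_200 = ci_width(200)     # 200 runs, as in the cited prior paper [40]
w_1000 = ci_width(1000)   # more runs -> narrower interval
```

With the seed fixed, `w_1000` comes out roughly √5 times smaller than `w_200`, matching the 1/√n behaviour the passage alludes to.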
“…Here we use the simulations and matrices mentioned in Chapter 5. Some parts of these results have been published in [40]. In this paper, we extend them by considering larger simulation runs, more QoE metrics, and a method for aggregating the results of all metrics.…”
Section: Results (mentioning)
confidence: 99%
“…A previous study showed that the ITU-T Rec. P.808 provides a valid and reliable approach for speech quality assessment in crowdsourcing [5]. We provide an open-source implementation of the ITU-T Rec.…”
Section: Introduction (mentioning)
confidence: 99%
“…It is also recommended to remove submissions that show specific patterns in the ratings, or that are flagged by outlier detection methods. Although previous works showed that applying the best practices produces highly reliable and valid measurements across multiple studies (with some variation between them) [8], [9], there is no guarantee or method to evaluate this in the absence of ground truth.…”
mentioning
confidence: 99%
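The outlier-based submission removal this passage mentions can be sketched with a simple z-score rule on per-worker mean ratings. This is only one of several checks a real P.808-style pipeline would apply (gold questions, rating patterns, environment tests); the function name, threshold, and data below are illustrative assumptions, not the cited works' method:

```python
import statistics

def flag_outlier_workers(ratings_by_worker, z_thresh=1.4):
    """Flag workers whose mean vote deviates strongly from the crowd mean.

    ratings_by_worker: dict mapping worker id -> list of MOS votes (1-5).
    Returns the set of flagged worker ids (hypothetical helper; a simple
    z-score rule, not the full set of P.808 screening checks).
    """
    means = {w: statistics.mean(v) for w, v in ratings_by_worker.items()}
    overall = statistics.mean(means.values())
    spread = statistics.stdev(means.values())
    if spread == 0:
        return set()
    return {w for w, m in means.items() if abs(m - overall) / spread > z_thresh}

ratings = {
    "w1": [3, 4, 3, 4],
    "w2": [4, 3, 4, 3],
    "w3": [3, 3, 4, 4],
    "w4": [5, 5, 5, 5],  # suspicious straight-lining pattern
}
flagged = flag_outlier_workers(ratings)
```

As the passage notes, such rules improve reliability in practice but cannot certify validity without ground truth: a consistently honest but atypical rater is indistinguishable from a spammer by score statistics alone.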