Interspeech 2020
DOI: 10.21437/interspeech.2020-2760
DNN No-Reference PSTN Speech Quality Prediction

Abstract: Classic public switched telephone networks (PSTN) are often a black box for VoIP network providers, as they have no access to performance indicators, such as delay or packet loss. Only the degraded output speech signal can be used to monitor the speech quality of these networks. However, the current state-of-the-art speech quality models are not reliable enough to be used for live monitoring. One of the reasons for this is that PSTN distortions can be unique depending on the provider and country, which makes i…
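To make the no-reference setting concrete, the sketch below shows one way a single-ended DNN predictor can map a degraded recording directly to a MOS estimate, with no clean reference signal involved. It is an illustrative stand-in built on PyTorch and torchaudio, not the architecture proposed in this paper; the layer sizes, the 8 kHz narrowband sample rate, and the mel-spectrogram front end are all assumptions.

# Minimal sketch of a no-reference (single-ended) MOS predictor: a small CNN
# over mel-spectrogram frames, pooled to one clip-level score. Illustrative
# only; not the model described in the paper.
import torch
import torch.nn as nn
import torchaudio

class NoRefMosPredictor(nn.Module):
    def __init__(self, sample_rate=8000, n_mels=48):
        super().__init__()
        # PSTN speech is narrowband, so an 8 kHz sample rate is assumed here.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=512, hop_length=128, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # pool over time and frequency
        )
        self.head = nn.Linear(32, 1)   # regress a single MOS value

    def forward(self, waveform):       # waveform: (batch, samples)
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (B, 1, mels, frames)
        pooled = self.cnn(feats).flatten(1)                      # (B, 32)
        return self.head(pooled).squeeze(1)                      # (B,) predicted MOS

# Usage: score a 5-second degraded clip; no clean reference is needed.
model = NoRefMosPredictor()
clip = torch.randn(1, 5 * 8000)        # placeholder for a real recording
print(model(clip))                     # one MOS-like score per clip

Training such a model only requires pairs of degraded clips and subjective MOS labels, which matches the PSTN monitoring scenario where the clean input signal is unavailable.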

Cited by 15 publications (7 citation statements)
References 31 publications
“…training set file was rated by 5 participants, while the test set files were rated by 30 participants to ensure a low confidence interval of the MOS values for the model evaluation. For more details, please refer to [30].…”
Section: PSTN Corpus
confidence: 99%
“…Such datasets include more than 200 hours of speech samples degraded with common degradations experienced over conferencing applications plus speech synthesis and voice conversion samples. This study considered the synthesised speech dataset VoiceMOS [7], plus four sets of datasets with speech conferencing distortions, namely: Tencent (2 datasets), NISQA (7 datasets) [8], IU-Bloomington (2 datasets) [13,14] and PSTN [15]. These datasets were used to build subsets to fine-tune the wav2vec 2.0 pre-trained model targeting specific speech scenarios.…”
Section: Datasets
confidence: 99%
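As a rough illustration of the fine-tuning pattern described in the statement above, the sketch below attaches a small regression head to a pretrained wav2vec 2.0 encoder and trains both jointly on MOS-labelled clips via the Hugging Face transformers library. The checkpoint name, pooling choice, and head design are assumptions for illustration, not details taken from the cited work or from [15].

# Hedged sketch: fine-tuning a pretrained wav2vec 2.0 encoder for MOS
# regression. Checkpoint and head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class Wav2Vec2MosRegressor(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values):   # raw 16 kHz waveform, (B, samples)
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        pooled = hidden.mean(dim=1)                            # mean-pool over time
        return self.head(pooled).squeeze(1)                    # (B,) predicted MOS

model = Wav2Vec2MosRegressor()
mos_pred = model(torch.randn(2, 16000))          # two one-second placeholder clips
loss = nn.functional.mse_loss(mos_pred, torch.tensor([3.2, 4.1]))
loss.backward()                                  # updates encoder and head jointly

Mean-pooling the frame-level representations keeps the added head tiny, so most of the capacity being adapted during fine-tuning comes from the pretrained encoder itself.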
“…The experiment seeks to understand the level of influence that fine-tuning data size has on speech quality predictions. Labelled speech datasets containing conferencing degradations (Tencent, IU Bloomington [13,14], NISQA Corpus [8], and PSTN [15]) and synthesised speech (VoiceMOS [7]) were used to build the fine-tuning datasets and test the resulting models.…”
Section: Introduction
confidence: 99%
“…As machine learning (ML) has become more powerful and accessible, numerous research groups have sought to apply ML to develop NR tools [17]-[50]. Some of these NR tools produce estimates of subjective test scores that report speech or sound quality mean opinion score (MOS) [17]-[19], [24]-[27], [30], [35], [38], [40], [41], [46], [47], naturalness [28], [34], [36], listening effort [23], noise intrusiveness [47], and speech intelligibility [20], [32]. The non-intrusive speech quality assessment model called NISQA [50] produces estimates of subjective speech quality as well as four constituent dimensions: noisiness, coloration, discontinuity, and loudness.…”
Section: A. Existing Machine Learning Approaches
confidence: 99%