Perceptual evaluation of singing quality

Gupta, Chitralekha; Li, Haizhou; Wang, Ye

doi:10.1109/apsipa.2017.8282110

Cited by 28 publications

(50 citation statements)

References 16 publications


“…where σ_S and σ_T are the standard deviations of signals S and T, respectively. Besides PCC, we also examine the Frame Disturbance between the converted prosody and the reference [41,42]. We first perform dynamic programming (DTW) to obtain the frame alignment between the original target and converted F0 contour, and calculate the number [25] and PSR [26]) and the traditional linear F0 conversion.…”

Section: Methods (mentioning)

confidence: 99%
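
The Pearson correlation coefficient (PCC) named in this statement divides the covariance of two signals by the product of their standard deviations σ_S and σ_T. The following is a minimal Python sketch of that formula, not the cited authors' code; the function name `pearson` and the toy F0 values are illustrative assumptions.

```python
import math


def pearson(s, t):
    """Pearson correlation between two equal-length sequences S and T."""
    if len(s) != len(t) or len(s) < 2:
        raise ValueError("sequences must have the same length (>= 2)")
    n = len(s)
    mean_s = sum(s) / n
    mean_t = sum(t) / n
    # Covariance and standard deviations share the same 1/n factor,
    # so it cancels and is omitted here.
    cov = sum((a - mean_s) * (b - mean_t) for a, b in zip(s, t))
    sigma_s = math.sqrt(sum((a - mean_s) ** 2 for a in s))
    sigma_t = math.sqrt(sum((b - mean_t) ** 2 for b in t))
    return cov / (sigma_s * sigma_t)


# Hypothetical F0 contours in Hz, already aligned frame by frame.
reference_f0 = [200.0, 210.0, 220.0, 215.0, 205.0]
converted_f0 = [198.0, 212.0, 223.0, 214.0, 203.0]
print(pearson(reference_f0, converted_f0))
```

A value close to 1 indicates that the converted contour rises and falls together with the reference; the sketch assumes the two contours have already been time-aligned.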

Phonetically Aware Exemplar-Based Prosody Transformation

Şişman, Lee, Li

2018

The Speaker and Language Recognition Workshop (Odyssey 2018)

Self Cite


In this paper, we propose a novel prosody transformation framework for voice conversion by making use of phonetic information. The proposed framework is motivated by two observations. Firstly, the phonetic prosody is an important aspect of speech prosody, that is influenced by the phonetic content of utterances. We propose the use of phone-dependent dictionaries, or phonetic dictionary, that allows for effective phonetic prosody conversion. Secondly, in the traditional exemplar-based sparse representation frameworks, the estimated activation matrix highly depends on the source speech that is not the best for generating target speech. We propose to incorporate Phonetic PosteriorGrams (PPGs), that represent frame-level phonetic information, as part of the exemplars of the dictionaries. As the exemplars now consist of PPGs that are expected to be speaker-independent, the resulting activation matrix depends less on the source speaker, thus represents a better transformation function for prosody transformation. The experiments show that the proposed prosody transformation framework outperforms the traditional frameworks in both objective and subjective evaluations.


“…That is, a higher weighting is used for localized distortions in PESQ score computation. Motivated by this approach, we applied this concept of audio quality perception for singing quality assessment in our previous work [8] to obtain a novel PESQ-like singing quality score. PESQ combines the frame-level disturbance values of a degraded audio with respect to the original audio by computing the L_6 norm over split-second intervals, i.e.…”

Section: Cognitive Modeling: Localized Versus Distributed Errors (mentioning)

confidence: 99%
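
Why a high-order norm favours localized errors can be seen in a short Python sketch. It is only an illustration under stated assumptions: the norm is taken in its normalized form (mean of p-th powers), the interval length of 20 frames is a placeholder, and combining the interval values with an outer L_2 norm is assumed rather than taken from this statement. The names `lp_norm` and `aggregate_disturbance` are hypothetical.

```python
def lp_norm(values, p):
    """Normalized L_p norm: (mean(|v|^p))^(1/p)."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


def aggregate_disturbance(disturbance, interval=20, p_inner=6, p_outer=2):
    """L_6 norm inside each interval, then L_2 norm across intervals.

    `interval`, `p_inner` and `p_outer` are illustrative defaults.
    """
    chunks = [disturbance[k:k + interval]
              for k in range(0, len(disturbance), interval)]
    per_interval = [lp_norm(chunk, p_inner) for chunk in chunks]
    return lp_norm(per_interval, p_outer)


# Two toy disturbance tracks with the same mean disturbance (1.0):
distributed = [1.0] * 20          # small error in every frame
localized = [20.0] + [0.0] * 19   # one large error in a single frame

print(aggregate_disturbance(distributed))
print(aggregate_disturbance(localized))
```

Both tracks have the same average disturbance, yet the localized one scores far higher under the L_6 norm, which is the weighting behaviour the statement describes.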

“…The value of p in L_p norm is higher for averaging over split-second intervals, to give more weight to localized disturbances than distributed disturbances. In our previous work [8], we applied the same idea of L_6 and L_2 norm to the frame disturbances computed from the dynamic time warping (DTW) optimal path deviation from the diagonal, for a test singing with respect to the reference singing. We applied it to different pitch and rhythm acoustic features, as will be discussed in Section III.4.…”

Section: Cognitive Modeling: Localized Versus Distributed Errors (mentioning)

confidence: 99%
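
The notion of frame disturbance as the deviation of the DTW optimal path from the diagonal can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: it uses a plain absolute-difference local cost, the three standard step directions, and defines the diagonal as the straight line joining the first and last aligned frames. The names `dtw_path` and `frame_disturbance` are hypothetical.

```python
def dtw_path(test, ref):
    """Return the optimal DTW alignment path as (test_idx, ref_idx) pairs."""
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(test[i - 1] - ref[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the last cell to the first along cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def frame_disturbance(path, n, m):
    """Deviation of each path point from the diagonal (0,0) -> (n-1, m-1)."""
    if n < 2:
        return [0.0 for _ in path]
    return [abs(j - i * (m - 1) / (n - 1)) for i, j in path]


# Toy pitch tracks: the test singer holds the first note two frames too long.
ref_pitch = [0, 1, 2, 3, 4, 5]
test_pitch = [0, 0, 0, 1, 2, 3, 4, 5]
path = dtw_path(test_pitch, ref_pitch)
print(frame_disturbance(path, len(test_pitch), len(ref_pitch)))
```

A perfectly timed rendition yields a path on the diagonal and zero disturbance everywhere; timing errors pull the path away from it, and the resulting per-frame values are what an L_p norm would then summarize.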

“…The idea of this method is to combine the acoustic features directly to predict human overall singing quality judgment score. This method is the standard way of computing the overall singing quality judgment score as reported in [4,5,8,15]. In our previous work [8], we generated the overall singing quality judgment score from a linear combination of the cognitive model-based and distance-based acoustic features.…”

Section: Early Fusion (mentioning)

confidence: 99%
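
Early fusion as described here, a linear combination of acoustic features that predicts the overall human score, can be sketched as an ordinary least-squares fit. The Python below is a generic illustration and makes no claim about how the weights were estimated in the cited work; the function names and the synthetic toy data are assumptions introduced for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def fit_early_fusion(features, scores):
    """Least-squares weights (bias term last) via the normal equations."""
    rows = [list(f) + [1.0] for f in features]
    d = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(d)]
           for a in range(d)]
    xty = [sum(r[a] * s for r, s in zip(rows, scores)) for a in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_vector):
    """Overall score as a weighted sum of the features plus the bias."""
    return sum(w * f for w, f in zip(weights, list(feature_vector) + [1.0]))


# Synthetic toy data: two features per singer and a made-up overall score
# generated as 2 * f1 - 1 * f2 + 3, used only to exercise the code.
toy_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
toy_scores = [3.0, 5.0, 2.0, 4.0, 6.0]
weights = fit_early_fusion(toy_features, toy_scores)
print(weights, predict(weights, (3.0, 2.0)))
```

The sketch assumes the feature columns are linearly independent; with real, correlated acoustic features a regularized or library-based solver would be the safer choice.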

“…It is expected that the volume variations across time for different singers performing the same song will show a similar pattern, thus is used as a common acoustic cue in existing systems [4,6,7]. In [8], we implemented the perceptual feature for volume as the DTW distance of the short-term log energy between the reference and the test (termed as volume_dist) for evaluation.…”

Section: Volume (mentioning)

confidence: 99%
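
A feature in the spirit of volume_dist, the DTW distance between the short-term log-energy contours of the reference and the test, might be computed as in the Python sketch below. The frame length, hop size, absolute-difference local cost and the lack of any length normalization are assumptions for illustration; the statement does not specify them.

```python
import math


def short_term_log_energy(samples, frame_len=20, hop=10, eps=1e-10):
    """Log energy of overlapping frames; eps avoids log(0) on silence."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(math.log(sum(v * v for v in frame) + eps))
    return energies


def dtw_distance(x, y):
    """Accumulated cost of the optimal DTW alignment between x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    return cost[n][m]


def volume_dist(reference, test):
    """DTW distance between the two short-term log-energy contours."""
    return dtw_distance(short_term_log_energy(reference),
                        short_term_log_energy(test))


# Toy signals: a sinusoid with growing amplitude and a quieter copy of it.
reference = [math.sin(0.3 * k) * (k / 50.0) for k in range(100)]
quieter = [0.5 * v for v in reference]
print(volume_dist(reference, reference), volume_dist(reference, quieter))
```

Identical volume contours give a distance of zero, and the value grows as the test singer's loudness pattern departs from the reference.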


A technical framework for automatic perceptual evaluation of singing quality

Gupta, Wang

2018

SIP

Self Cite


Human experts evaluate singing quality based on many perceptual parameters such as intonation, rhythm, and vibrato, with reference to music theory. We proposed previously the Perceptual Evaluation of Singing Quality (PESnQ) framework that incorporated acoustic features related to these perceptual parameters in combination with the cognitive modeling concept of the telecommunication standard Perceptual Evaluation of Speech Quality to evaluate singing quality. In this study, we present further the study of the PESnQ framework to approximate the human judgments. First, we find that a linear combination of the individual perceptual parameter human scores can predict their overall singing quality judgment. This provides us with a human parametric judgment equation. Next, the prediction of the individual perceptual parameter scores from the PESnQ acoustic features show a high correlation with the respective human scores, which means more meaningful feedback to learners. Finally, we compare the performance of early fusion and late fusion of the acoustic features in predicting the overall human scores. We find that the late fusion method is superior to that of the early fusion method. This work underlines the importance of modeling human perception in automatic singing quality assessment.
