Quality prediction of synthesized speech based on tensor structured EEG signals

Maki, Hayato; Sakti, Sakriani; Tanaka, Hiroki; Nakamura, Satoshi

doi:10.1371/journal.pone.0193521

Cited by 7 publications

(3 citation statements)

References 48 publications

(42 reference statements)


“…This study also found which frequency band is useful in order to reduce the complexity of models which will shorten the processing time. In comparison with [9], our study tried to generalize the approach across the subjects while the previous work was done within subject. Therefore, our approach may reduce the prediction performance.…”

Section: Discussion (mentioning)

confidence: 99%

“…In [8], they proposed brain computer interface-based equation to predict quality of experience MOS, and achieved 1.00 of root mean squared error (RMSE) between actual and predicted MOS. In addition, by using tensor representation of all channels and all frequency bands, a study conducted by [9] shows that EEG signals could be used to predict MOS, valence, and arousal within the same subject. We also previously examined which EEG electrodes, frequency bands, and time length significantly represent perceived speech quality in Japanese using the generalized fisher scores [10].…”

Section: Introduction (mentioning)

confidence: 99%


Speech Quality Evaluation of Synthesized Japanese Speech Using EEG

Parmonangan, Tanaka, Sakti, et al. 2019

Interspeech 2019

Self Cite


As synthesized speech technology becomes more widely used, the quality of synthesized speech must be assessed to ensure that it is acceptable. Subjective evaluation metrics, such as mean opinion score (MOS), can only provide an overall impression without any further detailed information about the speech. Therefore, this study proposes predicting speech quality using electroencephalography (EEG), which is more objective and has high temporal resolution. In this paper, we use one natural speech and four types of synthesized speech lasting two to six seconds. First, to obtain ground-truth MOS, we gathered ten subjects to give an opinion score on a scale of one to five for each recording. Second, another nine subjects were asked to rate how close to natural speech each synthesized speech sounded. The subjects' EEGs were recorded while they were listening to and evaluating the speech. The best classification accuracy achieved was 96.61% using a support vector machine, 80.36% using linear discriminant analysis, and 59.9% using logistic regression. For regression, we achieved a root mean squared error as low as 1.133 using SVR and 1.353 using linear regression. This study demonstrates that EEG could be used to evaluate perceived speech quality objectively.


“…Apart from the naturalness and understandability of contents, listening tests can also measure the distinguishability of characters or the degree of entertainment [3]. The subjective scales for rating the synthesized speech may include only a few scored parameters, such as an overall impression by a mean opinion score (MOS) describing the perceived speech quality from poor to excellent, a valence from negative to positive, and an arousal from unexcited to excited [4]. The MOS scale can be used not only for naturalness, but for different dimensions, such as affect (from negative to positive) or speaking style (from irritated to calm) as well [5].…”

Section: Introduction (mentioning)

confidence: 99%

GMM-Based Evaluation of Synthetic Speech Quality Using 2D Classification in Pleasure-Arousal Scale

Přibil, Přibilová, Matoušek 2020

Applied Sciences


The paper focuses on the description of a system for the automatic evaluation of synthetic speech quality based on the Gaussian mixture model (GMM) classifier. The speech material originating from a real speaker is compared with synthesized material to determine similarities or differences between them. The final evaluation order is determined by distances in the Pleasure-Arousal (P-A) space between the original and synthetic speech using different synthesis and/or prosody manipulation methods implemented in the Czech text-to-speech system. The GMM models for continual 2D detection of P-A classes are trained using the sound/speech material from the databases without any relation to the original speech or the synthesized sentences. Preliminary and auxiliary analyses show a substantial influence of the number of mixtures, the number and type of the speech features used, the size of the processed speech material, as well as the type of the database used for the creation of the GMMs on the P-A classification process and on the final evaluation result. The main evaluation experiments confirm the functionality of the system developed. The objective evaluation results obtained are principally correlated with the subjective ratings of human evaluators; however, partial differences were indicated, so a subsequent detailed investigation must be performed.


Common brain activity features discretization for predicting perceived speech quality

Parmonangan 2023

Procedia Computer Science
