SUMMARY: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. The new system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with that of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.
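The abstract does not define the RTF, so the following is only a minimal sketch under the common convention that RTF is the processing time divided by the duration of the processed audio, with values below 1 meaning faster than real time. The function name `real_time_factor` and the `process` callable are hypothetical and are not part of WORLD.

```python
import time


def real_time_factor(process, duration_s):
    """Return elapsed processing time divided by the audio duration.

    `process` is a hypothetical zero-argument callable that runs the
    analysis and synthesis for one utterance; `duration_s` is the
    length of that utterance in seconds.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / duration_s


def is_real_time(rtf):
    """Under the assumed convention, RTF < 1 means faster than real time."""
    return rtf < 1.0
```

For example, a system that needs 0.5 s to process a 2 s utterance would have an RTF of 0.25 under this convention.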
In this paper, we carried out a subjective evaluation of perceptual differences in female speech to show the gender difference in its likability, and analyzed the relationship between acoustic features and subjective scores. The evaluation used female speech uttered by 21 speakers as stimuli, and 127 subjects (47 males and 80 females) participated. The results suggested that some speech was preferred regardless of listener gender, while other speech was preferred by only one gender. We then analyzed the correlation between subjective scores and five acoustic features: fundamental frequency, formant frequency, amplitude difference, spectral centroid, and spectral tilt. For female subjects, statistically significant correlations were observed for all features. For male subjects, a significant correlation was observed only for spectral tilt. In particular, the correlation for spectral tilt showed opposite trends for male and female subjects. These results suggest that spectral tilt is an effective feature for explaining the gender difference.
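The abstract does not state which correlation coefficient was used. As an illustration only, the sketch below assumes Pearson's r between one acoustic feature and the mean subjective score per speaker. The function name and the toy numbers are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy values, invented for illustration (NOT the study's data):
# one spectral-tilt value per speaker and mean likability scores
# from two hypothetical listener groups with opposite trends.
tilt = [-12.0, -10.0, -8.0, -6.0]
scores_group_a = [2.0, 2.5, 3.1, 3.4]
scores_group_b = [3.6, 3.0, 2.7, 2.1]

r_a = pearson_r(tilt, scores_group_a)  # positive for this toy data
r_b = pearson_r(tilt, scores_group_b)  # negative for this toy data
```

Opposite signs of the two coefficients correspond to the kind of inverse trend the abstract reports for spectral tilt. A significance test would still be needed before drawing any conclusion.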
The sound quality of speech synthesized using modern speech synthesis systems is expected to approach that of human speech. We investigated the effect of temporal fluctuation of speech on the perception of humanness. Speech stimuli used in the evaluation were generated by voice morphing. The morphing source (morphing rate of 0%) and target (morphing rate of 100%) were speech without temporal fluctuation and original speech, respectively. There were three kinds of morphing factors: fundamental frequency (F0), spectral envelope (SP), and both (F0 + SP). We used MUSHRA as the evaluation method, with two speakers (one male and one female), two phonemes (/a/ and /i/), and two F0s (high and low). Nine morphing rates (every 25% from -100% to 100%) were used, and ten subjects with normal hearing participated in the evaluation. The results show that the stimuli with a morphing rate of 0% received the lowest humanness scores for all factors. The stimuli with a morphing rate of -100% scored significantly lower than those with a morphing rate of 100% for SP and F0 + SP. The most dominant factor was F0 + SP, and the effect of F0 was the smallest of all factors. [Work supported by JSPS KAKENHI Grant Numbers 15H02726, 16H05899, 16H01734.]
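The morphing rates described above can be read as interpolation and extrapolation between a fluctuation-free contour (0%) and the original contour (100%). The sketch below shows this for an F0 contour only. It assumes that the fluctuation-free source is the temporal mean of the contour and that morphing is linear in Hz. Neither assumption is stated in the abstract, and `morph_contour` is a hypothetical name.

```python
def morph_contour(original, rate):
    """Scale the fluctuation of a contour around its temporal mean.

    rate = 0.0 gives a flat contour (assumed fluctuation-free source),
    rate = 1.0 gives the original contour, and rate = -1.0 gives the
    original fluctuation with its sign inverted.
    """
    if not original:
        raise ValueError("contour must not be empty")
    mean = sum(original) / len(original)
    return [mean + rate * (value - mean) for value in original]


# Nine morphing rates: every 25% from -100% to 100%.
rates = [percent / 100 for percent in range(-100, 101, 25)]

# Toy F0 contour in Hz, invented for illustration.
f0 = [100.0, 120.0, 110.0]
stimuli = {rate: morph_contour(f0, rate) for rate in rates}
```

Under these assumptions, negative rates keep the size of the fluctuation but flip its direction, which is one way to interpret the -100% condition.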