In conventional speech synthesis, building a voice typically requires large amounts of phonetically balanced speech data recorded in highly controlled studio environments. Although using such data is a straightforward route to high-quality synthesis, the number of voices available will always be limited because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data: data recorded under varying conditions and with varying microphones, data that are not perfectly clean, and/or data that lack phonetic balance. This allows us to consider building high-quality voices from "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this opens up the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate thousands of voices for HMM-based speech synthesis built from several popular ASR corpora, including the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases.
This paper presents a first prototype of a virtual Theremin instrument for accompanying film scenes with sound. The virtual Theremin is implemented as a hybrid web application. Sound control is achieved by capturing user gestures with a webcam and mapping them to the virtual Theremin's two parameters, pitch and volume. Different sound types can be selected. The application's underlying research is part of the multi-modal digital heritage project KOLLISIONEN, which aims to open up the private archive of the Russian filmmaker Sergej Eisenstein to a broader public in digital form. Eisenstein, a film theorist and pioneer of film montage, was particularly intrigued by the Theremin as an instrument for film sound design. The virtual Theremin presented here is therefore linked to a scene from Eisenstein's 1929 Soviet drama "The General Line," which was originally never set to music. In this first implementation, the application connects music interaction design with digital heritage in a modular, flexible, and playful way, using contemporary web technologies to enable easy operation and the greatest possible accessibility.
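The Theremin-style control described above — hand position mapped to pitch and volume — can be sketched as follows. This is a minimal illustration, not the paper's implementation (which runs in the browser): the coordinate conventions, pitch range, and function names here are all assumptions.

```python
# Hypothetical sketch: map normalized webcam hand coordinates in [0, 1]
# to Theremin-style control parameters. Assumes (illustratively) that one
# hand's horizontal position controls pitch and the other hand's vertical
# position controls volume; the pitch range C2-C7 is also an assumption.

PITCH_MIN_HZ = 65.41   # C2
PITCH_MAX_HZ = 2093.0  # C7

def hand_to_pitch(x: float) -> float:
    """Exponential mapping: equal hand movement spans equal musical intervals."""
    x = min(max(x, 0.0), 1.0)
    return PITCH_MIN_HZ * (PITCH_MAX_HZ / PITCH_MIN_HZ) ** x

def hand_to_volume(y: float) -> float:
    """Linear gain clamped to [0, 1]: hand at the bottom is silent."""
    return min(max(y, 0.0), 1.0)
```

An exponential (rather than linear) pitch mapping is the usual choice for Theremin-like interfaces, since pitch perception is logarithmic in frequency.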
In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for whom the output synthetic speech sounds worse than that of other speakers, despite having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as the mel-cepstral distance from the average voice grows, MOS naturalness scores generally worsen. Although this negative correlation is not especially strong, it suggests a way to improve training and adaptation strategies. We also compare our findings with the work of other researchers regarding "vocal attractiveness."
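The distance measure underlying the reported correlation can be sketched as follows, assuming the standard mel-cepstral distortion (MCD) formulation; whether the paper uses exactly this variant (e.g., which coefficients are included) is an assumption, and the frame values below are purely illustrative.

```python
import math

# Hedged sketch of mel-cepstral distortion (MCD) in dB between two
# mel-cepstral frames. Conventionally the energy coefficient c0 is
# excluded before calling this; that convention is an assumption here.

def mcd(frame_a, frame_b):
    """MCD (dB) = (10 / ln 10) * sqrt(2 * sum of squared coefficient differences)."""
    sq = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)
```

Averaging this per-frame distortion over the frames of an utterance (or of a speaker's adapted model versus the average voice) yields the kind of per-speaker distance that could then be correlated against MOS scores.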