Building a voice conversion (VC) system for a new target speaker typically requires a large amount of speech data from that speaker. This paper investigates a method to build a VC system for an arbitrary target speaker using a single given utterance, without any adaptation training process. Inspired by global style tokens (GSTs), which have recently been shown to be effective in controlling the style of synthetic speech, we propose the use of global speaker embeddings (GSEs) to control the conversion target of the VC system. Speaker-independent phonetic posteriorgrams (PPGs) are employed as the local condition input to a conditional WaveNet synthesizer for waveform generation of the target speaker. Meanwhile, spectrograms are extracted from the given utterance and fed into a reference encoder; the resulting reference embedding is then employed as the attention query over the GSEs to produce a speaker embedding, which serves as the global condition input to the WaveNet synthesizer, controlling the speaker identity of the generated waveform. In experiments, when compared with an adaptation-training-based any-to-any VC system, the proposed GSE-based VC approach performs equally well or better in both speech naturalness and speaker similarity, while offering considerably greater flexibility.
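The core speaker-embedding step described above (reference embedding as attention query over a bank of GSE tokens) can be sketched as follows. This is a minimal illustrative sketch: the dimensions, projection matrices, and the scaled dot-product attention form are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def speaker_embedding(ref_embedding, gse_tokens, w_q, w_k):
    """Attend over the GSE token bank, using the reference embedding as query.

    ref_embedding: (d_ref,)          output of the reference encoder
    gse_tokens:    (n_tokens, d_tok) learned global speaker embeddings
    w_q:           (d_ref, d_att)    query projection (illustrative)
    w_k:           (d_tok, d_att)    key projection (illustrative)
    Returns the (d_tok,) speaker embedding used as WaveNet's global condition.
    """
    q = ref_embedding @ w_q                # project query to attention space
    k = gse_tokens @ w_k                   # project GSE tokens to keys
    scores = k @ q / np.sqrt(q.shape[0])   # scaled dot-product scores
    weights = softmax(scores)              # one weight per GSE token
    return weights @ gse_tokens            # weighted sum of GSE tokens
```

Because the attention weights sum to one, the produced embedding always lies inside the convex hull of the learned GSE tokens, which is what lets a single unseen utterance select a plausible speaker identity without adaptation training.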
Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both the SER and TTS datasets. Then, we use the emotion labels predicted on the TTS dataset by the trained SER model to build an auxiliary SER task and jointly train it with the TTS model. Experimental results show that our proposed method can generate speech with the specified emotional expressiveness, with almost no degradation in speech quality.
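The pseudo-labeling and joint-training scheme described above can be sketched as follows. The function names, the argmax pseudo-labeling rule, and the `alpha` weighting of the auxiliary loss are illustrative assumptions; the paper's actual loss formulation may differ.

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the (pseudo-)label under predicted probabilities.
    return -np.log(probs[label] + 1e-9)

def pseudo_label(ser_logits):
    """Emotion label for an unlabeled TTS utterance, taken as the
    SER model's most probable class (an assumed labeling rule)."""
    return int(np.argmax(ser_logits))

def joint_loss(tts_loss, ser_probs, label, alpha=0.1):
    """TTS loss plus a weighted auxiliary SER loss on the pseudo-label.
    `alpha` is a hypothetical task-balancing weight."""
    return tts_loss + alpha * cross_entropy(ser_probs, label)
```

At training time, the frozen cross-domain SER model would first assign a `pseudo_label` to each TTS utterance; the TTS model is then optimized with `joint_loss`, so the auxiliary SER task pushes the generated speech toward the specified emotion while the main TTS loss preserves speech quality.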