CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks

Kaneko, Takuhiro; Kameoka, Hirokazu

doi:10.23919/eusipco.2018.8553236

Cited by 298 publications

(239 citation statements)

References 41 publications

Supporting

Mentioning

235

Contrasting

Order By: Relevance

“…A deep bidirectional long short-term memory based recurrent neural networks was proposed by [21] to improve the naturalness of voice conversion models. To overcome the need for parallel data in building conversion function for VC models [22] proposed a novel architecture that uses a cycle-consistent adversarial network.…”

Section: Vowels and Prosody Contribution In Neural Network Based Voicmentioning

confidence: 99%

Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

Agbolade¹

2020

EJERS

View full text Add to dashboard Cite

Abstract-This research presents a neural network based voice conversion (VC) model. While it is a known fact that voiced sounds and prosody are the most important component of the voice conversion framework, what is not known is their objective contributions particularly in a noisy and uncontrolled environment. This model uses a 2-layer feedforward neural network to map the Linear prediction analysis coefficients of a source speaker to the acoustic vector space of the target speaker with a view to objectively determine the contributions of the voiced, unvoiced and supra-segmental components of sounds to the voice conversion model. Results showed that vowels "a", "i", "o" have the most significant contribution in the conversion success. The voiceless sounds were also found to be most affected by the noisy training data. An average noise level of 40 dB above the noise floor were found to degrade the voice conversion success by 55.14 percent relative to the voiced sounds. The result also shows that for cross-gender voice conversion, prosody conversion is more significant in scenarios where a female is the target speaker.

show abstract

Section: Vowels and Prosody Contribution In Neural Network Based Voicmentioning

confidence: 99%

Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

Agbolade¹

2020

EJERS

View full text Add to dashboard Cite

show abstract

“…While these systems offer the advantage of being able to generate novel TTS voice samples given a few seconds of reference audio, the quality of TTS is inferior [25] compared to single-speaker TTS models. In our system, we employ another recent work [14] that uses a CycleGAN architecture to achieve good voice transfer between two human speakers with no loss in linguistic features. We train this model to perform a cross-language transfer of a synthetic TTS voice to a natural target speaker voice.…”

Section: Voice Transfer In Audiomentioning

confidence: 99%

“…As our TTS model only generates audio samples in a single voice, we personalize this voice to match the voice of different target speakers. As collecting parallel training data for the same speaker across languages is infeasible, we adopt the CycleGAN architecture [14] to work around this problem.…”

Section: Personalizing Speaker Voicementioning

confidence: 99%

“…For each speaker, we train a CycleGAN for about 50K iterations with a batch size of 16. The other hyperparameters are the same as used in Kaneko and Kameoka [14]. During inference, given a TTS generated audio sample, the model preserves the linguistic features and generates speech in the voice of the speaker it was trained on.…”

Section: Personalizing Speaker Voicementioning

confidence: 99%

“…We adapt state-of-the-art neural machine translation and textto-speech models [25,31] to work for Indian languages and generate translated speech in language L B . We also personalize the voice [14] 1 https://merchdope.com/youtube-stats/ generated by the TTS model to closely match the voice of the target speaker.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Towards Automatic Face-to-Face Translation

Prajwal

Mukhopadhyay

Philip

et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

View full text Add to dashboard Cite

Figure 1: In light of the increasing amount of audio-visual content in our digital communication, we examine the extent to which current translation systems handle the different modalities in such media. We extend the existing systems that can only provide textual transcripts or translated speech for talking face videos to also translate the visual modality i.e. lip and mouth movements. Consequently, our proposed pipeline produces fully translated talking face videos with corresponding lip synchronization. ABSTRACTIn light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Faceto-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we

show abstract