An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Şişman, Berrak; Yamagishi, Junichi; King, Simon; Li, Haizhou

doi:10.1109/taslp.2020.3038524

Cited by 207 publications

(105 citation statements)

References 224 publications

(294 reference statements)

“…We conduct objective evaluation to assess the performance of our proposed model. We calculate Mel-cepstral distortion (MCD) [7,4] to measure the spectral distortion between the converted and reference Mel-spectrum for two male and two female speakers for three emotion combinations.…”

Section: Objective Evaluation (mentioning)

confidence: 99%
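
The statement above names Mel-cepstral distortion (MCD) as its objective measure. A minimal sketch of the commonly used MCD definition follows; it is not taken from the citing paper. It assumes the two sequences are already time-aligned (for example by dynamic time warping) and that the energy coefficient c0 has been dropped by the caller. The function name and argument names are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Return the mean MCD in dB over time-aligned frames.

    Assumptions (not from the source): each frame is a sequence of
    mel-cepstral coefficients with c0 already excluded, and both
    inputs have the same number of frames and coefficients.
    """
    if len(ref_frames) != len(conv_frames):
        raise ValueError("inputs must have the same number of frames")
    if not ref_frames:
        raise ValueError("inputs must contain at least one frame")

    # Constant factor of the usual definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - c) ** 2 for r, c in zip(ref, conv))
        total += scale * math.sqrt(squared_error)

    # Average the per-frame distortion over all frames.
    return total / len(ref_frames)
```

Lower values indicate that the converted features lie closer to the reference; identical inputs give 0 dB.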

Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

Zhou

Şişman

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional style to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.

“…Emotional voice conversion and speech voice conversion [4] differs in many ways. Speech voice conversion aims to change the speaker identity, whereas emotional voice conversion focuses on the emotional state transfer.…”

Section: Introduction (mentioning)

confidence: 99%

Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset

Zhou

Şişman

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

“…In a recent VC review paper [48], it was shown that a sufficient amount of efforts has been dedicated to transferring knowledge from ASR and TTS to improving various aspects of VC, regardless of using a seq2seq model or not. The PPG-based methods [49]-[53] and the Parrotron system described in Section II-A facilitated nonparallel, any-to-one VC by utilizing ASR and TTS modules, respectively.…”

Section: Transfer Learning From ASR and TTS for VC (mentioning)

confidence: 99%

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

Huang

Hayashi

et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

“…Aging upward or downward, or changing the perception of one’s gender, can similarly allow people to influence how they are received by their counterparts. Using deepfakes, it is possible to change a person’s identifying attributes such as skin tone, hair color, gender (Lu, Tai, and Tang 2018), age (Antipov, Baccouche, and Dugelay 2017), accent, and speech pattern (Sisman et al 2020). Many of these deepfakes can already be generated on the fly, and it is only a matter of time before all such conversions are possible in real time.…”

Section: Deepfakes: Covert Changes In Audio‐visual Cues (mentioning)

confidence: 99%

Technology‐Driven Alteration of Nonverbal Cues and its Effects on Negotiation

Baten

Hoque

2021

Negotiation Journal

A person’s appearance, identity, and other nonverbal cues can substantially influence how one is perceived by a negotiation counterpart, potentially impacting the outcome of the negotiation. With recent advances in technology, it is now possible to alter such cues through real‐time video communication. In many cases, a person’s physical presence can explicitly be replaced by 2D/3D representations in live interactive media. In other cases, technologies such as deepfake can subtly and implicitly alter many nonverbal cues—including a person’s appearance and identity—in real time. In this article, we look at some state‐of‐the‐art technological advances that can enable such explicit and implicit alterations of nonverbal cues. We also discuss the implications of such technology for the negotiation landscape and highlight ethical considerations that warrant deep, ongoing attention from stakeholders.
