The Speaker and Language Recognition Workshop (Odyssey 2018)
DOI: 10.21437/odyssey.2018-34
Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data

Abstract: Thanks to the growing availability of spoofing databases and rapid advances in using them, systems for detecting voice spoofing attacks are becoming more and more capable, and error rates close to zero are being reached for the ASVspoof2015 database. However, speech synthesis and voice conversion paradigms that are not considered in the ASVspoof2015 database are appearing. Such examples include direct waveform modelling and generative adversarial networks. We also need to investigate the feasibility of training…
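The near-zero error rates mentioned for ASVspoof2015 are conventionally reported as equal error rates (EER) over countermeasure scores. As a point of reference only, here is a minimal sketch of an EER computation, assuming higher scores mean "more likely genuine"; the compute_eer helper and the toy score distributions are illustrative, not anything from the paper.

import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """EER: the operating point where false-acceptance and false-rejection rates meet."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(genuine_scores), np.zeros_like(spoof_scores)])
    order = np.argsort(scores)            # sweep the threshold from low to high
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()                   # genuine trials rejected at/below threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoof trials accepted above it
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genuine = rng.normal(2.0, 1.0, 1000)   # toy scores: higher = more human-like
    spoof = rng.normal(-2.0, 1.0, 1000)
    print(f"EER = {100 * compute_eer(genuine, spoof):.2f}%")

A well-separated pair of score distributions like the toy one above is what "error rates close to zero" means in practice.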

Cited by 59 publications (37 citation statements)
References 21 publications (29 reference statements)
“…Other risks could include fabricating a 'digital clone' of someone using machine learning; recent warning examples are provided by the so-called deepfakes [26,27,28], realistic-appearing but fabricated or tampered videos portraying a targeted person created with the aid of deep learning (the interested reader is pointed to [29] for a detailed review of the potential societal, ethical and legal implications of deepfakes). In the specific context of speaker verification, [30] addressed voice cloning of a well-known celebrity (the former US president Barack Obama). Even though the result was essentially negative (the cloned voice samples were detectable as artificial using a spoofing countermeasure), machine learning, including voice cloning techniques, does not stand still.…”
Section: Attacks on Speaker Verification Systems with Found Data (mentioning)
confidence: 99%
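The "spoofing countermeasure" referred to in the statement above is typically a two-class classifier that scores genuine against artificial speech. The sketch below is not the detector used in [30]; it is a generic GMM log-likelihood-ratio countermeasure in the spirit of the ASVspoof baselines, with MFCCs standing in for the CQCC/LFCC features those baselines actually use, and with hypothetical file lists.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def features(path):
    # 20-dimensional MFCCs as a stand-in front end, shape (frames, 20)
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T

def train_cm(genuine_files, spoof_files, n_comp=64):
    # One GMM per class, trained on frames pooled over all training files
    gmm_gen = GaussianMixture(n_comp, covariance_type="diag").fit(
        np.vstack([features(f) for f in genuine_files]))
    gmm_spf = GaussianMixture(n_comp, covariance_type="diag").fit(
        np.vstack([features(f) for f in spoof_files]))
    return gmm_gen, gmm_spf

def cm_score(path, gmm_gen, gmm_spf):
    # Average per-frame log-likelihood ratio; > 0 leans genuine, < 0 leans spoofed
    x = features(path)
    return gmm_gen.score(x) - gmm_spf.score(x)

A cloned utterance that is "detectable as artificial" is one whose cm_score falls clearly on the spoofed side of the decision threshold.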
“…Recent research on end-to-end text-to-speech (TTS) [1,2,3,4,5,6] has achieved success in generating human-like, high-quality speech. Moreover, end-to-end TTS systems also demonstrate a powerful capability for cloning prosodic style or speaker characteristics [7,8,9,10,11]. However, training end-to-end TTS systems requires large quantities of text-audio paired data.…”
Section: Introduction (mentioning)
confidence: 99%
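The "large quantities of text-audio paired data" point is concrete: every training example couples a transcript with its waveform. Below is a minimal PyTorch sketch of such a paired dataset, assuming a hypothetical LJSpeech-style layout with a two-column metadata.csv of "utt_id|transcript" rows next to a wavs/ directory; none of these names come from the cited papers.

import csv
from pathlib import Path
import torch
import torchaudio
from torch.utils.data import Dataset

class PairedTTSDataset(Dataset):
    """Yields (character-ID tensor, waveform, sample rate) pairs for TTS training."""
    def __init__(self, root):
        root = Path(root)
        with open(root / "metadata.csv", newline="") as f:
            self.rows = [(root / "wavs" / f"{utt}.wav", text)
                         for utt, text in csv.reader(f, delimiter="|")]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        wav_path, text = self.rows[i]
        wav, sr = torchaudio.load(str(wav_path))
        # Character IDs as a toy text encoding; real systems use graphemes or phonemes
        ids = torch.tensor([ord(c) for c in text.lower()], dtype=torch.long)
        return ids, wav.squeeze(0), sr

The scarcity of such paired corpora for a target voice is precisely why this paper's "found data" setting matters: cloning must work from uncurated audio without clean transcripts.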
“…In line with the EU's recent General Data Protection Regulation (GDPR), which is intended to protect the privacy of its citizens, it is important to assess the risks associated with multimedia data in the public domain. A recent study [17] attempted voice cloning of a pre-defined celebrity target speaker based on found data. The cloned voice samples were, however, detectable as spoofed speech.…”
Section: Introduction (mentioning)
confidence: 99%