Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Saito, Yuki; Takamichi, Shinnosuke; Saruwatari, Hiroshi

doi:10.1109/taslp.2017.2761547

Cited by 195 publications

(87 citation statements)

References 39 publications

(61 reference statements)


“…The V2S attack follows the former and tries to reproduce the targeted speaker's voice from the ASV model. A similar idea was used in Saito et al.'s work [6] that incorporated a voice anti-spoofing (i.e., a discriminative model to detect spoofing attacks) into training of a VC model for reproducing fine structures of the synthesized voice.…”

Section: Discussion (mentioning)

confidence: 99%

V2S attack: building DNN-based voice conversion from automatic speaker verification

Nakamura, Saito, Takamichi et al. 2019

10th ISCA Workshop on Speech Synthesis (SSW 10)

Self Cite


This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems do not include the users' voice data. However, if the ASV system is unexpectedly exposed and hacked by a malicious attacker, there is a risk that the attacker will use VC techniques to reproduce the enrolled user's voices. We name this the "verification-to-synthesis (V2S) attack" and propose VC training with the ASV and pre-trained automatic speech recognition (ASR) models and without the targeted speaker's voice data. The VC model reproduces the targeted speaker's individuality by deceiving the ASV model and restores phonetic property of an input voice by matching phonetic posteriorgrams predicted by the ASR model. The experimental evaluation compares converted voices between the proposed method that does not use the targeted speaker's voice data and the standard VC that uses the data. The experimental results demonstrate that the proposed method performs comparably to the existing VC methods that trained using a very small amount of parallel voice data.
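The two training signals named in the abstract above (deceiving the ASV model, and matching phonetic posteriorgrams from the ASR model) can be illustrated with a toy objective. This is only a sketch under stated assumptions: the function name `v2s_loss`, the negative log-likelihood form of the deception term, the mean-squared-error form of the posteriorgram term, and the `weight` parameter are all hypothetical and are not taken from the cited paper.

```python
import math


def v2s_loss(asv_target_prob, ppg_source, ppg_converted, weight=1.0):
    """Toy combined objective (hypothetical; not the paper's exact loss).

    asv_target_prob: probability the ASV model assigns to the target speaker
                     for the converted voice (assumed to lie in (0, 1]).
    ppg_source / ppg_converted: phonetic posteriorgrams as lists of frames,
                     each frame a list of phone posteriors.
    """
    # Speaker term: small when the ASV model is "deceived" into accepting
    # the converted voice as the target speaker (assumed NLL form).
    deception = -math.log(asv_target_prob)

    # Phonetic term: mean squared error between posteriorgrams (assumed form).
    squared_errors = [
        (s - c) ** 2
        for frame_s, frame_c in zip(ppg_source, ppg_converted)
        for s, c in zip(frame_s, frame_c)
    ]
    ppg_match = sum(squared_errors) / len(squared_errors)

    return deception + weight * ppg_match


# Made-up numbers purely for illustration.
loss = v2s_loss(
    asv_target_prob=0.5,
    ppg_source=[[0.9, 0.1], [0.2, 0.8]],
    ppg_converted=[[0.8, 0.2], [0.2, 0.8]],
)
```

In a real system both terms would be computed by neural networks and minimized by gradient descent over the VC model's parameters; the sketch only shows how the two terms combine into one scalar.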


“…SS is now able to generate high-quality voice due to recent advances in unit selection [45], statistical parametric [46], hybrid [47], and DNN-based TTS methods. Recently, deep learning-based techniques, such as Generative Adversarial Network (GAN) [48], Tacotron [49], Wavenet [50], etc., are able to produce very natural sounding speech both in timbre and prosody. SS uses properties of a claimed speaker's voice characteristics and spectral cues of the natural speech.…”

Section: B) Synthetic Speech (mentioning)

confidence: 99%

Advances in anti-spoofing: from the perspective of ASVspoof challenges

Kamble, Sailor, Patil et al. 2020

SIP


In recent years, automatic speaker verification (ASV) is used extensively for voice biometrics. This leads to an increased interest to secure these voice biometric systems for real-world applications. The ASV systems are vulnerable to various kinds of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper provides the literature review of ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper also summarizes previous studies of spoofing attacks with emphasis on SS, VC, and replay along with recent efforts to develop countermeasures for spoof speech detection (SSD) task. The limitations and challenges of SSD task are also presented. While several countermeasures were reported in the literature, they are mostly validated on a particular database, furthermore, their performance is far from perfect. The security of voice biometrics systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.


“…In addition to the inverter, we also have a discriminator module. The discriminator predicts whether the given spectrogram is real data or is generated by the inverter, which generates a realistic spectrogram to deceive the discriminator [14,15,16]. The Code2Spec inverter has several training objectives:…”

Section: Vector Quantized Variational Autoencoder (VQ-VAE) (mentioning)

confidence: 99%

VQVAE Unsupervised Unit Discovery and Multi-Scale Code2Spec Inverter for Zerospeech Challenge 2019

et al. 2019


We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize the speech with a target speaker's voice. Moreover, the system should also balance the discrimination score ABX, the bit-rate compression rate, and the naturalness and the intelligibility of the constructed voice. To tackle these problems and achieve the best tradeoff, we utilize a vector quantized variational autoencoder (VQ-VAE) and a multi-scale codebook-to-spectrogram (Code2Spec) inverter trained by mean square error and adversarial loss. The VQ-VAE extracts the speech to a latent space, forces itself to map it into the nearest codebook and produces compressed representation. Next, the inverter generates a magnitude spectrogram to the target voice, given the codebook vectors from VQ-VAE. In our experiments, we also investigated several other clustering algorithms, including K-Means and GMM, and compared them with the VQ-VAE result on ABX scores and bit rates. Our proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.
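The "map it into the nearest codebook" step mentioned in the abstract above can be illustrated with a nearest-neighbour lookup. This is a minimal sketch, not the submitted system: the function name `quantize`, the use of squared Euclidean distance, and the tiny two-dimensional codebook are assumptions chosen for illustration.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of its nearest codebook vector.

    Distance is squared Euclidean (an assumption for this sketch).
    """
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # Index of the smallest distance; ties resolve to the first entry.
        indices.append(distances.index(min(distances)))
    return indices


# Hypothetical 2-D latent frames and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3]]

indices = quantize(frames, codebook)        # discrete unit sequence
quantized = [codebook[i] for i in indices]  # vectors passed on to an inverter
```

The index sequence is the compressed, discrete representation; the looked-up vectors are what a downstream inverter would receive in place of the continuous latents.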
