2019
DOI: 10.48550/arxiv.1910.12729
Preprint

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Cited by 3 publications (11 citation statements: 0 supporting, 11 mentioning, 0 contrasting) | References 0 publications

“…To perform representation discretization, a learnable codebook E = (e_1, e_2, ..., e_V) of size V is maintained, where each e_i ∈ R^D is called a codeword. For an encoded frame-level representation sequence H, the closest codeword e_v is used as a substitute for each representation h_t; this operation is called phonetic clustering [21]. The gradient of this non-differentiable operation is approximated by the straight-through (ST) gradient estimator [22].…”
Section: Phonetic Encoder (mentioning)
confidence: 99%
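
The phonetic clustering described in this excerpt amounts to nearest-codeword vector quantization with a straight-through gradient. A minimal PyTorch sketch follows; the class name PhoneticQuantizer and the tensor shapes are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn


class PhoneticQuantizer(nn.Module):
    """Nearest-codeword quantization with a straight-through (ST) gradient.

    A hypothetical sketch of the excerpt's phonetic clustering step.
    """

    def __init__(self, num_codewords: int, dim: int):
        super().__init__()
        # Learnable codebook E = (e_1, ..., e_V), each codeword e_i in R^D.
        self.codebook = nn.Embedding(num_codewords, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, D) -- the frame-level representation sequence H.
        # Euclidean distance from every frame h_t to every codeword.
        dist = torch.cdist(h, self.codebook.weight.unsqueeze(0))
        indices = dist.argmin(dim=-1)   # index of the closest codeword
        e = self.codebook(indices)      # substitute e_v for each h_t
        # ST estimator: the forward pass outputs e, while the backward pass
        # copies the gradient of e onto h as if quantization were identity.
        return h + (e - h).detach()
```

Note that the ST pass alone gives the codebook itself no gradient; VQ-VAE-style setups usually add a codebook/commitment loss for that, though the excerpt does not say whether the cited work does.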
“…where the first term is the reconstruction loss of unpaired speech X_unpair, the second term is the CTC loss for Y_pair, the last term is the TTS loss for the target audio X_pair, and λ is fixed to 10 throughout the end-to-end training process. For more details, please refer to the prior work [21].…”
Section: Speech Synthesizer (mentioning)
confidence: 99%
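
The three-term objective quoted above (the equation itself is elided in the excerpt) might be combined as in the sketch below. The function name total_loss, the choice of L1 for the reconstruction and TTS terms, and the placement of the weight λ are all assumptions of this sketch, not details given in the excerpt.

```python
import torch
import torch.nn.functional as F


def total_loss(x_hat_unpair, x_unpair,               # reconstruction pair
               log_probs, y_pair, in_lens, tgt_lens, # CTC inputs
               tts_out, x_pair,                      # TTS pair
               lam: float = 10.0) -> torch.Tensor:
    # Reconstruction loss on unpaired speech X_unpair (L1 is an assumption).
    l_recon = F.l1_loss(x_hat_unpair, x_unpair)
    # CTC loss on paired transcriptions Y_pair; log_probs is (T, N, C).
    l_ctc = F.ctc_loss(log_probs, y_pair, in_lens, tgt_lens)
    # TTS loss against the paired target audio X_pair (L1 assumed again).
    l_tts = F.l1_loss(tts_out, x_pair)
    # The excerpt fixes lambda = 10 but does not say which term(s) it scales;
    # weighting the paired-data terms is an assumption of this sketch.
    return l_recon + lam * (l_ctc + l_tts)
```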