Interspeech 2019
DOI: 10.21437/interspeech.2019-3232

VQVAE Unsupervised Unit Discovery and Multi-Scale Code2Spec Inverter for Zerospeech Challenge 2019

Abstract: We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize the speech with a target speaker's voice. Moreover, the system should also balance the discrimination score ABX, the bit-rate compression rate, and the naturalness and the intelligibility of the constructed voice. To tac…

Cited by 53 publications (55 citation statements) | References 26 publications
“…Since the speech utterances for the sentences are unavailable, we generated sentences with the Google text-to-speech API for all language pairs. Despite the lack of a natural speech dataset in this paper, the VQ-VAE and codebook inverter can still be applied and have shown great performance on multi-speaker natural speech [14,13]. Several papers [30,31,32] also show that performance improvements obtained on synthetic datasets can carry over to real datasets.…”
Section: Dataset
confidence: 97%
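The citation above mentions generating speech for text sentences with the Google text-to-speech API. A minimal sketch of how such a synthetic dataset could be produced is shown below, using the google-cloud-texttospeech client; the language code, voice selection, sample rate, and output paths are illustrative assumptions, not the cited authors' actual pipeline.

```python
# Minimal sketch: synthesize speech for a list of sentences with Google Cloud TTS.
# Assumes the `google-cloud-texttospeech` package is installed and credentials
# are configured; language code and output paths are illustrative only.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

def synthesize(sentence: str, out_path: str, language_code: str = "en-US") -> None:
    """Render one sentence to a 16 kHz WAV file."""
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=sentence),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

for i, sentence in enumerate(["an example sentence", "another example"]):
    synthesize(sentence, f"synthetic_{i:04d}.wav")
```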
“…Machines can obtain discrete segments directly by applying clustering algorithms such as K-means [12], [11], GMM [11], or DPGMM clustering [13], [1], [14] to the acoustic features. The DPGMM algorithm [28] remained the state-of-the-art approach in ZeroSpeech 2015 and 2017 [15], [16].…”
Section: B. DPGMM-RNN Model and Phoneme Categorization
confidence: 99%
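As a concrete illustration of the clustering route described in the citation above, the sketch below assigns each MFCC frame of an utterance to one of K discrete units with K-means; the audio file name, MFCC settings, and number of clusters are assumed values for illustration, not parameters from the cited works.

```python
# Minimal sketch: discover discrete "units" by K-means clustering of MFCC frames.
# The audio path, n_mfcc, and n_clusters are illustrative assumptions.
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # shape: (frames, 13)

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
unit_ids = kmeans.fit_predict(mfcc)                    # one discrete label per frame

# Collapse consecutive repeats into segment-level discrete units.
segments = [int(unit_ids[0])] + [int(u) for i, u in enumerate(unit_ids[1:], 1)
                                 if u != unit_ids[i - 1]]
print(segments[:20])
```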
“…However, low-dimensional continuous features are never as efficient as discrete features or discrete segments. The Vector Quantised-Variational AutoEncoder (VQ-VAE) can quantize speech acoustic features [11].…”
Section: Functional Load and Economical Principle
confidence: 99%
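To make the quantization step mentioned above concrete, the sketch below shows the core VQ-VAE operation of snapping encoder outputs to their nearest codebook entries, which yields one discrete unit index per frame. The codebook size, latent dimension, and frame count are assumptions, and the straight-through gradient and commitment loss of the full model are omitted.

```python
# Minimal sketch of VQ-VAE quantization: each encoder output frame is replaced
# by its nearest codebook vector, yielding a discrete unit index per frame.
# Codebook size, latent dimension, and frame count are illustrative assumptions.
import torch

num_codes, dim = 256, 64
codebook = torch.randn(num_codes, dim)   # a learned embedding table in a real model
z_e = torch.randn(100, dim)              # encoder outputs for 100 frames

# Squared Euclidean distance from every frame to every codebook entry.
dists = torch.cdist(z_e, codebook) ** 2
indices = dists.argmin(dim=1)            # discrete unit IDs (the "subword" codes)
z_q = codebook[indices]                  # quantized latents fed to the decoder

print(indices[:10].tolist())
```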
“…For the conventional Gaussian mixture model (GMM) approach, non-parallel VC can be adapted from a pretrained parallel VC in the model space using the maximum a posteriori (MAP) method [7,8] or as interpolation between multiple parallel models [9]. For recent neural network approaches, a non-parallel VC can be trained by directly using an intermediate linguistic representation extracted from an automatic speech recognition (ASR) model [10,11], or by indirectly encouraging the network to disentangle linguistic information from speaker characteristics using methods such as variational autoencoders (VAEs) [12], generative adversarial networks (GANs) [13,14], or other techniques [15]. For both parallel and non-parallel VC, the systems usually change the voice but are unable to change the duration of the utterance.…”
Section: Introduction
confidence: 99%
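As a sketch of the disentanglement idea mentioned in the citation above (not any specific system from the cited works), the fragment below conditions a decoder on a content representation plus a target-speaker embedding, so that the voice identity can be swapped at synthesis time; all module names, shapes, and dimensions are hypothetical.

```python
# Hypothetical sketch of speaker-conditioned decoding for non-parallel VC:
# a content encoding (ideally free of speaker information) is concatenated
# with a target-speaker embedding before decoding. Shapes are illustrative.
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    def __init__(self, content_dim=64, speaker_dim=32, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + speaker_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),   # e.g. one mel-spectrogram frame
        )

    def forward(self, content, speaker_emb):
        # Broadcast the utterance-level speaker embedding to every frame.
        speaker = speaker_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.net(torch.cat([content, speaker], dim=-1))

decoder = SpeakerConditionedDecoder()
content = torch.randn(2, 120, 64)        # (batch, frames, content_dim)
target_speaker = torch.randn(2, 32)      # embedding of the target voice
mel = decoder(content, target_speaker)   # (2, 120, 80)
```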