Interspeech 2015
DOI: 10.21437/interspeech.2015-642
Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study

Cited by 49 publications (42 citation statements)
References: 29 publications

“…To evaluate the accuracy of the speech translation, following the practice in [15], we pre-train an automatic speech recognition model (which achieves 85.62 BLEU points on our test set and is comparable with [15]) to generate the corresponding text of the translated speech, and then calculate the BLEU score [29] between the generated text and the reference text. We report case-insensitive BLEU computed with the moses tokenizer and multi-bleu.perl. Because the Fisher corpus has 4 English references in the test set, we report 4-reference BLEU for the Spanish-to-English setting, and single-reference BLEU for the English-to-Spanish setting.…”
Section: Discussion (mentioning)
confidence: 99%
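As a concrete illustration of the evaluation recipe quoted above, the sketch below scores hypotheses against multiple references with case-insensitive BLEU. It uses sacrebleu as a stand-in for the moses-tokenizer/multi-bleu.perl pipeline named in the quote, and the hypothesis and reference strings are invented placeholders, not data from the cited work.

```python
# Minimal multi-reference, case-insensitive BLEU sketch (sacrebleu stands in
# for the moses tokenizer + multi-bleu.perl pipeline mentioned in the quote).
import sacrebleu

# Placeholder ASR outputs for the translated speech (one string per utterance).
hypotheses = [
    "i am going to the market tomorrow",
    "she said the meeting was cancelled",
]

# Placeholder reference sets: the Fisher Es->En test set has 4 references,
# so we pass 4 parallel reference streams; En->Es would use a single stream.
references = [
    ["I am going to the market tomorrow.", "She said the meeting was canceled."],
    ["I'm going to the market tomorrow.", "She said that the meeting was cancelled."],
    ["Tomorrow I am going to the market.", "She mentioned the meeting was cancelled."],
    ["I will go to the market tomorrow.", "She said the meeting had been cancelled."],
]

# lowercase=True gives case-insensitive BLEU, as in the quoted setup.
bleu = sacrebleu.corpus_bleu(hypotheses, references, lowercase=True)
print(f"BLEU = {bleu.score:.2f}")
```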
“…A variety of previous works [1,6,10,11,16,25,35,44] have investigated the conversion between speech and its corresponding phonetic categories (discrete tokens) in an unsupervised manner, which mimics the way human infants learn acoustic models in their mother tongue during their early years of life [39]. Among these works, the vector quantized variational autoencoder (VQ-VAE) [3,7,9,22,34–36] has been widely adopted and has shown advantages over other methods.…”
Section: Introduction (mentioning)
confidence: 99%
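To make the "discrete tokens" idea in the statement above concrete, here is a minimal numpy sketch of the vector-quantization step at the heart of a VQ-VAE: each frame-level encoder output is replaced by the index of its nearest codebook entry. The codebook and features are random placeholders; this is not the architecture of any specific cited work.

```python
# Nearest-codebook quantization: the core discretization step of a VQ-VAE.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(128, 64))   # 128 code vectors, 64-dim (placeholder)
features = rng.normal(size=(200, 64))   # 200 encoder output frames (placeholder)

# Squared Euclidean distance from every frame to every code vector.
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # one discrete token id per frame
quantized = codebook[tokens]            # quantized vectors passed to the decoder

print(tokens[:10])                      # e.g. the first 10 unit tokens
```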
“…Previously investigated approaches can be divided into two categories, namely bottom-up modeling and top-down modeling. In the bottom-up approach, speech is viewed as a sequence of low-level components, e.g., frames or segments, which can be grouped by clustering techniques to define higher-level structures [12]–[14]. The learned clusters are regarded as the basic units to represent the language concerned.…”
Section: A. Unsupervised Acoustic Modeling (mentioning)
confidence: 99%
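A toy version of the bottom-up route described above: cluster frame-level features and treat the cluster labels as candidate units. The sketch uses k-means from scikit-learn on random placeholder features purely to illustrate the idea; the cited works use more elaborate clustering than this.

```python
# Bottom-up toy example: cluster frame-level features into candidate units.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 39))   # placeholder 39-dim MFCC-like frames

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
unit_sequence = kmeans.labels_         # one cluster (pseudo-unit) label per frame
print(unit_sequence[:20])
```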
“…The top-performing systems for discovering speech representations in ZeroSpeech 2015 and 2017 are dominated by a Bayesian non-parametric approach that clusters speech features without supervision using a Dirichlet process Gaussian mixture model (DPGMM) [4,5]. However, the DPGMM is too sensitive to acoustic variations and often produces too many subword units and a relatively high-dimensional posteriorgram, which implies a high computational cost for learning and inference as well as a greater tendency to overfit [6].…”
Section: Introduction (mentioning)
confidence: 99%
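For reference, the snippet below shows one readily available way to fit a truncated Dirichlet process Gaussian mixture and extract the frame-level posteriorgram discussed in the statement above. It uses scikit-learn's variational BayesianGaussianMixture rather than the parallel sampler of the cited paper, and the features are random placeholders, so it illustrates only the interface, not the published system.

```python
# Truncated DPGMM via variational inference; the per-frame posterior
# responsibilities form the (potentially high-dimensional) posteriorgram.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 39))         # placeholder acoustic feature frames

dpgmm = BayesianGaussianMixture(
    n_components=100,                        # truncation level; many stay unused
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
    random_state=0,
).fit(frames)

posteriorgram = dpgmm.predict_proba(frames)  # shape: (n_frames, n_components)
active_units = (dpgmm.weights_ > 1e-3).sum() # effective number of discovered units
print(posteriorgram.shape, active_units)
```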