Interspeech 2022
DOI: 10.21437/interspeech.2022-530

Autoregressive Co-Training for Learning Discrete Speech Representation

Cited by 5 publications (7 citation statements)
References 0 publications
“…To study the representations learned by our neural HMMs, we evaluate them on phone classification on Wall Street Journal (WSJ) and phone segmentation on TIMIT. We follow the same setting described in prior work [6,13,5], using LibriSpeech train-clean-360 for pre-training. Phone classification on WSJ is trained on 90% of si284, and evaluated on dev93.…”
Section: Methods (mentioning, confidence: 99%)
“…Phone classification on WSJ is trained on 90% of si284, and evaluated on dev93. Following [5], we also evaluate phone cluster purity on WSJ si284. For phone segmentation on TIMIT, we do not follow the setting in other studies [24,22].…”
Section: Methods (mentioning, confidence: 99%)