An embedded segmental K-means model for unsupervised segmentation and clustering of speech

Kamper, Herman; Livescu, Karen; Goldwater, Sharon

doi:10.1109/asru.2017.8269008

Cited by 74 publications

(83 citation statements)

References 39 publications

“…Speech audio is parametrised as D = 13 dimensional static Mel-frequency cepstral coefficients (MFCCs). We use an embedding dimensionality of M = 130 throughout, since downstream systems such as the segmentation and clustering system of [8] are constrained to embedding sizes of this order. All encoder-decoder models have 3 encoder and 3 decoder unidirectional RNN layers, each with 400 units.…”

Section: Methods (mentioning)

confidence: 99%

Multilingual Acoustic Word Embedding Models for Processing Zero-resource Languages

Kamper; Matusevych; Goldwater

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. We test these using a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, both models perform similarly and outperform unsupervised models trained on the zero-resource languages. With just a single training language, the second model works better, but performance depends more on the particular training-testing language pair.

Index Terms: Acoustic word embeddings, multilingual models, zero-resource speech processing, query-by-example.

“…For all models we use an embedding dimensionality of M = 130, to be directly comparable to the downsampling baseline. More importantly, although other studies consider higher-dimensional settings, downstream systems such as [14] are constrained to embedding sizes of this order. Neural network architectures were optimised on the English validation data.…”

Section: Experimental Setup and Evaluation (mentioning)

confidence: 99%

Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models

Kamper

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

We investigate unsupervised models that can map a variable-duration speech segment to a fixed-dimensional representation. In settings where unlabelled speech is the only available resource, such acoustic word embeddings can form the basis for "zero-resource" speech search, discovery and indexing systems. Most existing unsupervised embedding methods still use some supervision, such as word or phoneme boundaries. Here we propose the encoder-decoder correspondence autoencoder (ENCDEC-CAE), which, instead of true word segments, uses automatically discovered segments: an unsupervised term discovery system finds pairs of words of the same unknown type, and the ENCDEC-CAE is trained to reconstruct one word given the other as input. We compare it to a standard encoder-decoder autoencoder (AE), a variational AE with a prior over its latent embedding, and downsampling. ENCDEC-CAE outperforms its closest competitor by 29% relative in average precision on two languages in a word discrimination task.

Index Terms: Acoustic word embeddings, zero-resource speech processing, unsupervised learning, query-by-example.
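The citation statement for this paper fixes M = 130 so that the models are comparable to a downsampling baseline. One plausible reading, which is an assumption and not something stated on this page, is that the baseline keeps 10 equally spaced time samples of D = 13 MFCCs and flattens them (10 × 13 = 130). The sketch below illustrates that reading; the function name `downsample_embedding`, the choice of 10 samples, and the use of linear interpolation are all hypothetical, not the authors' implementation.

```python
import numpy as np


def downsample_embedding(features, n_samples=10):
    """Map a (T, D) feature sequence to a fixed vector of length n_samples * D.

    Assumption: n_samples=10 with D=13 MFCCs gives the M=130 quoted above.
    """
    features = np.asarray(features, dtype=float)
    n_frames, n_dims = features.shape

    # Equally spaced (possibly fractional) positions along the time axis.
    positions = np.linspace(0, n_frames - 1, n_samples)
    frame_index = np.arange(n_frames)

    # Linearly interpolate each feature dimension at those positions.
    sampled = np.stack(
        [np.interp(positions, frame_index, features[:, d]) for d in range(n_dims)],
        axis=1,
    )  # shape: (n_samples, n_dims)

    # Flatten in time-major order so the first D values are the first sample.
    return sampled.reshape(-1)


# Usage: a hypothetical 57-frame segment of 13-dimensional MFCCs.
segment = np.random.default_rng(0).normal(size=(57, 13))
embedding = downsample_embedding(segment)
print(embedding.shape)
```

Segments of any duration map to the same length, which is what lets a fixed-dimensional downstream system consume them; the learned encoder-decoder models in the abstract replace this fixed resampling with a trained mapping.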

“…The present paper has a strong connection to recent work on unsupervised speech processing, especially the Zerospeech 2015 (Versteegh et al, 2015) and 2017 (Dunbar et al, 2017) shared tasks. Participating systems (Badino et al, 2015; Renshaw et al, 2015; Agenbag and Niesler, 2015; Baljekar et al, 2015; Räsänen et al, 2015; Lyzinski et al, 2015; Zeghidour et al, 2016; Heck et al, 2016; Srivastava and Shrivastava, 2016; Kamper et al, 2017b; Yuan et al, 2017; Heck et al, 2017; Shibata et al, 2017; Ansari et al, 2017a,b) perform unsupervised ABX discrimination and/or spoken term discovery on the basis of unlabeled speech alone. The design and evaluation of these and related systems (Kamper et al, , 2017a; Elsner and Shain, 2017; Räsänen et al, 2018) are oriented toward word-level modeling.…”

Section: Unsupervised Speech Processing (mentioning)

confidence: 99%

Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders

Shain; Elsner

2019

Proceedings of the 2019 Conference of the North

In this paper, we deploy binary stochastic neural autoencoder networks as models of infant language learning in two typologically unrelated languages (Xitsonga and English). We show that the drive to model auditory percepts leads to latent clusters that partially align with theory-driven phonemic categories. We further evaluate the degree to which theory-driven phonological features are encoded in the latent bit patterns, finding that some (e.g. [±approximant]) are well represented by the network in both languages, while others (e.g. [±spread glottis]) are less so. Together, these findings suggest that many reliable cues to phonemic structure are immediately available to infants from bottom-up perceptual characteristics alone, but that these cues must eventually be supplemented by top-down lexical and phonotactic information to achieve adult-like phone discrimination. Our results also suggest differences in degree of perceptual availability between features, yielding testable predictions as to which features might depend more or less heavily on top-down cues during child language acquisition.
