Interspeech 2018
DOI: 10.21437/interspeech.2018-2364

Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments

Abstract: Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high resource language, and Xitsonga, a low resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method…
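To make the second evaluation metric concrete, below is a minimal sketch (not the authors' code) of how same-different evaluation can be scored with ROC-AUC: cosine similarities between all pairs of fixed-length embeddings are used as scores for predicting whether a pair corresponds to the same word or n-gram type. Function and variable names here are illustrative assumptions.

```python
# Illustrative same-different evaluation with ROC-AUC.
# Assumes each segment has already been mapped to a fixed-length embedding.
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def same_different_auc(embeddings, labels):
    """embeddings: (N, D) array; labels: length-N sequence of word identities."""
    scores, targets = [], []
    for i, j in combinations(range(len(labels)), 2):
        a, b = embeddings[i], embeddings[j]
        # Cosine similarity as the "same type" score for this pair.
        scores.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        targets.append(int(labels[i] == labels[j]))
    return roc_auc_score(targets, scores)
```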

Cited by 36 publications (52 citation statements)
References 22 publications (26 reference statements)
“…This improvement is particularly notable since DTW is computationally more expensive: embedding comparisons with ENCDEC-CAE take about 0.5 minutes on a single CPU core while DTW takes more than 60 minutes parallelised over 20 cores. In previous studies where embeddings were reported to outperform DTW, either ground truth word segments [18] or higher-dimensional embeddings were used [24].…”
Section: Results
confidence: 99%
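For context on why DTW comparison is so much more expensive than comparing embeddings, the baseline aligns two variable-length frame sequences and uses the alignment cost as a dissimilarity score, so each pairwise comparison requires a dynamic-programming pass over the frames. The sketch below is an assumed, generic formulation; the exact frame distance and normalisation vary between papers.

```python
import numpy as np

def dtw_cost(x, y):
    """Illustrative DTW alignment cost between two (T, D) feature sequences,
    using Euclidean frame distances and a simple length normalisation."""
    n, m = len(x), len(y)
    # Frame-wise distance matrix, shape (n, m).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalised alignment cost
```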
“…As a baseline embedding method, we use downsampling by keeping 10 equally-spaced MFCC vectors from a segment with appropriate interpolation, giving a 130-dimensional embedding. This has proven a strong baseline in other work [24]. The same-different task can also be approached by using DTW alignment between test segments, where the alignment cost of the full sequences is used as a score for word discrimination.…”
Section: Experimental Setup and Evaluation
confidence: 99%
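The downsampling baseline described in this citation is simple enough to sketch: keep 10 equally spaced frames of a variable-length MFCC sequence (interpolating between frames as needed) and flatten them into a single vector, e.g. 10 × 13 = 130 dimensions. The code below is an illustrative reconstruction under those assumptions, not the cited implementation.

```python
import numpy as np

def downsample_embedding(mfccs, n_keep=10):
    """mfccs: (T, D) array of frame-level features (e.g. D = 13 MFCCs).
    Returns a fixed-length embedding of size n_keep * D."""
    T, D = mfccs.shape
    # Positions of n_keep equally spaced (possibly fractional) frame indices.
    positions = np.linspace(0, T - 1, n_keep)
    # Linearly interpolate each feature dimension at those positions.
    sampled = np.stack([np.interp(positions, np.arange(T), mfccs[:, d])
                        for d in range(D)], axis=1)
    return sampled.flatten()  # shape: (n_keep * D,), i.e. 130 for D = 13
```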
“…Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20]. The recent unsupervised RNN of [21], which we refer to as the correspondence autoencoder RNN (CAE-RNN), is trained on pairs of word-like segments found in an unsupervised way.…”
Section: Introduction
confidence: 99%
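To make the autoencoding-RNN family of methods referenced here more concrete, the following is a minimal PyTorch sketch of an encoder-decoder RNN that compresses a variable-length feature sequence into a fixed-size embedding and reconstructs a sequence from it; for the correspondence variant (CAE-RNN), the reconstruction target would be a paired word-like segment rather than the input itself. The architecture, sizes, and conditioning scheme are illustrative assumptions, not those of any cited paper.

```python
import torch
import torch.nn as nn

class EncDecRNN(nn.Module):
    """Illustrative encoder-decoder RNN for acoustic word embeddings."""
    def __init__(self, feat_dim=13, embed_dim=130, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)

    def embed(self, x):
        # x: (batch, T, feat_dim); use the final encoder state as the embedding.
        _, h = self.encoder(x)
        return self.to_embed(h[-1])          # (batch, embed_dim)

    def forward(self, x, target_len):
        z = self.embed(x)
        # Condition the decoder on the embedding at every output step
        # (one common choice; other conditioning schemes exist).
        dec_in = z.unsqueeze(1).repeat(1, target_len, 1)
        out, _ = self.decoder(dec_in)
        return self.to_feat(out)             # (batch, target_len, feat_dim)

# Training (plain autoencoder): minimise MSE between forward(x, T) and x.
# Correspondence autoencoder: the target is the paired segment instead of x.
```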
“…For autoencoder-based unsupervised acoustic word embeddings, Chung et al. [17] and Holzenberger et al. [18] used a recurrent autoencoder to learn embeddings without using word-pair information as supervision. However, both research efforts assumed that training utterances are already segmented into words while learning embeddings in an unsupervised way.…”
Section: Related Work
confidence: 99%