Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models

Kamper, Herman

doi:10.1109/icassp.2019.8683639

Cited by 60 publications

(112 citation statements)

References 34 publications

(63 reference statements)

Supporting

Mentioning

108

Contrasting


“…We used this data to tune the number of pairs for the CAE-RNN, the vocabulary size for the CLASSIFIERRNN and the number of training epochs. Other hyperparameters are set as in [21].…”

Section: Methods (mentioning)

confidence: 99%

“…We next consider the unsupervised correspondence autoencoder RNN (CAE-RNN) of [21]. Since we do not have access…”

Section: Unsupervised Monolingual Acoustic Embeddings (mentioning)

confidence: 99%

“…Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20]. The recent unsupervised RNN of [21], which we refer to as the correspondence autoencoder RNN (CAE-RNN), is trained on pairs of word-like segments found in an unsupervised way. Unfortunately, while unsupervised methods are useful in that they can be used in zero-resource settings, there is still a large performance gap compared to supervised methods [21].…”

Section: Introduction (mentioning)

confidence: 99%


Multilingual Acoustic Word Embedding Models for Processing Zero-resource Languages

Kamper, Matusevych, Goldwater, 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. We test these using a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, both models perform similarly and outperform unsupervised models trained on the zero-resource languages. With just a single training language, the second model works better, but performance depends more on the particular training-testing language pair.

Index Terms: acoustic word embeddings, multilingual models, zero-resource speech processing, query-by-example.
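The word discrimination task mentioned in this abstract can be illustrated with a minimal sketch: two variable-length segments are each mapped to a fixed-dimensional vector, and the pair is judged "same word" if the vectors are close. Everything named here is an assumption for illustration only: `mean_pool` is a trivial stand-in for a trained embedding model (it is not the classifier or correspondence autoencoder described above), and the toy features and `threshold` value are invented.

```python
import math

def mean_pool(frames):
    """Stand-in embedding: average the frames of a variable-length segment.

    NOTE: hypothetical placeholder, not the RNN models from the abstract.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def same_word(seg_a, seg_b, threshold=0.1):
    """Same-different decision: embed both segments, then threshold the distance."""
    return cosine_distance(mean_pool(seg_a), mean_pool(seg_b)) < threshold

# Toy 2-dimensional "feature frames"; the segments have different lengths.
seg_short = [[1.0, 0.0], [0.9, 0.1]]
seg_long = [[1.0, 0.1], [0.8, 0.0], [0.9, 0.1]]
seg_other = [[0.0, 1.0], [0.1, 0.9]]

print(same_word(seg_short, seg_long))
print(same_word(seg_short, seg_other))
```

In an actual evaluation the threshold would be swept over all segment pairs to obtain a precision-recall curve rather than fixed in advance.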


“…The first task is acoustic word discrimination, where we are given two word segments to determine whether they match or not. This task is equivalent to the objective of the single-view approach and has been used in prior papers [9,10,11,12,14,17]. We regard this task as our main evaluation task for training the proposed and baseline network architectures.…”

Section: Evaluation Tasks (mentioning)

confidence: 99%

Additional Shared Decoder on Siamese Multi-View Encoders for Learning Acoustic Word Embeddings

Jung, Lim, Goo, et al., 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


Acoustic word embeddings, fixed-dimensional vector representations of arbitrary-length words, have attracted increasing interest in query-by-example spoken term detection. Recently, based on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciation, a multi-view approach has been introduced that jointly learns acoustic and text embeddings. It showed that it is possible to learn discriminative embeddings by designing the objective which takes text labels as well as word segments. In this paper, we propose a network architecture that expands the multi-view approach by combining the Siamese multi-view encoders with a shared decoder network to maximize the effect of the relationship between acoustic and text embeddings in embedding space. Discriminatively trained with multi-view triplet loss and decoding loss, our proposed approach achieves better performance on acoustic word discrimination task with the WSJ dataset, resulting in 11.1% relative improvement in average precision. We also present experimental results on cross-view word discrimination and word level speech recognition tasks.

Index Terms: acoustic word embedding, query-by-example spoken term detection, multi-view learning, Siamese network, encoder-decoder
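As a rough illustration of the "multi-view triplet loss" named in this abstract, the sketch below shows one common form of a cross-view triplet margin loss: an acoustic embedding should be closer to the text embedding of its own word than to that of a different word, by at least a margin. The abstract does not give the exact objective, so the function name, the use of cosine distance, the margin value, and the toy vectors are all assumptions.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.4):
    """Hinge loss: zero once the positive is closer than the negative by `margin`.

    NOTE: generic triplet form, assumed here; not necessarily the paper's objective.
    """
    return max(0.0, margin
               + cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative))

# Toy embeddings (hypothetical values): one acoustic view, two text views.
acoustic = [1.0, 0.0]       # embedding of a spoken word segment
text_match = [0.9, 0.1]     # text embedding of the same word
text_mismatch = [0.0, 1.0]  # text embedding of a different word

satisfied = triplet_margin_loss(acoustic, text_match, text_mismatch)
violated = triplet_margin_loss(acoustic, text_mismatch, text_match)
print(satisfied, violated)
```

The first triplet already satisfies the margin, so it contributes no loss; the second, with the roles of matching and mismatching text swapped, is penalised.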


“…However, both research efforts assumed that training utterances are already segmented into words while learning embeddings in an unsupervised way. Kamper [19] solved this mismatch by using an unsupervised term discovery system to find sample same-word pairs. For evaluating acoustic word embeddings, Ghannay et al [20,21] proposed to evaluate the intrinsic performances of acoustic word embeddings by comparing embedding similarity with the orthographic and phonetic similarity of the original words.…”

Section: Related Work (mentioning)

confidence: 99%

Linguistically-Informed Training of Acoustic Word Embeddings for Low-Resource Languages

Yang, Hirschberg, 2019

Interspeech 2019


Acoustic word embeddings have been proven to be useful in query-by-example keyword search. Such embeddings are typically trained to distinguish the same word from a different word using exact orthographic representations; so, two different words will have dissimilar embeddings even if they are pronounced similarly or share the same stem. However, in real-world applications such as keyword search in low-resource languages, models are expected to find all derived and inflected forms for a certain keyword. In this paper, we address this mismatch by incorporating linguistic information when training neural acoustic word embeddings. We propose two linguistically-informed methods for training these embeddings, both of which, when we use metrics that consider non-exact matches, outperform state-of-the-art models on the Switchboard dataset. We also present results on Sinhala to show that models trained on English can be directly transferred to embed spoken words in a very different language with high accuracy.

