Deep convolutional acoustic word embeddings using word-pair side information

Kamper, Herman; Wang, Weiran; Livescu, Karen

doi:10.1109/icassp.2016.7472619

Cited by 148 publications (174 citation statements)

References 29 publications

Supporting

Mentioning: 170

Contrasting

“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20].…”

Section: Introduction (mentioning)

confidence: 99%

Multilingual Acoustic Word Embedding Models for Processing Zero-resource Languages

Kamper

Matusevych

Goldwater

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. We test these using a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, both models perform similarly and outperform unsupervised models trained on the zero-resource languages. With just a single training language, the second model works better, but performance depends more on the particular training-testing language pair.

Index Terms: acoustic word embeddings, multilingual models, zero-resource speech processing, query-by-example.

“…Before the experiment, we implemented the baseline multiview approach [13] and trained it with the model and dataset provided by the authors to verify the performance improvement compared to the single-view approaches [9,10]. Then we established our initial model parameters as the same with the retuned baseline model on the WSJ dataset.…”

Section: Methods (mentioning)

confidence: 99%

“…The first task is acoustic word discrimination, where we are given two word segments to determine whether they match or not. This task is equivalent to the objective of the single-view approach and has been used in prior papers [9,10,11,12,14,17]. We regard this task as our main evaluation task for training the proposed and baseline network architectures.…”

Section: Evaluation Tasks (mentioning)

confidence: 99%

Additional Shared Decoder on Siamese Multi-View Encoders for Learning Acoustic Word Embeddings

Jung

Lim

Goo

et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Acoustic word embeddings (fixed-dimensional vector representations of arbitrary-length words) have attracted increasing interest in query-by-example spoken term detection. Recently, on the fact that the orthography of text labels partly reflects the phonetic similarity between the words' pronunciation, a multi-view approach has been introduced that jointly learns acoustic and text embeddings. It showed that it is possible to learn discriminative embeddings by designing the objective which takes text labels as well as word segments. In this paper, we propose a network architecture that expands the multi-view approach by combining the Siamese multi-view encoders with a shared decoder network to maximize the effect of the relationship between acoustic and text embeddings in embedding space. Discriminatively trained with multi-view triplet loss and decoding loss, our proposed approach achieves better performance on acoustic word discrimination task with the WSJ dataset, resulting in 11.1% relative improvement in average precision. We also present experimental results on cross-view word discrimination and word level speech recognition tasks.

Index Terms: acoustic word embedding, query-by-example spoken term detection, multi-view learning, Siamese network, encoder-decoder

“…Moreover, we will also investigate the use of different types of acoustic embeddings, such as those derived from siamese networks [24], that try to preserve distance of words both semantically and in acoustic space.…”

Section: Discussion (mentioning)

confidence: 99%

Exploring the use of acoustic embeddings in neural machine translation

Deena

Madhyastha

et al. 2017

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text. In this paper, auxiliary features derived from accompanying audio, are investigated for NMT and are compared and combined with text-derived features. These acoustic embeddings can help resolve ambiguity in the translation, thus improving the output. The following features are experimented with: Latent Dirichlet Allocation (LDA) topic vectors and GMM subspace i-vectors derived from audio. These are contrasted against: skip-gram/Word2Vec features and LDA features derived from text. The results are encouraging and show that acoustic information does help with NMT, leading to an overall 3.3% relative improvement in BLEU scores.
