2013 IEEE Workshop on Automatic Speech Recognition and Understanding
DOI: 10.1109/asru.2013.6707765
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings

Cited by 118 publications (171 citation statements)
References 23 publications
“…Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20]. The recent unsupervised RNN of [21], which we refer to as the correspondence autoencoder RNN (CAE-RNN), is trained on pairs of word-like segments found in an unsupervised way.…”
Section: Introduction (mentioning)
confidence: 99%
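Both the supervised and unsupervised families of methods named in this excerpt hinge on an encoder that maps a variable-length sequence of acoustic frames to a single fixed-dimensional vector. A minimal sketch of such an RNN encoder follows (PyTorch assumed; the class name, dimensions, and MFCC-style input are illustrative, not taken from any of the cited papers):

```python
# Minimal sketch of an RNN acoustic word embedding encoder (PyTorch assumed).
# A variable-length frame sequence is encoded into one fixed-dimensional
# vector; all names and sizes here are illustrative.
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, frames, lengths):
        # frames: (batch, max_len, feat_dim); lengths: true segment lengths
        packed = nn.utils.rnn.pack_padded_sequence(
            frames, lengths, batch_first=True, enforce_sorted=False)
        _, h_n = self.rnn(packed)          # final hidden state: (1, batch, hidden_dim)
        return self.proj(h_n.squeeze(0))   # fixed-dimensional embedding

encoder = EncoderRNN()
x = torch.randn(4, 80, 39)                 # 4 padded segments, up to 80 frames
z = encoder(x, torch.tensor([80, 60, 75, 50]))
print(z.shape)                             # torch.Size([4, 128])
```

A classification, contrastive, or autoencoding loss would then be applied on top of these embeddings; in the correspondence-autoencoder (CAE-RNN) case, a decoder RNN conditioned on the embedding reconstructs the frames of the paired word-like segment rather than the input itself.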
“…First, features are extracted at the frame level [17,18,19]. Second, DTW is performed to compare the feature matrix of the templates and the test segment.…”
Section: DTW Baseline System (mentioning)
confidence: 99%
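For concreteness, the two-step baseline this excerpt describes, frame-level feature extraction followed by DTW alignment of the template and test feature matrices, can be sketched as follows (NumPy assumed; the function name and the Euclidean frame distance are illustrative choices):

```python
# Minimal DTW sketch (NumPy assumed): aligns the frame-level feature matrix
# of a template with that of a test segment and returns a length-normalised
# alignment cost, as in the DTW baseline described above.
import numpy as np

def dtw_cost(template, segment):
    # template: (T1, D) feature matrix; segment: (T2, D) feature matrix
    T1, T2 = len(template), len(segment)
    # Pairwise frame distances (Euclidean here; cosine is also common)
    dist = np.linalg.norm(template[:, None, :] - segment[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],       # insertion
                acc[i, j - 1],       # deletion
                acc[i - 1, j - 1])   # match
    return acc[T1, T2] / (T1 + T2)

print(dtw_cost(np.random.randn(50, 39), np.random.randn(60, 39)))
```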
“…This embeds audio segments of different length into a fixed-dimensional space, therefore vector distance can be used for similarity measurement. Our method only requires a forward pass computation of the neural network, followed by a vector distance computation, and therefore is more efficient than [15] where an LVCSR is involved and [17] where multiple DTW computations are necessary. It also requires less computation than [18,19] since vector distance is used instead of DTW.…”
Section: Introduction (mentioning)
confidence: 99%
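The efficiency argument in this excerpt amounts to replacing a DTW alignment, whose cost grows with the product of the two segment lengths, by one encoder forward pass per segment plus a cheap vector distance. A sketch, reusing the illustrative EncoderRNN defined in the first example above:

```python
# Embedding-based comparison: one forward pass per segment, then a cosine
# distance in O(embed_dim), independent of segment lengths, unlike DTW's
# O(T1 * T2) alignment. Reuses the illustrative `encoder` defined earlier.
import torch
import torch.nn.functional as F

with torch.no_grad():
    q = encoder(torch.randn(1, 70, 39), torch.tensor([70]))  # query segment
    r = encoder(torch.randn(1, 55, 39), torch.tensor([55]))  # reference segment
    distance = 1.0 - F.cosine_similarity(q, r).item()
print(distance)
```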