Interspeech 2018
DOI: 10.21437/interspeech.2018-2364

Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments

Abstract: Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high resource language, and Xitsonga, a low resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method…
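To make the second evaluation metric concrete, below is a minimal sketch (not the authors' code) of how same-different evaluation can be scored with ROC-AUC: cosine similarities between all pairs of fixed-length embeddings are used as scores for predicting whether a pair corresponds to the same word or n-gram type. Function and variable names here are illustrative assumptions.

```python
# Illustrative same-different evaluation with ROC-AUC.
# Assumes each segment has already been mapped to a fixed-length embedding.
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def same_different_auc(embeddings, labels):
    """embeddings: (N, D) array; labels: length-N sequence of word identities."""
    scores, targets = [], []
    for i, j in combinations(range(len(labels)), 2):
        a, b = embeddings[i], embeddings[j]
        # Cosine similarity as the "same type" score for this pair.
        scores.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        targets.append(int(labels[i] == labels[j]))
    return roc_auc_score(targets, scores)
```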

Cited by 36 publications (52 citation statements)
References 22 publications (26 reference statements)
“…This improvement is particularly notable since DTW is computationally more expensive: embedding comparisons with ENCDEC-CAE take about 0.5 minutes on a single CPU core while DTW takes more than 60 minutes parallelised over 20 cores. In previous studies where embeddings were reported to outperform DTW, either ground truth word segments [18] or higher-dimensional embeddings were used [24].…”
Section: Results
confidence: 99%
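For context on why DTW comparison is so much more expensive than comparing embeddings, the baseline aligns two variable-length frame sequences and uses the alignment cost as a dissimilarity score, so each pairwise comparison requires a dynamic-programming pass over the frames. The sketch below is an assumed, generic formulation; the exact frame distance and normalisation vary between papers.

```python
import numpy as np

def dtw_cost(x, y):
    """Illustrative DTW alignment cost between two (T, D) feature sequences,
    using Euclidean frame distances and a simple length normalisation."""
    n, m = len(x), len(y)
    # Frame-wise distance matrix, shape (n, m).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)  # length-normalised alignment cost
```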
“…As a baseline embedding method, we use downsampling by keeping 10 equally-spaced MFCC vectors from a segment with appropriate interpolation, giving a 130-dimensional embedding. This has proven a strong baseline in other work [24]. The same-different task can also be approached by using DTW alignment between test segments, where the alignment cost of the full sequences is used as a score for word discrimination.…”
Section: Experimental Setup and Evaluation
confidence: 99%
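The downsampling baseline described in this citation is simple enough to sketch: keep 10 equally spaced frames of a variable-length MFCC sequence (interpolating between frames as needed) and flatten them into a single vector, e.g. 10 × 13 = 130 dimensions. The code below is an illustrative reconstruction under those assumptions, not the cited implementation.

```python
import numpy as np

def downsample_embedding(mfccs, n_keep=10):
    """mfccs: (T, D) array of frame-level features (e.g. D = 13 MFCCs).
    Returns a fixed-length embedding of size n_keep * D."""
    T, D = mfccs.shape
    # Positions of n_keep equally spaced (possibly fractional) frame indices.
    positions = np.linspace(0, T - 1, n_keep)
    # Linearly interpolate each feature dimension at those positions.
    sampled = np.stack([np.interp(positions, np.arange(T), mfccs[:, d])
                        for d in range(D)], axis=1)
    return sampled.flatten()  # shape: (n_keep * D,), i.e. 130 for D = 13
```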
“…Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20]. The recent unsupervised RNN of [21], which we refer to as the correspondence autoencoder RNN (CAE-RNN), is trained on pairs of word-like segments found in an unsupervised way.…”
Section: Introduction
confidence: 99%
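To make the autoencoding-RNN family of methods referenced here more concrete, the following is a minimal PyTorch sketch of an encoder-decoder RNN that compresses a variable-length feature sequence into a fixed-size embedding and reconstructs a sequence from it; for the correspondence variant (CAE-RNN), the reconstruction target would be a paired word-like segment rather than the input itself. The architecture, sizes, and conditioning scheme are illustrative assumptions, not those of any cited paper.

```python
import torch
import torch.nn as nn

class EncDecRNN(nn.Module):
    """Illustrative encoder-decoder RNN for acoustic word embeddings."""
    def __init__(self, feat_dim=13, embed_dim=130, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)

    def embed(self, x):
        # x: (batch, T, feat_dim); use the final encoder state as the embedding.
        _, h = self.encoder(x)
        return self.to_embed(h[-1])          # (batch, embed_dim)

    def forward(self, x, target_len):
        z = self.embed(x)
        # Condition the decoder on the embedding at every output step
        # (one common choice; other conditioning schemes exist).
        dec_in = z.unsqueeze(1).repeat(1, target_len, 1)
        out, _ = self.decoder(dec_in)
        return self.to_feat(out)             # (batch, target_len, feat_dim)

# Training (plain autoencoder): minimise MSE between forward(x, T) and x.
# Correspondence autoencoder: the target is the paired segment instead of x.
```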
“…For autoencoder-based unsupervised acoustic word embeddings, Chung et al. [17] and Holzenberger et al. [18] used a recurrent autoencoder to learn embeddings without using word-pair information as supervision. However, both research efforts assumed that training utterances are already segmented into words while learning embeddings in an unsupervised way.…”
Section: Related Work
confidence: 99%