2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383594

Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation

Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an unseen zero-resource language. We consider how a recent contrastive learning loss can be used in both the purely unsup…
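As a rough illustration of the kind of contrastive objective the abstract refers to, the sketch below shows a generic NT-Xent-style loss over an anchor segment, a positive (another instance of the same word), and a set of negatives. The function name, temperature value, and exact formulation are assumptions for illustration, not the paper's definitive loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic NT-Xent-style contrastive loss over acoustic word embeddings.

    anchor:    (D,)   embedding of a word segment
    positive:  (D,)   embedding of another segment of the same word
    negatives: (N, D) embeddings of segments of different words
    """
    # Cosine similarities between the anchor and all candidates, scaled by temperature.
    pos_sim = F.cosine_similarity(anchor, positive, dim=0) / temperature
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / temperature
    # Softmax cross-entropy: the positive should score above every negative.
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])
    return -F.log_softmax(logits, dim=0)[0]
```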

Cited by 18 publications (18 citation statements)
References 52 publications
“…Word discrimination (word-disc) is the task of detecting whether two speech segments correspond to the same or different words [33] and is commonly used to evaluate acoustic word embeddings and other acoustic representations [34][35][36][37]. We follow a typical evaluation protocol, where we label a pair of segments as "same word" if the cosine similarity between their word-level representations is above some threshold, and measure performance via the average precision as the threshold is varied.…”
Section: Analysis Methods (mentioning, confidence: 99%)
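As a concrete sketch of this evaluation protocol (a minimal same-different implementation, not necessarily the exact scripts used in the cited work): score every pair of segments by cosine similarity and compute average precision against the same-word/different-word labels, which sweeps the decision threshold implicitly.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, word_labels):
    """Same-different word discrimination average precision.

    embeddings:  (N, D) array, one acoustic word embedding per segment
    word_labels: length-N sequence giving the word identity of each segment
    """
    # Normalise so that the dot product of two rows equals their cosine similarity.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores, targets = [], []
    for i, j in combinations(range(len(word_labels)), 2):
        scores.append(float(X[i] @ X[j]))                       # cosine similarity of the pair
        targets.append(int(word_labels[i] == word_labels[j]))   # 1 = same word, 0 = different
    # Average precision integrates precision over all similarity thresholds.
    return average_precision_score(targets, scores)
```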
“…We use the CONTRASTIVERNN AWE model of [33]. It performed the best of the model variants considered for multilingual transfer in [33]. The model consists of an encoder recurrent neural network (RNN) that produces fixed-dimensional representations from variable-length speech segments.…”
Section: Acoustic Word Embedding Model (mentioning, confidence: 99%)
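A minimal sketch of what such an encoder might look like (the layer sizes, the use of a GRU, and taking the final hidden state are assumptions for illustration; the cited CONTRASTIVERNN has its own specific architecture): a recurrent network reads a variable-length sequence of acoustic frames and its final hidden state is projected to a fixed-dimensional embedding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class RNNAcousticWordEncoder(nn.Module):
    """Encode variable-length speech segments into fixed-dimensional embeddings."""

    def __init__(self, n_feats=13, hidden_dim=256, embed_dim=128, num_layers=2):
        super().__init__()
        # Hypothetical sizes; not taken from the cited paper.
        self.rnn = nn.GRU(n_feats, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, frames, lengths):
        # frames: (B, T, n_feats) padded acoustic feature sequences; lengths: (B,) true lengths.
        packed = pack_padded_sequence(frames, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, hidden = self.rnn(packed)     # hidden: (num_layers, B, hidden_dim)
        return self.proj(hidden[-1])     # last layer's final state -> (B, embed_dim)
```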
“…Past works address this challenge with attempts to adapt the training vocabularies to new languages [78,79,80], convert the lexicons of other languages to the target language [81,51], or extend the vocabulary and fine-tune on the target language's data [82,83,84]. Another recent direction is pre-training the models to learn word-level acoustic representations [85,86,87]. Apart from our proposed ASR systems, any ASR system attempting to recognize language-independent (i.e., phonetic) units, such as the multilingual allophone approach proposed in [13], would be suitable for performing phonetic inventory discovery.…”
Section: Related Work (mentioning, confidence: 99%)