ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054202
Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network…
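The embeddings described in the abstract map a variable-length sequence of acoustic frames to a single fixed-dimensional vector. As a minimal illustration only, not the paper's exact architecture, the sketch below assumes MFCC input frames, a GRU encoder, and a linear projection; all layer sizes are placeholder choices.

```python
# Minimal sketch of an acoustic word embedding encoder (assumed setup,
# not necessarily the paper's exact model): a GRU consumes a
# variable-length segment of acoustic frames, and its final hidden state
# is projected to a fixed-dimensional embedding.
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden=256, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, frames):
        # frames: (batch, time, n_mfcc); `time` varies per segment.
        _, h = self.rnn(frames)      # h: (num_layers, batch, hidden)
        return self.proj(h[-1])      # (batch, embed_dim), fixed size

encoder = AcousticWordEncoder()
segment = torch.randn(1, 87, 13)     # one 87-frame speech segment
print(encoder(segment).shape)        # torch.Size([1, 128])
```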

Cited by 18 publications (19 citation statements). References 43 publications (61 reference statements).
“…Following previous work [24,25], we use a classification objective as our neural baseline (Fig. 1-a).…”
Section: Phone N-gram Detection Objective (mentioning)
confidence: 99%
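The classification objective mentioned in this excerpt supervises the embedding indirectly: a softmax layer over word types sits on top of the encoder during training and is discarded afterwards. A hedged sketch of one training step, reusing the AcousticWordEncoder from the sketch above; the vocabulary size, batch shapes, and optimizer are illustrative assumptions.

```python
# Sketch of a word-classification training objective for acoustic word
# embeddings (a common supervised baseline): cross-entropy over word
# labels drives the encoder; only the embedding is kept at test time.
import torch
import torch.nn as nn

vocab_size = 10000                           # assumed joint word vocabulary
classifier = nn.Linear(128, vocab_size)      # on top of the 128-d embedding
optim = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)

frames = torch.randn(32, 87, 13)             # batch of padded segments
word_ids = torch.randint(0, vocab_size, (32,))  # word-type labels

logits = classifier(encoder(frames))         # (32, vocab_size)
loss = nn.functional.cross_entropy(logits, word_ids)
optim.zero_grad()
loss.backward()
optim.step()
```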
“…However, there still exists a large performance gap between these unsupervised models and their supervised counterparts [11,26]. A recent alternative for obtaining AWEs on a zero-resource language is to use multilingual transfer learning [27-31]. The goal is to have the benefits of supervised learning by training a model on labelled data from multiple well-resourced languages, but to then apply the model to an unseen target zero-resource language without fine-tuning it, a form of transductive transfer learning [32].…”
Section: Introduction (mentioning)
confidence: 99%
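At inference time, the transfer recipe in this excerpt amounts to freezing the multilingual encoder and comparing embeddings on the unseen language, e.g. for query-by-example search. A hedged sketch, again assuming the encoder defined above, with random tensors standing in for real segments:

```python
# Sketch of zero-shot application to an unseen language: no fine-tuning,
# just embed segments with the frozen multilingual encoder and rank
# indexed segments by cosine similarity to a spoken query.
import torch
import torch.nn.functional as F

encoder.eval()                                     # frozen, no fine-tuning
with torch.no_grad():
    query = encoder(torch.randn(1, 60, 13))        # spoken query segment
    index = encoder(torch.randn(500, 60, 13))      # 500 indexed segments

scores = F.cosine_similarity(query, index)         # (500,) similarities
top5 = scores.topk(5).indices                      # best-matching segments
print(top5)
```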
“…While modern speech models are in principle able to learn implicit structures such as emotions without explicit labels, it is impossible to determine the cause for systematic error when they are not. Datasets that contain labelled specialised speech characteristics such as the Ryerson Database of Emotional Speech and Song (RAVDESS) [9] not only allow researchers to identify if their model is susceptible to structural misclassification through targeted probing, but also inspire new methods to capture and understand these implicit structures [5], which in turn leads to design improvements of general speech recognition models [8].…”
Section: Introduction (mentioning)
confidence: 99%