Interspeech 2017
DOI: 10.21437/interspeech.2017-1592

Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings

Abstract: Query-by-example search often uses dynamic time warping (DTW) for comparing queries and proposed matching segments. Recent work has shown that comparing speech segments by representing them as fixed-dimensional vectors (acoustic word embeddings) and measuring their vector distance (e.g., cosine distance) can discriminate between words more accurately than DTW-based approaches. We consider an approach to query-by-example search that embeds both the query and database segments according to a neural model, followed…
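To make the comparison in the abstract concrete, the sketch below contrasts the two scoring strategies: cosine distance between fixed-dimensional acoustic word embeddings versus a DTW alignment cost over frame-level features. This is a minimal illustration under assumed names, not the paper's implementation; `embed` stands in for a trained neural acoustic word embedding model, and the segments are assumed to be feature matrices such as MFCCs.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two fixed-dimensional embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def dtw_cost(x, y):
    """Classic DTW alignment cost between two frame sequences
    (feature matrices of shape [frames, dims]), normalised by path length."""
    n, m = len(x), len(y)
    # Pairwise frame distances (Euclidean here; per-frame cosine is also common).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a frame of x
                acc[i, j - 1],      # skip a frame of y
                acc[i - 1, j - 1],  # align the two frames
            )
    return acc[n, m] / (n + m)

# Hypothetical usage: `embed` is the (assumed) trained embedding model,
# `query_frames` and `segment_frames` are [frames, dims] feature matrices.
# score_embedding = cosine_distance(embed(query_frames), embed(segment_frames))
# score_dtw       = dtw_cost(query_frames, segment_frames)
```

The embedding route reduces each query-segment comparison to a single vector distance, whereas DTW is quadratic in the segment lengths, which is what makes embedding-based search attractive at scale.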

Cited by 67 publications (71 citation statements)
References 24 publications

“…However, current methods require large amounts of transcribed data, which are available only for a small fraction of the world's languages [1]. This has prompted work on speech models that, instead of using exact transcriptions, can learn from weaker forms of supervision, e.g., known word pairs [2], [3], translation text [4]-[6], or unordered word labels [7]. The motivation for much of this work is that, even when high-quality ASR is infeasible, it may still be possible to learn low-resource models for practical tasks like retrieval and keyword prediction.…”
Section: Introduction (mentioning)
confidence: 99%
“…Our approach to semantic QbE is embedding-based: We learn an embedding function that maps from segments of speech (queries, search utterances, or sub-segments of search utterances) to fixed-dimensional vectors; we search for semantic matches by finding the minimum distance between query and search utterance embedding vectors. In this respect our approach is similar to those in recent embedding-based QbE work [14]-[17], and also some embedding-based spoken term detection work [18]. The key difference is that our embedding function must be learned in such a way that similar embedding vectors are semantically rather than phonetically similar.…”
Section: Introduction (mentioning)
confidence: 87%
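As a concrete reading of the recipe in the excerpt above, the sketch below scores a search utterance by the minimum cosine distance between a query embedding and the embeddings of that utterance's sub-segments, then ranks utterances by this score. It assumes the embeddings have already been produced by some model; the function and variable names are illustrative, not taken from the cited papers.

```python
import numpy as np

def utterance_score(query_emb, subsegment_embs):
    """Minimum cosine distance between the query embedding and the
    embeddings of an utterance's sub-segments ([num_subsegments, dim])."""
    q = query_emb / np.linalg.norm(query_emb)
    s = subsegment_embs / np.linalg.norm(subsegment_embs, axis=1, keepdims=True)
    return float(np.min(1.0 - s @ q))

def rank_utterances(query_emb, subsegment_embs_by_utterance):
    """Rank utterance ids (best match first) by their minimum-distance score.
    `subsegment_embs_by_utterance` maps utterance id -> [num_subsegments, dim]."""
    scores = {uid: utterance_score(query_emb, embs)
              for uid, embs in subsegment_embs_by_utterance.items()}
    return sorted(scores, key=scores.get)
```

For the semantic variant described in the excerpt, the same search machinery applies; what changes is how the embedding function is trained, so that nearby vectors are semantically rather than phonetically similar.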
“…The audio comprises around 37 hours of active speech in total. 67 keyword types are selected randomly from transcriptions of the training portion of the corpus. (Footnote 1: While [14,16] use an approximate nearest neighbour search procedure, we use exhaustive search here.)…”
Section: Experimental Setup and Evaluation (mentioning)
confidence: 99%
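The footnote's contrast between exhaustive and approximate search can be made concrete with the brute-force sketch below, which computes every query-to-segment cosine distance over precomputed embeddings (the array names are hypothetical, not from the cited systems). With a large database, an approximate nearest-neighbour index would replace the full distance matrix at some cost in accuracy.

```python
import numpy as np

def exhaustive_search(query_embs, database_embs, k=10):
    """Brute-force retrieval: cosine distance from every query embedding
    ([num_queries, dim]) to every database segment embedding ([num_segments, dim]),
    returning the indices and distances of the k nearest segments per query."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    dists = 1.0 - q @ d.T                       # [num_queries, num_segments]
    nearest = np.argsort(dists, axis=1)[:, :k]  # k best segments per query
    return nearest, np.take_along_axis(dists, nearest, axis=1)
```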
“…From a scientific perspective, discrete representations could be useful in cognitive models to study phonetic category learning in human infants [15]-[18]. From a technology perspective, such features could be used in downstream speech applications requiring symbolic or sparse input, e.g., for faster retrieval in speech search systems [19,20]. Here we consider the downstream task of speech synthesis within the context of the ZRSC'19.…”
Section: Introduction (mentioning)
confidence: 99%