ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683275

Semantic Query-by-example Speech Search Using Visual Grounding

Abstract: A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here we consider how such models can be used for query-by-example (QbE) search, the task of retrieving utterances relevant to a given spoken query. We are particularly interested in semantic QbE, where the task is not only to retrieve utterances containing exact insta…

Cited by 19 publications (11 citation statements)
References 36 publications (65 reference statements)
“…A number of approaches have been put forward for AWE-based QbE [1,40,41]. Here we use the simplified approach from [42]. Using an AWE model, we first embed the query segment.…”
Section: Query-by-example Speech Search (mentioning)
confidence: 99%
“…This makes the method effective as a semantic keyword spotter. This work has led to a number of follow-ups and extensions (e.g., Kamper & Roth, 2018; Kamper, Anastassiou, & Livescu, 2019; Kamper, Shakhnarovich, & Livescu, 2019; Olaleye, van Niekerk, & Kamper, 2020).…”
Section: Abstractpotting (mentioning)
confidence: 99%
“…This tagger has an output of 1000 image classes, but here we use a system vocabulary corresponding to W = 67 unique keyword types. This set of keywords is the same set used in [27,39], and includes words such as 'children', 'young', 'swimming', 'snowy' and 'riding'. The procedure used to select these keywords is detailed in [27]; it includes a human reviewer agreement step, which reduced the original set from 70 to 67 words.…”
Section: Data (mentioning)
confidence: 99%