2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
DOI: 10.1109/asru46091.2019.9003752
Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Abstract: State-of-the-art solutions to query-by-example spoken term detection (QbE-STD) usually rely on bottleneck feature representations of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study of QbE-STD performance using several monolingual as well as multilingual bottleneck features extracted from feed-forward networks. Then, we propose to employ residual networks (ResNet) to estimate the bottleneck features and show significant improvements over the co…
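The abstract describes DTW-based template matching between a query and an audio document represented as feature sequences. As a minimal illustrative sketch (not the paper's implementation, which uses subsequence DTW over bottleneck features), whole-sequence DTW over vector sequences can be written as:

```python
import math

def dtw_distance(query, doc):
    """Classic dynamic time warping between two sequences of
    equal-dimension feature vectors (e.g. per-frame bottleneck features).

    Illustrative sketch only: real QbE-STD systems typically use
    subsequence DTW so the query may match anywhere in the document.
    """
    n, m = len(query), len(doc)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning query[:i] with doc[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], doc[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # query frame repeated
                                 cost[i][j - 1],      # document frame skipped
                                 cost[i - 1][j - 1])  # frames matched
    return cost[n][m]
```

A low cumulative cost indicates that the query template aligns well with the document, which is the detection criterion in template-matching QbE-STD.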

Cited by 14 publications (7 citation statements). References 26 publications.
“…Some examples of this approach using different models and units are [26][27][28][80][81][82][83][84][85][86][87][88]. More recently, this approach has been extended with automatic unit discovery [89][90][91] and deep neural networks (DNNs) for extracting bottleneck features [92,93].…”
Section: Query-by-example Spoken Term Detection
confidence: 99%
“…As mentioned before, the triplet loss function is designed to learn embeddings that lie close together in the embedding space for similar samples and far apart for dissimilar ones. Unlike previous studies [10], [24], where embeddings were given as fixed-length multidimensional vectors, our embedding model f_θ outputs a sequence of embeddings whose length corresponds to the length of the input sequence x. We exploit the property of soft-DTW that allows effective calculation of the distance between time series of different lengths; therefore, the network is designed to generate embedding outputs of varying lengths.…”
Section: Proposed Learning Objective
confidence: 99%
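The statement above relies on soft-DTW, which replaces the hard minimum in the DTW recursion with a differentiable soft-minimum so the distance can be used as a training loss for variable-length embedding sequences. A minimal sketch for scalar sequences, assuming squared-difference local costs and the standard soft-min formulation:

```python
import math

def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))).
    Shifted by the true minimum for numerical stability."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(a, b, gamma=0.1):
    """Soft-DTW between two scalar sequences of possibly different lengths.

    Illustrative sketch: the cited work applies this to sequences of
    embedding vectors produced by a trained network, not raw scalars.
    As gamma -> 0 the value approaches the ordinary DTW cost.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2  # squared local distance
            R[i][j] = d + softmin([R[i - 1][j],
                                   R[i][j - 1],
                                   R[i - 1][j - 1]], gamma)
    return R[n][m]
```

Because every operation is smooth in the inputs, gradients can flow through the alignment cost back into the embedding network, which is what makes it usable with a triplet loss.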
“…In the first setup, the term is given in a phonetic or grapheme form, and in the second setup, the term is given in an acoustic form. The second setup is referred to as "query-by-example" (QbE) [1,2,3].…”
Section: Introduction
confidence: 99%
“…More recent works use various forms of deep neural networks for the QbE problem. Some approaches define an audio-embedding space such that if the term was uttered in the input audio example, it would be projected to an embedding vector near the embedding of the audio example [3,7]. Specifically, in [7], they create a shared embedding space for both audio data and the grapheme representation of terms.…”
Section: Introduction
confidence: 99%
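The embedding-space approach described above reduces detection to a nearness test: a term is flagged if its embedding lies close to the embedding of an audio window. A hypothetical sketch of that final matching step, assuming embeddings are already produced by some trained network and using cosine similarity with an illustrative threshold (both the function names and the threshold value are assumptions, not from the cited papers):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detect(query_emb, window_embs, threshold=0.8):
    """Return indices of audio windows whose embedding is close to the
    query embedding. Threshold is illustrative; in practice it would be
    tuned on a development set."""
    return [i for i, e in enumerate(window_embs)
            if cosine_similarity(query_emb, e) >= threshold]
```

In the shared audio/grapheme space of [7], the same test would work with a query embedding computed from the written form of the term instead of an audio example.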