Interspeech 2021
DOI: 10.21437/interspeech.2021-1399

End-to-End Open Vocabulary Keyword Search

Cited by 7 publications (4 citation statements)
References 0 publications
“…The embedding neural network can be any architecture, although only an FSMN is used in this paper. Therefore, to further improve results, we will train more architectures, e.g. [12][13][14], for better performance. Besides this, the overlapping property may be usable for more physically grounded acoustic pattern analysis, and we will try to find a relation between vibrational normal modes and acoustic states.…”
Section: Discussion (mentioning)
confidence: 99%
“…More recently, ASR-free KWS methods have sought to eschew the ASR and its concomitant complexities [6][7][8][9][10][11][12]. Instead of relying on the output of an ASR system, a neural network is trained in an end-to-end (E2E) fashion to locate written queries in large spoken archives.…”
Section: Introduction (mentioning)
confidence: 99%
“…We design two-stream networks to reliably embed linguistic representations of speech and text sequences within a common latent space. Since the audio-text joint latent space places linguistically similar embeddings close to each other [7,8], it is possible to distinguish keywords from other speech inputs. Based on these representations, our proposed method decides whether the input speech contains a keyword or not, by using a cross-attention mechanism [9,10,11].…”
Section: Introduction (mentioning)
confidence: 99%
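The last citation statement describes a two-stream design: speech and text are embedded into a common latent space, and a cross-attention step decides whether the speech contains the keyword. A minimal NumPy sketch of that idea follows; the dimensions, the random linear "encoders", and the pooling-to-sigmoid step are all illustrative assumptions, not the cited authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper).
AUDIO_DIM, TEXT_DIM, LATENT_DIM = 40, 16, 32

# Two-stream "encoders": random linear projections standing in for
# trained networks that map each modality into a shared latent space.
W_audio = rng.standard_normal((AUDIO_DIM, LATENT_DIM)) * 0.1
W_text = rng.standard_normal((TEXT_DIM, LATENT_DIM)) * 0.1

def encode_audio(frames):           # frames: (T, AUDIO_DIM)
    return frames @ W_audio         # -> (T, LATENT_DIM)

def encode_text(tokens):            # tokens: (N, TEXT_DIM)
    return tokens @ W_text          # -> (N, LATENT_DIM)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def keyword_score(frames, tokens):
    """Cross-attention: query tokens attend over speech frames; the
    pooled query/context similarity is squashed into a probability."""
    A = encode_audio(frames)                        # (T, D) speech embeddings
    Q = encode_text(tokens)                         # (N, D) query embeddings
    attn = softmax(Q @ A.T / np.sqrt(LATENT_DIM))   # (N, T) attention weights
    context = attn @ A                              # (N, D) attended speech
    sim = np.sum(Q * context, axis=-1).mean()       # pooled match score
    return 1.0 / (1.0 + np.exp(-sim))               # sigmoid -> probability

frames = rng.standard_normal((100, AUDIO_DIM))  # 100 speech frames
tokens = rng.standard_normal((5, TEXT_DIM))     # 5-token written query
p = keyword_score(frames, tokens)
print(p)  # detection probability for this (speech, query) pair
```

Because both streams land in the same latent space, linguistically similar speech and text embeddings score high under the dot-product attention, which is what lets the detector distinguish the keyword from other speech input.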