2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462002
Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection

Abstract: While Word2Vec represents words (in text) as vectors carrying semantic information, audio Word2Vec was shown to be able to represent signal segments of spoken words as vectors carrying phonetic structure information. Audio Word2Vec can be trained in an unsupervised way from an unlabeled corpus, except that word boundaries are needed. In this paper, we extend audio Word2Vec from word-level to utterance-level by proposing a new segmental audio Word2Vec, in which unsupervised spoken word boundary segmentation and …
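The core interface described in the abstract is an encoder that maps a variable-length acoustic feature sequence to one fixed-dimensional vector. A minimal sketch of that interface, with a mean-pooling stand-in for the trained RNN encoder (the projection matrix `proj` and the 13-dimensional MFCC input are assumptions for illustration, not the paper's actual model):

```python
import numpy as np

def embed_segment(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map a variable-length feature sequence (T x d) to a fixed vector.

    Stand-in for a trained encoder: mean-pool over time, then apply a
    linear projection. The actual audio Word2Vec model learns an RNN
    encoder jointly with a decoder that reconstructs the input sequence.
    """
    pooled = features.mean(axis=0)   # (d,) -- collapses the time axis
    return proj @ pooled             # (k,) fixed-dimensional embedding

rng = np.random.default_rng(0)
proj = rng.normal(size=(8, 13))      # hypothetical: 13-dim MFCC frames -> 8-dim embedding

seg_a = rng.normal(size=(50, 13))    # a 50-frame spoken-word segment
seg_b = rng.normal(size=(72, 13))    # a 72-frame segment

vec_a = embed_segment(seg_a, proj)
vec_b = embed_segment(seg_b, proj)
print(vec_a.shape, vec_b.shape)      # both (8,), regardless of segment length
```

The point of the sketch is only the shape contract: segments of different durations all land in the same fixed-dimensional space, which is what makes vector comparison across segments possible.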

Cited by 50 publications (46 citation statements) · References 13 publications
“…Term discovery systems were also used to provide training segments for a speech model in [22]. Although their systems are unsupervised, the intrinsic quality of embeddings was not their direct focus, as is also the case in [11,20]. Another study where ground truth segments are not used is [24], although true phoneme boundaries are assumed.…”
Section: Encoder-Decoder Correspondence Autoencoder
confidence: 99%
“…Our approach to semantic QbE is embedding-based: We learn an embedding function that maps from segments of speech (queries, search utterances, or sub-segments of search utterances) to fixed-dimensional vectors; we search for semantic matches by finding the minimum distance between query and search utterance embedding vectors. In this respect our approach is similar to those in recent embedding-based QbE work [14][15][16][17], and also some embedding-based spoken term detection work [18]. The key difference is that our embedding function must be learned in such a way that similar embedding vectors are semantically rather than phonetically similar.…”
Section: Introduction
confidence: 87%
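The embedding-based search step this citation describes (minimum distance between query and search-utterance embeddings) can be sketched as follows; the toy 3-dimensional vectors and the choice of cosine distance are assumptions for illustration:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; 0 means identical direction, 2 means opposite."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def best_match(query_vec, segment_vecs):
    """Return (index, distance) of the sub-segment embedding closest to the query."""
    dists = [cosine_distance(query_vec, s) for s in segment_vecs]
    i = int(np.argmin(dists))
    return i, dists[i]

query = np.array([1.0, 0.0, 0.0])            # embedded spoken query
segments = [np.array([0.0, 1.0, 0.0]),       # embedded sub-segments of a
            np.array([0.9, 0.1, 0.0]),       # search utterance
            np.array([-1.0, 0.0, 0.0])]

idx, d = best_match(query, segments)
print(idx)  # 1 -- the segment most aligned with the query
```

Whether a match counts as "semantic" or "phonetic" is entirely a property of how the embedding function was trained; the retrieval step itself is the same nearest-neighbor search either way, which is the distinction the quoted passage draws.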
“…There are many approaches for segmenting utterances automatically. Automatic segmentation of spoken words has been successfully trained and reported previously [32], so the training audio corpus in the present work has been previously segmented into phonetic words. A word and its corresponding phonetics form a token.…”
Section: A. Stage 1: Phonetic Representation Acquisition
confidence: 99%
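The pairing this citation describes, a word joined with its phonetic realization to form a token, can be sketched as a simple data structure (the `Token` class and the example phone sequences are hypothetical, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A word paired with its corresponding phonetic sequence."""
    word: str
    phones: list[str]

# A hypothetical pre-segmented utterance, as produced by an automatic
# word-boundary segmenter applied to the training audio corpus.
tokens = [
    Token("hello", ["HH", "AH", "L", "OW"]),
    Token("world", ["W", "ER", "L", "D"]),
]
print(len(tokens))  # 2
```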