Interspeech 2018
DOI: 10.21437/interspeech.2018-2341

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

Abstract: In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the underlying spoken words, and are close to other vectors in the embedding space if their corresponding underlying spoken words are semantically similar. The proposed model can be viewed as a speech version of Word2Vec [1]. Its design is based on an RNN Encoder-Decoder framew…
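The core idea the abstract describes — an RNN encoder that maps a variable-length audio segment to a fixed-length embedding, with semantic similarity measured by vector proximity — can be sketched in miniature. This is an illustrative toy, not the authors' implementation: the weights, dimensions, and feature values below are invented for demonstration, and the real model is trained within an RNN Encoder-Decoder framework.

```python
import math

def rnn_encode(frames, W_in, W_rec, dim):
    # Simple tanh RNN: consume a variable-length sequence of acoustic
    # feature frames and return the final hidden state as a
    # fixed-length embedding of the whole segment.
    h = [0.0] * dim
    for x in frames:
        h = [math.tanh(sum(W_in[i][j] * x[j] for j in range(len(x)))
                       + sum(W_rec[i][k] * h[k] for k in range(dim)))
             for i in range(dim)]
    return h

def cosine(a, b):
    # Cosine similarity: how "close" two embeddings are in the space.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy setup: 2-dim input features, 3-dim embedding, fixed made-up weights.
W_in = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W_rec = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
seg_a = [[1.0, 0.0], [0.5, 0.5]]                # segment with 2 frames
seg_b = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.1]]    # segment with 3 frames
emb_a = rnn_encode(seg_a, W_in, W_rec, 3)
emb_b = rnn_encode(seg_b, W_in, W_rec, 3)
assert len(emb_a) == len(emb_b) == 3  # fixed length regardless of duration
```

Both segments, despite different durations, yield embeddings of the same dimensionality, which is what makes downstream comparison (e.g. cosine similarity between spoken-word embeddings) possible.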

Cited by 129 publications (112 citation statements)
References 33 publications
“…Active learning [25] could further select useful parts of the dataset (we have provided SNR data to facilitate this effort). Yet another approach might apply language modeling techniques directly on unlabelled audio to improve the representations before fine-tuning them [26,27].…”
Section: Results
confidence: 99%
“…Following previous works [2,3,4,5,6,7,8], we evaluate different features and representations on downstream tasks, including: phoneme classification, speaker recognition, and sentiment classification on spoken content. For a fair comparison, each downstream task uses an identical model architecture and hyperparameters despite different input features.…”
Section: Methods
confidence: 99%
“…Interestingly, such a representation of words allows a prediction of activity of large parts of the human cortex as recorded by fMRI during story listening (Huth et al., 2016). Crucially, this principle has recently also been applied to speech instead of text corpora (Kamper et al., 2017; Chung & Glass, 2018), where an essential and non-trivial early step is to perform a segmentation of the speech material into word-like units. Despite currently suffering from relatively high word error rates, such unsupervised "zero-resource" speech models are an important step towards unbiased hypotheses about human speech recognition.…”
Section: Syllable Level Processing In Superior Temporal Regions
confidence: 99%