Interspeech 2018
DOI: 10.21437/interspeech.2018-2341

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

Abstract: In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the underlying spoken words, and are close to other vectors in the embedding space if their corresponding underlying spoken words are semantically similar. The proposed model can be viewed as a speech version of Word2Vec [1]. Its design is based on an RNN Encoder-Decoder framew…
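The core idea the abstract describes — an RNN encoder that maps a variable-length audio segment to a fixed-length embedding, with semantic similarity measured by vector proximity — can be sketched in miniature. This is an illustrative toy, not the authors' implementation: the weights, dimensions, and feature values below are invented for demonstration, and the real model is trained within an RNN Encoder-Decoder framework.

```python
import math

def rnn_encode(frames, W_in, W_rec, dim):
    # Simple tanh RNN: consume a variable-length sequence of acoustic
    # feature frames and return the final hidden state as a
    # fixed-length embedding of the whole segment.
    h = [0.0] * dim
    for x in frames:
        h = [math.tanh(sum(W_in[i][j] * x[j] for j in range(len(x)))
                       + sum(W_rec[i][k] * h[k] for k in range(dim)))
             for i in range(dim)]
    return h

def cosine(a, b):
    # Cosine similarity: how "close" two embeddings are in the space.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy setup: 2-dim input features, 3-dim embedding, fixed made-up weights.
W_in = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
W_rec = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
seg_a = [[1.0, 0.0], [0.5, 0.5]]                # segment with 2 frames
seg_b = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.1]]    # segment with 3 frames
emb_a = rnn_encode(seg_a, W_in, W_rec, 3)
emb_b = rnn_encode(seg_b, W_in, W_rec, 3)
assert len(emb_a) == len(emb_b) == 3  # fixed length regardless of duration
```

Both segments, despite different durations, yield embeddings of the same dimensionality, which is what makes downstream comparison (e.g. cosine similarity between spoken-word embeddings) possible.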

Cited by 129 publications (112 citation statements)
References 33 publications
“…Active learning [25] could further select useful parts of the dataset (we have provided SNR data to facilitate this effort). Yet another approach might apply language modeling techniques directly on unlabelled audio to improve the representations before fine-tuning them [26,27].…”
Section: Results
confidence: 99%
“…Following previous works [2,3,4,5,6,7,8], we evaluate different features and representations on downstream tasks, including: phoneme classification, speaker recognition, and sentiment classification on spoken content. For a fair comparison, each downstream task uses an identical model architecture and hyperparameters despite different input features.…”
Section: Methods
confidence: 99%
“…Interestingly, such a representation of words allows a prediction of activity of large parts of the human cortex as recorded by fMRI during story listening (Huth et al., 2016). Crucially, this principle has recently also been applied to speech instead of text corpora (Kamper et al., 2017; Chung & Glass, 2018), where an essential and non-trivial early step is to perform a segmentation of the speech material into word-like units. Despite currently suffering from relatively high word error rates, such unsupervised "zero-resource" speech models are an important step towards unbiased hypotheses about human speech recognition.…”
Section: Syllable Level Processing In Superior Temporal Regions
confidence: 99%