Interspeech 2019
DOI: 10.21437/interspeech.2019-1873

wav2vec: Unsupervised Pre-Training for Speech Recognition

Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data…
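The abstract describes pre-training a convolutional encoder on raw audio with a noise-contrastive binary classification objective. The sketch below illustrates that kind of objective in PyTorch; the module names (TinyWav2VecLike, contrastive_loss), layer sizes, single prediction step, and in-batch negative sampling are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Minimal sketch of a noise-contrastive binary classification objective
# over a convolutional encoder's outputs. Illustrative only; not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWav2VecLike(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Encoder: raw waveform -> latent frames z (strided 1-D convolutions).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # Context network: latents z -> context vectors c.
        self.context = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Projection used to score future latents from context vectors.
        self.project = nn.Linear(dim, dim)

    def forward(self, wav):                          # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))           # (batch, dim, frames)
        c = self.context(z)                          # (batch, dim, frames)
        return z.transpose(1, 2), c.transpose(1, 2)  # (batch, frames, dim)

def contrastive_loss(z, c, proj, k=1, n_negatives=10):
    """Classify the true future latent against sampled distractors."""
    batch, frames, dim = z.shape
    c_t = proj(c[:, :frames - k, :])                 # predictions from context at t
    z_pos = z[:, k:, :]                              # true latents at t + k
    pos_logits = (c_t * z_pos).sum(-1)               # (batch, frames - k)
    loss = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))
    for _ in range(n_negatives):                     # negatives drawn from the same utterance
        idx = torch.randint(0, frames, (batch, frames - k))
        z_neg = torch.gather(z, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
        neg_logits = (c_t * z_neg).sum(-1)
        loss = loss + F.binary_cross_entropy_with_logits(
            neg_logits, torch.zeros_like(neg_logits)) / n_negatives
    return loss

model = TinyWav2VecLike()
wav = torch.randn(4, 16000)                          # four fake one-second clips
z, c = model(wav)
print(contrastive_loss(z, c, model.project))
```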

Cited by 811 publications (562 citation statements). References: 18 publications.
“…We show the benefits of this by pre-training our modified CPC on 360 hours of unlabelled data from Librispeech and match the performance of the supervised model. This result not only confirms the findings of [6] but it also shows that unsupervised pre-training can match supervised pre-training with enough data (see Supplementary Section S2 with the larger Libri-light dataset [29]). In a second experiment, we compare the quality of our pre-trained features against other unsupervised methods on the Zerospeech2017…”
Section: Cross-lingual Transfer Of Phoneme Features (supporting)
Confidence: 83%
“…However, CPC has the advantage of making no assumption about the nature or number of the training data samples. Recently, variants of CPC have been applied to monolingual ASR [6] and images [20].…”
Section: Unsupervised Learning Of Features (mentioning)
Confidence: 99%
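For context on the CPC objective the quoted passage refers to: CPC is usually trained with the InfoNCE loss, which scores the true future latent against a set of negatives with a softmax, whereas the abstract above describes a binary classification variant. A sketch of the standard InfoNCE formulation, assuming a step-specific transform W_k and a candidate set X containing one positive and several negatives:

```latex
% Sketch of the InfoNCE objective used by CPC for prediction step k.
% z_{t+k}: true future latent, c_t: context vector,
% W_k: step-specific transform, X: one positive plus sampled negatives.
\mathcal{L}_k = -\,\mathbb{E}\!\left[
  \log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}
            {\sum_{z_j \in X} \exp\!\left(z_j^{\top} W_k\, c_t\right)}
\right]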
“…In recent years, however, RBM-based pre-training has been largely abandoned, because direct supervised training of deep neural networks has improved due to new techniques such as better initialization [3], non-saturating activation functions [4], and better control of generalization [5]. However, very recent work has begun to reconsider the value of unsupervised pre-training, specifically in the context of representation learning on a large set of unlabeled data, for use in supervised training on a smaller set of labeled data [6,7,8].…”
Section: Introduction (mentioning)
Confidence: 99%