Interspeech 2019
DOI: 10.21437/interspeech.2019-1873

wav2vec: Unsupervised Pre-Training for Speech Recognition

Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data…
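The abstract describes pre-training a convolutional encoder on raw audio with a noise-contrastive binary classification objective. The sketch below illustrates that kind of objective in PyTorch; the module names (TinyWav2VecLike, contrastive_loss), layer sizes, single prediction step, and in-batch negative sampling are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Minimal sketch of a noise-contrastive binary classification objective
# over a convolutional encoder's outputs. Illustrative only; not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWav2VecLike(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Encoder: raw waveform -> latent frames z (strided 1-D convolutions).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # Context network: latents z -> context vectors c.
        self.context = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Projection used to score future latents from context vectors.
        self.project = nn.Linear(dim, dim)

    def forward(self, wav):                          # wav: (batch, samples)
        z = self.encoder(wav.unsqueeze(1))           # (batch, dim, frames)
        c = self.context(z)                          # (batch, dim, frames)
        return z.transpose(1, 2), c.transpose(1, 2)  # (batch, frames, dim)

def contrastive_loss(z, c, proj, k=1, n_negatives=10):
    """Classify the true future latent against sampled distractors."""
    batch, frames, dim = z.shape
    c_t = proj(c[:, :frames - k, :])                 # predictions from context at t
    z_pos = z[:, k:, :]                              # true latents at t + k
    pos_logits = (c_t * z_pos).sum(-1)               # (batch, frames - k)
    loss = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))
    for _ in range(n_negatives):                     # negatives drawn from the same utterance
        idx = torch.randint(0, frames, (batch, frames - k))
        z_neg = torch.gather(z, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
        neg_logits = (c_t * z_neg).sum(-1)
        loss = loss + F.binary_cross_entropy_with_logits(
            neg_logits, torch.zeros_like(neg_logits)) / n_negatives
    return loss

model = TinyWav2VecLike()
wav = torch.randn(4, 16000)                          # four fake one-second clips
z, c = model(wav)
print(contrastive_loss(z, c, model.project))
```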

Cited by 811 publications (562 citation statements). References: 18 publications.
“…We show the benefits of this by pre-training our modified CPC on 360 hours of unlabelled data from Librispeech and match the performance of the supervised model. This result not only confirms the findings of [6] but it also shows that unsupervised pre-training can match supervised pre-training with enough data (see Supplementary Section S2 with the larger Libri-light dataset [29]). In a second experiment, we compare the quality of our pre-trained features against other unsupervised methods on the Zerospeech2017…”
Section: Cross-lingual Transfer Of Phoneme Features (supporting)
Confidence: 83%
“…However, CPC has the advantage of making no assumption about the nature or number of the training data samples. Recently, variants of CPC have been applied to monolingual ASR [6] and images [20].…”
Section: Unsupervised Learning Of Features (mentioning)
Confidence: 99%
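For context on the CPC objective the quoted passage refers to: CPC is usually trained with the InfoNCE loss, which scores the true future latent against a set of negatives with a softmax, whereas the abstract above describes a binary classification variant. A sketch of the standard InfoNCE formulation, assuming a step-specific transform W_k and a candidate set X containing one positive and several negatives:

```latex
% Sketch of the InfoNCE objective used by CPC for prediction step k.
% z_{t+k}: true future latent, c_t: context vector,
% W_k: step-specific transform, X: one positive plus sampled negatives.
\mathcal{L}_k = -\,\mathbb{E}\!\left[
  \log \frac{\exp\!\left(z_{t+k}^{\top} W_k\, c_t\right)}
            {\sum_{z_j \in X} \exp\!\left(z_j^{\top} W_k\, c_t\right)}
\right]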
“…In recent years, however, RBM-based pre-training has been largely abandoned, because direct supervised training of deep neural networks has improved due to new techniques such as better initialization [3], non-saturating activation functions [4], and better control of generalization [5]. However, very recent work has begun to reconsider the value of unsupervised pre-training, specifically in the context of representation learning on a large set of unlabeled data, for use in supervised training on a smaller set of labeled data [6,7,8].…”
Section: Introduction (mentioning)
Confidence: 99%