ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054438
Generative Pre-Training for Speech with Autoregressive Predictive Coding

Abstract: Learning meaningful and general representations from unannotated speech that are applicable to a wide range of tasks remains challenging. In this paper we propose to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations. We pre-train APC on large-scale unlabeled data and conduct transfer learning experiments on three speech applications that require different …
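To make the pre-training objective concrete, below is a minimal sketch of APC: an autoregressive model reads a log-mel spectrogram and is trained with an L1 loss to predict the frame a fixed number of steps ahead. The GRU encoder, feature dimensions, and prediction shift of 3 follow common APC configurations but are illustrative assumptions here, not the paper's exact setup.

import torch
import torch.nn as nn

class APC(nn.Module):
    """Autoregressive model that predicts future log-mel frames (sketch)."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) log-mel features
        h, _ = self.rnn(x)   # hidden states; these later serve as representations
        return self.proj(h)  # per-position predictions of future frames

def apc_loss(model, x, shift=3):
    # Each position t is trained to predict the input frame at t + shift.
    pred = model(x)[:, :-shift, :]
    target = x[:, shift:, :]
    return nn.functional.l1_loss(pred, target)

model = APC()
feats = torch.randn(4, 200, 80)  # stand-in for a batch of log-mel utterances
loss = apc_loss(model, feats)
loss.backward()

After pre-training, it is the RNN hidden states (not the frame predictions) that are transferred to downstream tasks.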

Cited by 147 publications (127 citation statements) | References 27 publications
“…Improvements of end-to-end AST were also proposed using weakly supervised data [21] or by adding a second attention mechanism [22]. While supervised pre-training for AST has been investigated (see for instance [16]), we are aware of only a single research group [5,7] that has investigated self-supervised pre-training for AST. However, their experiments were done in a high-resource setting, and AST, for which only marginal gains were shown, was investigated only among other tasks, without an in-depth analysis of the representations learnt.…”
Section: End-to-end Automatic Speech Translation
confidence: 99%
“…As shown in Figure 1, we extract either wav2vec features or filter-bank+pitch features (later denoted as fbanks) from the speech input. Depending on the experiments, mean and variance normalization (MVN) is optionally applied to the generated features. For wav2vec feature extraction, we either use an off-the-shelf … Data augmentation through speed perturbation is also applied with factors of 0.9, 1.0, and 1.1 to the training data.…”
Section: Speech Features and Data Augmentation
confidence: 99%
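The preprocessing this excerpt describes (per-utterance mean and variance normalization and three-way speed perturbation) can be sketched as follows. The function names, the torchaudio/sox resampling route, and the file path are assumptions for illustration, not the cited paper's actual pipeline.

import torch
import torchaudio

def mvn(features: torch.Tensor) -> torch.Tensor:
    # features: (time, feat_dim); normalize each dimension to zero mean, unit variance.
    mean = features.mean(dim=0, keepdim=True)
    std = features.std(dim=0, keepdim=True).clamp_min(1e-8)
    return (features - mean) / std

def speed_perturb(wave: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    # Resample-based speed perturbation, as in common Kaldi/ESPnet recipes.
    effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(wave, sample_rate, effects)
    return out

wave, sr = torchaudio.load("utt.wav")  # hypothetical input file
augmented = [speed_perturb(wave, sr, f) for f in (0.9, 1.0, 1.1)]  # triples the training data

Speed perturbation with these three factors is the standard Kaldi-style recipe: it resamples the waveform, effectively tripling the amount of training audio.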