ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054438
Generative Pre-Training for Speech with Autoregressive Predictive Coding

Abstract: Learning meaningful and general representations from unannotated speech that are applicable to a wide range of tasks remains challenging. In this paper we propose to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations. We pre-train APC on large-scale unlabeled data and conduct transfer learning experiments on three speech applications that require different …
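To make the pre-training objective concrete, below is a minimal sketch of APC: an autoregressive model reads a log-mel spectrogram and is trained with an L1 loss to predict the frame a fixed number of steps ahead. The GRU encoder, feature dimensions, and prediction shift of 3 follow common APC configurations but are illustrative assumptions here, not the paper's exact setup.

import torch
import torch.nn as nn

class APC(nn.Module):
    """Autoregressive model that predicts future log-mel frames (sketch)."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) log-mel features
        h, _ = self.rnn(x)   # hidden states; these later serve as representations
        return self.proj(h)  # per-position predictions of future frames

def apc_loss(model, x, shift=3):
    # Each position t is trained to predict the input frame at t + shift.
    pred = model(x)[:, :-shift, :]
    target = x[:, shift:, :]
    return nn.functional.l1_loss(pred, target)

model = APC()
feats = torch.randn(4, 200, 80)  # stand-in for a batch of log-mel utterances
loss = apc_loss(model, feats)
loss.backward()

After pre-training, it is the RNN hidden states (not the frame predictions) that are transferred to downstream tasks.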

Cited by 147 publications (127 citation statements) | References 27 publications
“…Improvements of end-to-end AST were also proposed using weakly supervised data [21] or by adding a second attention mechanism [22]. While supervised pre-training for AST has been investigated (see for instance [16]), we are aware of only a single research group [5,7] that has investigated self-supervised pre-training for AST. However, their experiments were done in a high-resource setting, and AST, for which only marginal gains were shown, was investigated only among other tasks, without an in-depth analysis of the representations learnt.…”
Section: End-to-end Automatic Speech Translation
confidence: 99%
“…As shown in Figure 1, we extract either wav2vec features or filter-bank+pitch features (later denoted as fbanks) from the speech input. Depending on the experiments, mean and variance normalization (MVN) is optionally applied to the generated features. For wav2vec feature extraction, we either use an off-the-shelf … Data augmentation through speed perturbation is also applied with factors of 0.9, 1.0, and 1.1 to the training data.…”
Section: Speech Features and Data Augmentation
confidence: 99%
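The preprocessing this excerpt describes (per-utterance mean and variance normalization and three-way speed perturbation) can be sketched as follows. The function names, the torchaudio/sox resampling route, and the file path are assumptions for illustration, not the cited paper's actual pipeline.

import torch
import torchaudio

def mvn(features: torch.Tensor) -> torch.Tensor:
    # features: (time, feat_dim); normalize each dimension to zero mean, unit variance.
    mean = features.mean(dim=0, keepdim=True)
    std = features.std(dim=0, keepdim=True).clamp_min(1e-8)
    return (features - mean) / std

def speed_perturb(wave: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    # Resample-based speed perturbation, as in common Kaldi/ESPnet recipes.
    effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(wave, sample_rate, effects)
    return out

wave, sr = torchaudio.load("utt.wav")  # hypothetical input file
augmented = [speed_perturb(wave, sr, f) for f in (0.9, 1.0, 1.1)]  # triples the training data

Speed perturbation with these three factors is the standard Kaldi-style recipe: it resamples the waveform, effectively tripling the amount of training audio.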