ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053541
Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction

Abstract: We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results. We find that the main factors that lead to speech recognition i…
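To make the pre-training objective concrete, the following is a minimal PyTorch sketch of masked reconstruction: spans of a log-mel spectrogram are masked in time and frequency, a bidirectional encoder reconstructs the input, and the loss is computed only on the masked cells. The encoder depth, layer sizes, mask widths, and choice of L1 loss here are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    """A small bidirectional LSTM encoder that reconstructs its input features."""
    def __init__(self, n_mels=80, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=3,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)  # map back to feature dimension

    def forward(self, x):                  # x: (batch, time, n_mels)
        h, _ = self.rnn(x)
        return self.proj(h)

def mask_spans(x, time_width=20, freq_width=10):
    """Zero out one random time span and one random frequency span per utterance."""
    x = x.clone()
    mask = torch.zeros_like(x, dtype=torch.bool)
    batch, time, freq = x.shape
    for b in range(batch):
        t0 = torch.randint(0, max(time - time_width, 1), (1,)).item()
        f0 = torch.randint(0, max(freq - freq_width, 1), (1,)).item()
        mask[b, t0:t0 + time_width, :] = True
        mask[b, :, f0:f0 + freq_width] = True
    x[mask] = 0.0
    return x, mask

encoder = BiEncoder()
feats = torch.randn(4, 200, 80)            # a batch of log-mel spectrograms
corrupted, mask = mask_spans(feats)
recon = encoder(corrupted)
loss = (recon - feats)[mask].abs().mean()  # reconstruction loss on masked cells only
loss.backward()

Because the encoder is bidirectional rather than causal, the same network can afterwards be fine-tuned directly with a supervised ASR objective, as described in the abstract.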

Cited by 84 publications (77 citation statements) | References 22 publications
“…CPC2 is a modified version of the CPC architecture in [2,3]. The encoder architecture is unchanged (5 convolutional layers with kernel sizes [10,8,4,4,4], strides [5,4,2,2,2] and hidden dimension 256). We increase the depth of the auto-regressive network, which improves accuracy (see Supplementary S1). For the recurrent context network, we use a 2-layer LSTM as a tradeoff between feature quality and training speed.…”
Section: The CPC2 Architecture
confidence: 99%
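As a rough illustration of the architecture quoted above, the sketch below builds the 5-layer strided convolutional encoder with the stated kernel sizes, strides, and hidden dimension, followed by a 2-layer LSTM context network. Activation functions, normalization, and input handling are assumptions not specified in the quoted statement.

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """5-layer strided 1-D conv encoder over raw waveform, hidden dimension 256."""
    def __init__(self, hidden=256,
                 kernels=(10, 8, 4, 4, 4), strides=(5, 4, 2, 2, 2)):
        super().__init__()
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, hidden, k, stride=s), nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                # wav: (batch, samples)
        z = self.net(wav.unsqueeze(1))     # (batch, hidden, frames)
        return z.transpose(1, 2)           # (batch, frames, hidden)

class ContextNet(nn.Module):
    """2-layer LSTM auto-regressive context network over encoder outputs."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)

    def forward(self, z):
        c, _ = self.rnn(z)
        return c

z = ConvEncoder()(torch.randn(2, 16000))   # one second of 16 kHz audio per item
c = ContextNet()(z)                        # context vectors, same frame rate as z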
“…In this paper, we use a re-implementation of the CPC model [34], which we call CPC2. The encoder architecture is the same (5 convolutional layers with kernel sizes [10,8,4,4,4], strides [5,4,2,2,2] and hidden dimension 256); for the context network, we used a 2-layer LSTM, and for the prediction network, we used a multi-head transformer [35], with each of the 12 heads predicting one future time slice.…”
Section: CPC Architecture
confidence: 99%
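The prediction step can be sketched as follows. This is a simplified stand-in: the quoted work uses a multi-head transformer with 12 heads, whereas here each future offset gets an independent linear head scored against future encoder outputs with an InfoNCE-style contrastive loss. Both the linear heads and the loss details are assumptions made for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

def cpc_loss(c, z, heads, n_steps=12):
    """c, z: (batch, frames, dim) context and encoder outputs; heads: one layer per future step."""
    total = 0.0
    B, T, D = z.shape
    for k in range(1, n_steps + 1):
        pred = heads[k - 1](c[:, :T - k])          # predictions for frame t+k from context at t
        target = z[:, k:]                          # true encoder outputs k steps ahead
        # score each prediction against all candidate frames in the same utterance
        logits = torch.einsum("btd,bsd->bts", pred, target)
        labels = torch.arange(T - k).expand(B, -1) # the positive is the aligned frame
        total = total + F.cross_entropy(logits.reshape(-1, T - k),
                                        labels.reshape(-1))
    return total / n_steps

dim = 256
heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(12)])
c = torch.randn(2, 50, dim)   # context network outputs
z = torch.randn(2, 50, dim)   # encoder outputs
loss = cpc_loss(c, z, heads)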
“…Unsupervised representation learning has been studied as a topic of its own [1,2,3], but recently gained attention as a pre-training method to obtain speech features that can be fine-tuned for downstream applications with little labelled data [4,5,6,7,8]. This opens up the prospect of constructing speech technology for low-resource languages.…”
Section: Introduction
confidence: 99%
“…This approach is referred to as semi-supervised learning. More recently, self-supervised learning methods that transform the input signal to learn powerful representations have been receiving a lot of attention [10,11,12,13,14,15]. In contrast to semi-supervised learning, self-supervised learning aims to improve the seed model by exploiting unlabelled data before adaptation on supervised data.…”
Section: Introduction
confidence: 99%
“…Self-supervised training approaches can be broadly grouped into two classes: (1) auto-regressive models that try to predict future representations conditional on the past inputs [13,11] and (2) bidirectional models that learn to predict masked parts of the input [10,15,12]. In [12,10], the authors explore the adaptation of a bidirectional network for ASR.…”
Section: Introduction
confidence: 99%