Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction

Wang, Weiran; Tang, Qingming; Livescu, Karen

doi:10.1109/icassp40776.2020.9053541

Cited by 84 publications

(77 citation statements)

References 22 publications


“…CPC2 is a modified version of the CPC architecture in [2,3]. The encoder architecture is unchanged (5 convolutional layers with kernel sizes [10,8,4,4,4], strides [5,4,2,2,2] and hidden dimension 256). We increase the depth of the auto-regressive network, which improves accuracy (see Supplementary S1). For the recurrent context network, we use a 2-layer LSTM, as a tradeoff between feature quality and training speed.…”

Section: The CPC2 Architecture (mentioning)

confidence: 99%
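
The kernel sizes and strides quoted in the statement above fix how much the encoder downsamples the waveform. The following is a minimal sketch of that arithmetic, not code from the cited papers. It assumes unpadded 1-D convolutions and a 16 kHz sample rate, neither of which is stated in the excerpt, and the function names are hypothetical.

```python
from math import prod

# Values quoted in the citation statement.
KERNELS = [10, 8, 4, 4, 4]
STRIDES = [5, 4, 2, 2, 2]


def downsampling_factor(strides):
    """Input samples consumed per output frame (product of the strides)."""
    return prod(strides)


def receptive_field(kernels, strides):
    """Input samples seen by one output frame of a stack of 1-D convolutions."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input-level steps
        jump *= s             # the step between neighbouring outputs grows by the stride
    return rf


def frame_shift_ms(strides, sample_rate_hz=16_000):
    """Time between consecutive output frames; the 16 kHz rate is an assumption."""
    return 1000 * downsampling_factor(strides) / sample_rate_hz


print(downsampling_factor(STRIDES))       # 160
print(receptive_field(KERNELS, STRIDES))  # 465
print(frame_shift_ms(STRIDES))            # 10.0
```

Under these assumptions the encoder emits one 256-dimensional frame every 160 samples (10 ms at 16 kHz), and each frame covers 465 input samples.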

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Kharitonov

Rivière

Synnaeve

et al. 2021

2021 IEEE Spoken Language Technology Workshop (SLT)


Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.

“…In this paper, we use a re-implementation of the CPC model [34], which we call CPC2. The encoder architecture is the same (5 convolutional layers with kernel sizes [10,8,4,4,4], strides [5,4,2,2,2] and hidden dimension 256). For the context network, we used a 2-layer LSTM, and for the prediction network, we used a multi-head transformer [35], each of the 12 heads predicting one future time slice.…”

Section: CPC Architecture (mentioning)

confidence: 99%

“…Unsupervised representation learning has been studied as a topic of its own [1,2,3], but recently gained attention as a pretraining method to obtain speech features that can be fine tuned for downstream application with little labelled data [4,5,6,7,8]. This opens up the prospect of constructing speech technology for low resource languages.…”

Section: Introduction (mentioning)

confidence: 99%

Towards Unsupervised Learning of Speech Features in the Wild

Rivière

Dupoux

2021

2021 IEEE Spoken Language Technology Workshop (SLT)


Recent work on unsupervised contrastive learning of speech representations has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments and the perplexity of a model trained on clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.


“…This approach is referred to as semi-supervised learning. More recently, self-supervised learning methods that transform the input signal to learn powerful representations have been receiving a lot of attention [10,11,12,13,14,15]. In contrast to semi-supervised learning, self-supervised learning aims to improve the seed model by exploiting unlabelled data before adaptation on supervised data.…”

Section: Introduction (mentioning)

confidence: 99%

“…Self-supervised training approaches can be broadly grouped into two classes: (1) auto-regressive models that try to predict the future representations conditional on the past inputs [13,11] and (2) bidirectional models that learn to predict masked parts of the input [10,15,12]. In [12,10], authors explore the adaptation of a bidirectional network for ASR.…”

Section: Introduction (mentioning)

confidence: 99%

Lattice-Free MMI Adaptation of Self-Supervised Pretrained Acoustic Models

Vyas

Madikeri

Bourlard

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that by fine-tuning with LFMMI, we consistently obtain relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.3% on Swahili (38h) and 4.4% on Tagalog (84h) compared to the baseline trained only with supervised data.
