Interspeech 2020
DOI: 10.21437/interspeech.2020-1337

Semi-Supervised Learning with Data Augmentation for End-to-End ASR

Abstract: In this paper, we apply Semi-Supervised Learning (SSL) along with Data Augmentation (DA) for improving the accuracy of End-to-End ASR. We focus on the consistency regularization principle, which has been successfully applied to image classification tasks, and present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms. Specifically, we generate the pseudo labels for the unlabeled data on-the-fly with a seq2seq model after perturbing the input features with DA. We also propose so…
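As a rough illustration of the on-the-fly pseudo-labeling with augmentation that the abstract describes, the following PyTorch-flavored sketch shows one FixMatch-style training step for a seq2seq ASR model. The helper callables (weak_aug, strong_aug, greedy_decode, seq2seq_loss) and the confidence threshold are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def fixmatch_style_step(model, x_unlabeled, weak_aug, strong_aug,
                        greedy_decode, seq2seq_loss, conf_threshold=0.0):
    """One on-the-fly pseudo-labeling step in the FixMatch spirit.

    All helper callables and the threshold are illustrative assumptions.
    """
    # Pseudo-label the weakly augmented (or clean) features with the current
    # model; no gradients are needed for this decoding pass.
    with torch.no_grad():
        pseudo_tokens, confidence = greedy_decode(model, weak_aug(x_unlabeled))

    # Optionally discard low-confidence hypotheses before training on them.
    if confidence < conf_threshold:
        return None

    # Consistency regularization: the model must predict the pseudo-label
    # from a strongly augmented (e.g., SpecAugment-perturbed) view.
    return seq2seq_loss(model, strong_aug(x_unlabeled), pseudo_tokens)
```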

Cited by 22 publications (14 citation statements)
References 35 publications

Citation statements (ordered by relevance):
“…We focus on self-training [21] or pseudo-labeling (PL) [22], which has recently been adopted for semi-supervised E2E ASR and shown to be effective [23][24][25][26][27][28][29][30][31][32]. In PL, a teacher (base) model is first trained on labeled data and used to generate pseudo-labels for unlabeled data.…”
Section: Introduction (mentioning)
confidence: 99%
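The teacher/student pseudo-labeling procedure summarized in this statement can be outlined as below; train_asr and decode are hypothetical stand-ins for an E2E ASR training and decoding recipe, not functions from the cited work.

```python
def pseudo_label_pipeline(labeled, unlabeled, train_asr, decode):
    """Minimal self-training outline; train_asr and decode are assumed helpers."""
    # 1) Train the teacher (base) model on the labeled data only.
    teacher = train_asr(labeled)

    # 2) Transcribe the unlabeled audio with the teacher to obtain
    #    pseudo-labels, yielding a pseudo-parallel corpus.
    pseudo_parallel = [(x, decode(teacher, x)) for x in unlabeled]

    # 3) Train the student on the union of labeled and pseudo-labeled data.
    return train_asr(labeled + pseudo_parallel)
```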
“…[15,24]) and an encoder-decoder structure with attention (cf. [2,33]). The far-field ASR task is treated as a sequence-to-sequence learning problem: The model M is trained to predict a sequence of symbols y_j (here, we use sub-word units) from the multi-channel complex spectrum X ∈ ℂ^{T×F×C}, where T is the number of frames, F is the number of frequency bins, and C is the number of channels in an input utterance.…”
Section: End-to-End Multi-Channel ASR (mentioning)
confidence: 99%
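A short shape check, assuming PyTorch, of the multi-channel complex input described above; the frame, bin, and channel counts are arbitrary illustrations, and the reference-channel magnitude is only a placeholder front end, not the cited system's neural beamformer.

```python
import torch

T, F, C = 600, 257, 4                            # frames, freq bins, channels (illustrative)
X = torch.randn(T, F, C, dtype=torch.complex64)  # X in C^{T x F x C}

# Placeholder single-channel front end: magnitude of a reference channel.
ref_mag = X[..., 0].abs()                        # real tensor of shape (T, F)
assert ref_mag.shape == (600, 257)
```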
“…The decoder is composed of 2 LSTM layers with size 1024, and the dropout rates are set to 0.1 and 0.4 for the first and second layer, respectively. The training recipe is similar to [33]. SA is applied with F_max = 15, m_F = 2 in the ASR feature domain (80-dimensional log-Mel features).…”
Section: Test Data (mentioning)
confidence: 99%
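The quoted configuration can be sketched in PyTorch roughly as follows; the vocabulary size, embedding dimension, and the omission of the attention mechanism are assumptions made to keep the sketch short, and the frequency masking stands in for the SpecAugment settings (m_F = 2 masks of up to F_max = 15 bins on 80-dim log-Mel features).

```python
import torch
import torch.nn as nn
import torchaudio

class Decoder(nn.Module):
    """Two LSTM layers of size 1024 with dropout 0.1 and 0.4 (attention omitted)."""
    def __init__(self, vocab_size=5000, emb_dim=512, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.drop1 = nn.Dropout(0.1)   # after first LSTM layer
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop2 = nn.Dropout(0.4)   # after second LSTM layer
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm1(self.embed(tokens))
        h, _ = self.lstm2(self.drop1(h))
        return self.out(self.drop2(h))

# SpecAugment-style frequency masking on 80-dim log-Mel features.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
feats = torch.randn(1, 80, 300)   # (batch, mel bins, frames)
for _ in range(2):                # apply m_F = 2 masks
    feats = freq_mask(feats)
```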
“…A student model is then trained on the augmented training data including both labeled and pseudo-parallel data to obtain a model that is expected to generalize better to the target domain. ST has recently shown excellent performance for neural sequence generation tasks such as machine translation [15] and ASR [16][17][18], achieving state-of-the-art performance for semi-supervised ASR when applied in an iterative manner [19]. Classical works in ST [20][21][22] suggest that its performance is not stable if the generated pseudo-labels are highly erroneous, and hence ST is often accompanied by a filtering process to remove such pseudo-labeled utterances from the training data.…”
Section: Introduction (mentioning)
confidence: 99%
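The filtering process mentioned in this statement might look like the following sketch; score_fn and the threshold value are hypothetical, standing in for whatever confidence measure a given ST recipe uses.

```python
def filter_pseudo_labels(pseudo_parallel, score_fn, threshold=0.9):
    """Keep only pseudo-labeled utterances whose confidence clears a threshold.

    pseudo_parallel: iterable of (audio, hypothesis) pairs.
    score_fn: assumed helper returning a confidence score in [0, 1].
    """
    return [(x, y) for x, y in pseudo_parallel if score_fn(x, y) >= threshold]
```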