2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021
DOI: 10.1109/asru51503.2021.9688028
Kaizen: Continuously Improving Teacher Using Exponential Moving Average for Semi-Supervised Speech Recognition

Cited by 14 publications (9 citation statements). References 28 publications.
“…transcription generated by some method. Different ways of inferring pseudo-labels PL(x; θ) have been proposed [22,31,38,26,29,18,7], including both greedy and beam-search decoding, with or without an external LM, and with variants on the "teacher" AM model θ. IPL [38] and slimIPL [26] are continuous PL approaches, where a single AM (with parameters θ) is continuously trained.…”
Section: Acoustic (AM) and Language (LM) Models
confidence: 99%
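The "continuously improving teacher" of the paper this page indexes is maintained as an exponential moving average (EMA) of the student's weights, per the paper's title. A minimal sketch of that update rule, with plain Python lists standing in for parameter tensors (the function name and `decay` default are illustrative, not taken from the paper):

```python
def ema_update(teacher, student, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student, elementwise
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]
```

With `decay` close to 1, the teacher changes slowly and averages over many recent student checkpoints, which is what makes its pseudo-labels more stable than the student's own outputs.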
“…The two dominant methods for leveraging unlabeled audio are unsupervised pre-training via self-supervision (SSL) [6,19,11,4] and semi-supervised self-training [22,38,26,29,16,18], or pseudo-labeling (PL). In pre-training, a model is trained to process the raw unlabeled data to extract features that solve some pretext task, followed by supervised fine-tuning on some downstream ASR task.…”
Section: Introduction
confidence: 99%
“…This can be seen as an alternative caching mechanism to [42] for exploiting older models. A similar approach to MPL was proposed in [59], which focused on lower-resource settings and conducted experiments on a hybrid ASR system in addition to a CTC-based end-to-end system. This paper thoroughly investigates MPL on its robustness against variations in domain mismatch severity and over-fitting to LM knowledge.…”
Section: B. Pseudo-Labeling With Multiple Iterations
confidence: 99%
“…We adopt continuous PL (shown in Fig. 2c) [23,24] to compute the L_ASR loss in both stage 1 and stage 2. Note that the continuous PL approach could also be used in other FL approaches like FedNorm and FedExtract.…”
Section: Unsupervised Training With Continuous PL
confidence: 99%
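The continuous-PL loop these snippets cite interleaves pseudo-label generation and training in a single pass: the teacher transcribes an unlabeled batch and the student immediately takes a gradient step on those transcriptions. A hedged toy sketch of one such step, where a single scalar parameter and a squared-error objective stand in for the acoustic model and its ASR loss (all names and the loss choice are illustrative):

```python
def decode(theta, x):
    # stand-in for transcription PL(x; theta); a real AM would run
    # greedy or beam-search decoding here
    return theta * x

def pl_step(theta_student, theta_teacher, batch, lr=0.1):
    # pseudo-label the unlabeled batch with the (e.g. EMA) teacher
    pseudo = [decode(theta_teacher, x) for x in batch]
    # one squared-error gradient step for the student on (x, pseudo-label) pairs
    grad = sum(2 * (decode(theta_student, x) - y) * x
               for x, y in zip(batch, pseudo)) / len(batch)
    return theta_student - lr * grad
```

Keeping the teacher distinct from the student (rather than having the model label its own batch) is what prevents the degenerate case of training on exactly its own current outputs.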
“…Then, part of the training burden is moved to the server, reducing computation on clients. Additionally, DecoupleFL adopts pseudo-labeling (PL) approaches [23,24] for unsupervised learning, avoiding the unrealistic assumption of labeled data. Moreover, one potential concern is that communicating features might lead to privacy leakage.…”
Section: Introduction
confidence: 99%