End-to-End Speech Recognition from Federated Acoustic Models

Gao, Yan; Parcollet, Titouan; Zaiem, Salah; Fernández-Marqués, Javier; Gusmão, Pedro P. B. de; Beutel, Daniel J.; Lane, Nicholas D.

doi:10.48550/arxiv.2104.14297

Cited by 4 publications

(7 citation statements)

References 26 publications

(45 reference statements)


“…Recently, performing on-device federated training of acoustic models has attracted considerable attention [7,9,12,23,44]. In [23], FL was employed for a keyword spotting task and the development of a wake-word detection system, whereas, [7,9] investigated the effect of non-i.i.d. distributions on the same task.…”

Section: Related Work (mentioning)

confidence: 99%


Federated Self-training for Semi-supervised Audio Recognition

Tsouvalas, Saeed, Özçelebi (2022). ACM Trans. Embed. Comput. Syst.


Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices like smartphones and virtual assistants, labeling is entrusted to the clients or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users’ devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully-supervised federated model.


Section: Related Work (mentioning)

confidence: 99%

“…distributions on the same task. In [7], a highly skewed data distribution scenario was considered, where a large set of speakers used their devices to record a set of sentences. To address the challenges introduced due to the non-i.i.d.…”

Section: Related Work (mentioning)

confidence: 99%

Federated Self-training for Semi-supervised Audio Recognition

Tsouvalas, Saeed, Özçelebi (2022). ACM Trans. Embed. Comput. Syst.


“…Recently, performing on-device federated training of acoustic models has attracted considerable attention [7,9,12,23,42]. In [23], FL was employed for a keyword spotting task and the development of a wake-word detection system, whereas, [7,9] investigated the effect of non-i.i.d. distributions on the same task.…”

Section: Related Work (mentioning)

confidence: 99%

Federated Self-Training for Semi-Supervised Audio Recognition

Tsouvalas, Saeed, Özçelebi (2021). Preprint.


Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices like smartphones and virtual assistants, labeling is entrusted to the clients or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users' devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully-supervised federated model. CCS Concepts: • Computing methodologies → Semi-supervised learning settings; Neural networks; • Human-centered computing → Ubiquitous and mobile computing.


“…FL-based adaptation for ASR models faces several unique challenges including the lack of ground truth transcriptions, high compute and cross-device network communication costs, the non independent and identical distribution of data (non-IIDness), and the difficulty of providing privacy guarantees. Several recent works have considered cross-device FL for ASR applications [14,15,16,17,18,19]. In particular, the challenge of training on non-IID data has been addressed using weighted model averaging [14,15] and federated variation noise [17].…”

Section: Introduction (mentioning)

confidence: 99%

Federated Domain Adaptation for ASR with Full Self-Supervision

Jia, Mahadeokar, Zheng, et al. (2022). Preprint.


Cross-device federated learning (FL) protects user privacy by collaboratively training a model on user devices, therefore eliminating the need for collecting, storing, and manually labeling user data. Previous works have considered cross-device FL for automatic speech recognition (ASR), however, there are a few important challenges that have not been fully addressed. These include the lack of ground-truth ASR transcriptions, and the scarcity of compute resource and network bandwidth on edge devices. In this paper, we address these two challenges. First, we propose a federated learning system to support on-device ASR adaptation with full self-supervision, which uses self-labeling together with data augmentation and filtering techniques. The proposed system can improve a strong Emformer-Transducer based ASR model pretrained on out-of-domain data, using in-domain audios without any ground-truth transcriptions. Second, to reduce the training cost, we propose a self-restricted RNN Transducer (SR-RNN-T) loss, a new variant of alignment-restricted RNN-T that uses Viterbi forced-alignment from self-supervision. To further reduce the compute and network cost, we systematically explore adapting only a subset of weights in the Emformer-Transducer. Our best training recipe achieves a 12.9% relative WER reduction over the strong out-of-domain baseline, which equals 70% of the reduction achievable with full human supervision and centralized training.
