2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003775
Personalization of End-to-End Speech Recognition on Mobile Devices for Named Entities

Abstract: We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amounts of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are…
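The keyword-dependent precision and recall mentioned in the abstract can be illustrated with a minimal sketch: precision counts how many keyword tokens the recognizer emitted that actually occur in the reference, and recall counts how many reference keyword occurrences the recognizer recovered. This is an assumed reading of the metric, not the paper's exact definition; `keyword_precision_recall` and its arguments are hypothetical names.

```python
from collections import Counter

def keyword_precision_recall(refs, hyps, keywords):
    """Keyword-dependent precision/recall over parallel transcripts (sketch).

    refs, hyps: lists of reference and hypothesis transcripts (strings).
    keywords: set of keyword strings (e.g. personalized names).
    """
    tp = fp = fn = 0
    for ref, hyp in zip(refs, hyps):
        # Count keyword occurrences on each side of one utterance pair.
        rc = Counter(w for w in ref.split() if w in keywords)
        hc = Counter(w for w in hyp.split() if w in keywords)
        for w in keywords:
            tp += min(rc[w], hc[w])         # keyword emitted and correct
            fp += max(hc[w] - rc[w], 0)     # keyword emitted spuriously
            fn += max(rc[w] - hc[w], 0)     # keyword missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, if the recognizer gets "alice" right but misrecognizes "bob" as "rob", precision over the emitted keywords is 1.0 while recall over the reference keywords is 0.5.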

Cited by 45 publications (28 citation statements) · References 24 publications (30 reference statements)
“…Khodak et al [270] and Jiang et al [250] explore the connection between FL and MAML, and show how the MAML setting is a relevant framework to model the personalization objectives for FL. Chai Sim et al [102] applied local fine tuning to personalize speech recognition models in federated learning. Fallah et al [181] developed a new algorithm called Personalized FedAvg by connecting MAML instead of Reptile to federated learning.…”
Section: Local Fine Tuning and Meta-learning
confidence: 99%
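The local fine-tuning idea in the statement above — start from a globally (e.g. federatedly) trained model and take a few gradient steps on one user's data — can be sketched on a toy 1-D linear model. This is an illustrative assumption, not any of the cited algorithms; `local_fine_tune` and its parameters are hypothetical.

```python
def local_fine_tune(w_global, user_data, lr=0.1, steps=5):
    """Personalize a global model with a few local SGD steps (sketch).

    w_global: (weight, bias) of a 1-D linear model y = w*x + b.
    user_data: list of (x, y) pairs from one user's device.
    Returns a personalized copy; the global parameters are untouched.
    """
    w, b = w_global
    for _ in range(steps):
        for x, y in user_data:
            err = (w * x + b) - y   # gradient of 0.5*(pred - y)**2
            w -= lr * err * x
            b -= lr * err
    return w, b
```

The key design point is that personalization happens on-device after global training, so each user's model drifts toward their own data distribution without changing the shared model.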
“…The models were trained using the efficient implementation [13] in TensorFlow [14]. We measured the success of the modified model using the word error rate (WER) metric as well as the name recall rate [4] as described below:…”
Section: Results
confidence: 99%
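The word error rate (WER) referenced in the statement above is the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch (standard dynamic-programming WER, not the cited implementation; `wer` is a hypothetical helper):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                     # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For instance, one substituted word in a three-word reference gives a WER of 1/3.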
“…Because the size of E2E models is much smaller than that of hybrid models, E2E models have clear advantages when being deployed to device. Therefore, personalization or adaptation of E2E models [119], [120], [126], [127] is a rapidly growing area. While it is possible to adapt every user's model in the cloud and then push it back to the device, it is more reasonable to adapt the model on device, which requires adjusting the adaptation algorithm to overcome the challenge of limited memory and computation power [119].…”
Section: Summary and Discussion
confidence: 99%
“…Because AED and RNN-T also have components corresponding to the language model, there are also techniques specific to adapting the language modeling aspect of E2E models, for instance using a text embedding instead of an acoustic embedding to bias an E2E model in order to produce outputs relevant to the particular recognition context [123]-[125]. If the new domain differs from the source domain mainly in content instead of acoustics, domain adaptation on E2E models can be performed by either interpolating the E2E model with an external language model or updating language-model-related components inside the E2E model with the text-to-speech audio generated from the text in the new domain [126], [127], discussed in Sec. XII.…”
Section: Adaptation Algorithms for NN-based ASR
confidence: 99%