2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461886
Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model

Abstract: Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models, into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS) [1], and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems w…

Cited by 102 publications (89 citation statements). References 26 publications (44 reference statements).
“…Its size is fixed regardless of the number of variants. As a simple strategy to improve robustness to different accents, we explore including additional training data from different English-accented locales, using the same data as described in [13]. Specifically, we use data from Australia, New Zealand, United Kingdom, Ireland, India, Kenya, Nigeria and South Africa.…”
Section: Robustness To Accents
confidence: 99%
“…Conventional models handle this by using a lexicon that can have multiple pronunciations for a word. Since our E2E models directly predict word-pieces [12], we address this by including accented English data from different locales [13]. Third, given the increased audio-text pairs used in training, we explore using a constant learning rate rather than gradually decaying the learning rate over time, thereby giving even weight to the training examples as training progresses.…”
Section: Introduction
confidence: 99%
“…B1 is an accent-independent model which is trained on the data from all the accents. B2 and B3 have shown strong performance on multi-accent speech recognition in [7]. Specifically, we append accent labels at the end of each label sequence and B2 is trained on the updated sequences from all accents.…”
Section: Baselines
confidence: 99%
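The label-augmentation strategy quoted above (appending an accent label to each target sequence, as in the B2 baseline) can be sketched minimally as follows. The tag format and helper name are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: append an accent tag token to each target
# word-piece sequence, so one model learns to emit the accent
# alongside the transcript. Tag format "<en-gb>" is an assumption.

def append_accent_label(targets, accent):
    """Return the target label sequence with an accent tag appended."""
    return targets + ["<" + accent + ">"]

utterance_targets = ["_hel", "lo", "_wor", "ld"]
augmented = append_accent_label(utterance_targets, "en-gb")
# augmented == ["_hel", "lo", "_wor", "ld", "<en-gb>"]
```

At training time every utterance from a given locale would receive that locale's tag; at inference the model's prediction of the tag can simply be discarded.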
“…Our experiments demonstrate that the combination of a language vector and adapter modules yields the best multilingual E2E system. While previous works have investigated various aspects of data sampling [16,17], as well as architectures that include a language vector [11,18,19], this is the first study to apply adapter modules [20] to speech recognition.…”
Section: *Equal Contribution
confidence: 99%
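A residual adapter module of the kind referenced above (a small per-language bottleneck added to a shared network) can be sketched as below. Dimensions, initialization, and naming are illustrative assumptions, not details from the cited work:

```python
import numpy as np

# Hedged sketch of a residual adapter: a cheap bottleneck projection
# whose output is added back to the input, so per-language parameters
# stay small while the shared encoder is left untouched.

class Adapter:
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, x):
        # Down-project, ReLU, up-project, then residual connection.
        h = np.maximum(x @ self.w_down, 0.0)
        return x + h @ self.w_up

x = np.ones((4, 256))      # (time, d_model) encoder activations
adapter = Adapter(256, 64)
y = adapter(x)
print(y.shape)             # (4, 256): the residual preserves the shape
```

One adapter per language can be inserted after each shared encoder layer, with only the adapter weights trained per language.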
“…At inference time, we assume the language is either specified in the user's preferences, or determined automatically from a language identification system. Various methods of using a language vector have been previously described and directly compared in non-streaming E2E multilingual [11] and multidialect [18] models. The language itself can be represented in several different ways (as a one-hot vector, as an embedding vector, or as a combination of clusters learned through cluster adaptive training (CAT) [23]), but prior work [18,19] has shown that the simple approach of a one-hot vector performs as well as and sometimes better than the more complex methods.…”
Section: Conditioning On Language Vector
confidence: 99%
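The one-hot language-vector conditioning described in the last excerpt can be sketched as appending the same one-hot dialect vector to every acoustic feature frame before the encoder. The locale list, feature dimensions, and function names below are illustrative assumptions:

```python
import numpy as np

# Sketch of one-hot language-vector conditioning: concatenate a
# one-hot dialect identifier onto each acoustic feature frame.

LOCALES = ["en-us", "en-gb", "en-in", "en-au"]  # illustrative locale set

def one_hot(locale):
    v = np.zeros(len(LOCALES))
    v[LOCALES.index(locale)] = 1.0
    return v

def condition_features(features, locale):
    """Append the same one-hot language vector to each frame."""
    lang = np.tile(one_hot(locale), (features.shape[0], 1))
    return np.concatenate([features, lang], axis=1)

frames = np.random.randn(100, 80)        # 100 frames of 80-dim log-mel features
conditioned = condition_features(frames, "en-gb")
print(conditioned.shape)                 # (100, 84)
```

The same vector could instead be fed through a learned embedding; the excerpt notes that the plain one-hot form performs as well as or better than the more complex alternatives.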