2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461886
Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model

Abstract: Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models, into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS) [1], and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems w…

Cited by 102 publications (89 citation statements). References 26 publications (44 reference statements).
“…Its size is fixed regardless of the number of variants. As a simple strategy to improve robustness to different accents, we explore including additional training data from different English-accented locales, using the same data as described in [13]. Specifically, we use data from Australia, New Zealand, United Kingdom, Ireland, India, Kenya, Nigeria and South Africa.…”
Section: Robustness To Accents
confidence: 99%
“…Conventional models handle this by using a lexicon that can have multiple pronunciations for a word. Since our E2E models directly predict word-pieces [12], we address this by including accented English data from different locales [13]. Third, given the increased audio-text pairs used in training, we explore using a constant learning rate rather than gradually decaying the learning rate over time, thereby giving even weight to the training examples as training progresses.…”
Section: Introduction
confidence: 99%
“…B1 is an accent-independent model which is trained on the data from all the accents. B2 and B3 have shown strong performance on multi-accent speech recognition in [7]. Specifically, we append accent labels at the end of each label sequence and B2 is trained on the updated sequences from all accents.…”
Section: Baselines
confidence: 99%
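The label-augmentation strategy quoted above (appending an accent label to each target sequence, as in the B2 baseline) can be sketched minimally as follows. The tag format and helper name are illustrative assumptions, not taken from the paper:

```python
# Hypothetical sketch: append an accent tag token to each target
# word-piece sequence, so one model learns to emit the accent
# alongside the transcript. Tag format "<en-gb>" is an assumption.

def append_accent_label(targets, accent):
    """Return the target label sequence with an accent tag appended."""
    return targets + ["<" + accent + ">"]

utterance_targets = ["_hel", "lo", "_wor", "ld"]
augmented = append_accent_label(utterance_targets, "en-gb")
# augmented == ["_hel", "lo", "_wor", "ld", "<en-gb>"]
```

At training time every utterance from a given locale would receive that locale's tag; at inference the model's prediction of the tag can simply be discarded.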
“…Our experiments demonstrate that the combination of a language vector and adapter modules yields the best multilingual E2E system. While previous works have investigated various aspects of data sampling [16,17], as well as architectures that include a language vector [11,18,19], this is the first study to apply adapter modules [20] to speech recognition.…”
Section: *Equal Contribution
confidence: 99%
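A residual adapter module of the kind referenced above (a small per-language bottleneck added to a shared network) can be sketched as below. Dimensions, initialization, and naming are illustrative assumptions, not details from the cited work:

```python
import numpy as np

# Hedged sketch of a residual adapter: a cheap bottleneck projection
# whose output is added back to the input, so per-language parameters
# stay small while the shared encoder is left untouched.

class Adapter:
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, x):
        # Down-project, ReLU, up-project, then residual connection.
        h = np.maximum(x @ self.w_down, 0.0)
        return x + h @ self.w_up

x = np.ones((4, 256))      # (time, d_model) encoder activations
adapter = Adapter(256, 64)
y = adapter(x)
print(y.shape)             # (4, 256): the residual preserves the shape
```

One adapter per language can be inserted after each shared encoder layer, with only the adapter weights trained per language.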
“…At inference time, we assume the language is either specified in the user's preferences, or determined automatically from a language identification system. Various methods of using a language vector have been previously described and directly compared in non-streaming E2E multilingual [11] and multidialect [18] models. The language itself can be represented in several different ways (as a one-hot vector, as an embedding vector, or as a combination of clusters learned through cluster adaptive training (CAT) [23]), but prior work [18,19] has shown that the simple approach of a one-hot vector performs as well as and sometimes better than the more complex methods.…”
Section: Conditioning On Language Vector
confidence: 99%
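The one-hot language-vector conditioning described in the last excerpt can be sketched as appending the same one-hot dialect vector to every acoustic feature frame before the encoder. The locale list, feature dimensions, and function names below are illustrative assumptions:

```python
import numpy as np

# Sketch of one-hot language-vector conditioning: concatenate a
# one-hot dialect identifier onto each acoustic feature frame.

LOCALES = ["en-us", "en-gb", "en-in", "en-au"]  # illustrative locale set

def one_hot(locale):
    v = np.zeros(len(LOCALES))
    v[LOCALES.index(locale)] = 1.0
    return v

def condition_features(features, locale):
    """Append the same one-hot language vector to each frame."""
    lang = np.tile(one_hot(locale), (features.shape[0], 1))
    return np.concatenate([features, lang], axis=1)

frames = np.random.randn(100, 80)        # 100 frames of 80-dim log-mel features
conditioned = condition_features(frames, "en-gb")
print(conditioned.shape)                 # (100, 84)
```

The same vector could instead be fed through a learned embedding; the excerpt notes that the plain one-hot form performs as well as or better than the more complex alternatives.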