2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639655

Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling

Abstract: The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. The approach benefits from training models without a lexicon or frame-level alignments. However, this poses a new problem: it requires more data than conventional DNN-HMM systems. In this work, we use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it towards 4 other BABEL languages using a transfer learning approach. We also explor…
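The porting recipe the abstract describes (pretrain a multilingual seq2seq prior, then transfer its parameters to a new target language) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the layer names and sizes are invented, and real models would reinitialize the output layer because each target language has a different character inventory.

```python
# Hypothetical sketch of seq2seq transfer learning: copy every parameter
# of the multilingual prior that matches the target model's shapes, and
# reinitialize the rest (e.g. the output softmax over the target
# language's characters).  Layer names/sizes here are illustrative only.
import random

def init_params(shape_table, seed=0):
    """Randomly initialize a flat parameter table (name -> list of floats)."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.1) for _ in range(size)]
            for name, size in shape_table.items()}

def transfer(pretrained, target_shapes):
    """Copy each pretrained tensor whose name and size match the target
    model; leave mismatched tensors at their fresh initialization."""
    target = init_params(target_shapes, seed=1)
    copied = []
    for name, values in pretrained.items():
        if name in target and len(target[name]) == len(values):
            target[name] = list(values)
            copied.append(name)
    return target, copied

# Multilingual prior: shared encoder/decoder plus a 200-way output layer
# covering the union of the 10 training languages' character sets.
prior_shapes  = {"encoder.blstm": 64, "decoder.lstm": 32, "decoder.output": 200}
# Target-language model: same backbone, but a 45-way output layer.
target_shapes = {"encoder.blstm": 64, "decoder.lstm": 32, "decoder.output": 45}

prior = init_params(prior_shapes)
model, copied = transfer(prior, target_shapes)
print(sorted(copied))  # → ['decoder.lstm', 'encoder.blstm']
```

The encoder and decoder body carry over the multilingual acoustic knowledge; only the language-specific output projection starts from scratch before fine-tuning on the target language's data.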

Cited by 105 publications (87 citation statements)
References 30 publications
“…In the MLASR transfer learning scenario, the base MLASR model was trained exactly the same as in [25]. The base model was trained using 10 selected Babel languages, roughly 640 hours of data: Cantonese, Bengali, Pashto, Turkish, Vietnamese, Haitian, Tamil, Kurmanji, Tokpisin, and Georgian.…”
Section: Methods (mentioning)
confidence: 99%
“…III-D, CNN layers are often used together with BLSTM layers on top to extract frame-wise hidden vectors. We explore two types of encoder structures: BLSTM (RNN-based) and VGGBLSTM (CNN-RNN-based) [44]:…”
Section: E. Multi-encoder Multi-array (mentioning)
confidence: 99%
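The VGGBLSTM encoder mentioned in this statement stacks VGG-style CNN blocks below the BLSTM. In common configurations (an assumption here, not stated in the quote) each of two CNN blocks ends in a stride-2 max-pooling over time, so the BLSTM receives frame-wise hidden vectors at a 4x-reduced frame rate. A quick sketch of that arithmetic:

```python
def pooled_length(n_frames, n_blocks=2, pool_stride=2):
    """Number of frame-wise hidden vectors left after the CNN front-end:
    each block's time pooling divides the frame count by its stride
    (rounding up, as frameworks typically pad the last window)."""
    for _ in range(n_blocks):
        n_frames = (n_frames + pool_stride - 1) // pool_stride  # ceil division
    return n_frames

print(pooled_length(1000))  # → 250
```

This subsampling is one reason CNN front-ends pair well with BLSTM layers: shorter sequences make the recurrent pass and the attention computation over encoder states considerably cheaper.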
“…As [15] has shown, the architecture we employ adheres to the latency constraints required for interactive applications. In contrast, prior E2E multilingual work has been limited to attention-based models that do not admit a straightforward streaming implementation [10][11][12][13].…”
Section: *Equal Contribution (mentioning)
confidence: 99%
“…More recently, end-to-end (E2E) multilingual systems have gained traction as a way to further simplify the training and serving of such models. These models replace the acoustic, pronunciation, and language models of n different languages with a single model while continuing to show improved performance over monolingual E2E systems [10][11][12][13]. Even as these E2E systems have shown promising results, it has not been conclusively demonstrated that they can be competitive with state-of-the-art conventional models, nor that they can do so while still operating within the real-time constraints of interactive applications such as a speech-enabled assistant.…”
Section: Introduction (mentioning)
confidence: 99%