2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639655

Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling

Abstract: The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. The approach benefits from training models without a lexicon or frame-level alignments. However, this poses a new problem: it requires more data than conventional DNN-HMM systems. In this work, we use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it towards 4 other BABEL languages using a transfer learning approach. We also explor…
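The porting recipe the abstract describes (pretrain a multilingual seq2seq prior, then transfer its parameters to a new target language) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the layer names and sizes are invented, and real models would reinitialize the output layer because each target language has a different character inventory.

```python
# Hypothetical sketch of seq2seq transfer learning: copy every parameter
# of the multilingual prior that matches the target model's shapes, and
# reinitialize the rest (e.g. the output softmax over the target
# language's characters).  Layer names/sizes here are illustrative only.
import random

def init_params(shape_table, seed=0):
    """Randomly initialize a flat parameter table (name -> list of floats)."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.1) for _ in range(size)]
            for name, size in shape_table.items()}

def transfer(pretrained, target_shapes):
    """Copy each pretrained tensor whose name and size match the target
    model; leave mismatched tensors at their fresh initialization."""
    target = init_params(target_shapes, seed=1)
    copied = []
    for name, values in pretrained.items():
        if name in target and len(target[name]) == len(values):
            target[name] = list(values)
            copied.append(name)
    return target, copied

# Multilingual prior: shared encoder/decoder plus a 200-way output layer
# covering the union of the 10 training languages' character sets.
prior_shapes  = {"encoder.blstm": 64, "decoder.lstm": 32, "decoder.output": 200}
# Target-language model: same backbone, but a 45-way output layer.
target_shapes = {"encoder.blstm": 64, "decoder.lstm": 32, "decoder.output": 45}

prior = init_params(prior_shapes)
model, copied = transfer(prior, target_shapes)
print(sorted(copied))  # → ['decoder.lstm', 'encoder.blstm']
```

The encoder and decoder body carry over the multilingual acoustic knowledge; only the language-specific output projection starts from scratch before fine-tuning on the target language's data.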

Cited by 105 publications (87 citation statements)
References 30 publications
“…In the MLASR transfer learning scenario, the base MLASR model was trained exactly the same as in [25]. The base model was trained using 10 selected Babel languages, roughly 640 hours of data: Cantonese, Bengali, Pashto, Turkish, Vietnamese, Haitian, Tamil, Kurmanji, Tokpisin, and Georgian.…”
Section: Methods (mentioning)
confidence: 99%
“…III-D, CNN layers are often used together with BLSTM layers on top to extract frame-wise hidden vectors. We explore two types of encoder structures: BLSTM (RNN-based) and VGGBLSTM (CNN-RNN-based) [44]:…”
Section: E. Multi-encoder Multi-array (mentioning)
confidence: 99%
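The VGGBLSTM encoder mentioned in this statement stacks VGG-style CNN blocks below the BLSTM. In common configurations (an assumption here, not stated in the quote) each of two CNN blocks ends in a stride-2 max-pooling over time, so the BLSTM receives frame-wise hidden vectors at a 4x-reduced frame rate. A quick sketch of that arithmetic:

```python
def pooled_length(n_frames, n_blocks=2, pool_stride=2):
    """Number of frame-wise hidden vectors left after the CNN front-end:
    each block's time pooling divides the frame count by its stride
    (rounding up, as frameworks typically pad the last window)."""
    for _ in range(n_blocks):
        n_frames = (n_frames + pool_stride - 1) // pool_stride  # ceil division
    return n_frames

print(pooled_length(1000))  # → 250
```

This subsampling is one reason CNN front-ends pair well with BLSTM layers: shorter sequences make the recurrent pass and the attention computation over encoder states considerably cheaper.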
“…As [15] has shown, the architecture we employ adheres to the latency constraints required for interactive applications. In contrast, prior E2E multilingual work has been limited to attention-based models that do not admit a straightforward streaming implementation [10][11][12][13].…”
Section: *Equal Contribution (mentioning)
confidence: 99%
“…More recently, end-to-end (E2E) multilingual systems have gained traction as a way to further simplify the training and serving of such models. These models replace the acoustic, pronunciation, and language models of n different languages with a single model while continuing to show improved performance over monolingual E2E systems [10][11][12][13]. Even as these E2E systems have shown promising results, it has not been conclusively demonstrated that they can be competitive with state-of-the-art conventional models, nor that they can do so while still operating within the real-time constraints of interactive applications such as a speech-enabled assistant.…”
Section: Introduction (mentioning)
confidence: 99%