Interspeech 2017
DOI: 10.21437/interspeech.2017-1242
An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation

Abstract: Different training and adaptation techniques for multilingual Automatic Speech Recognition (ASR) are explored in the context of hybrid systems, exploiting Deep Neural Networks (DNN) and Hidden Markov Models (HMM). In multilingual DNN training, the hidden layers (possibly extracting bottleneck features) are usually shared across languages, and the output layer can either model multiple sets of language-specific senones or one single universal IPA-based multilingual senone set. Both architectures are investigate…

Cited by 42 publications (30 citation statements). References 24 publications.
“…To alleviate the limitation of the previous approach, the final layer of the seq2seq model, which is mainly responsible for classification, is retrained on the target language. In previous works [11,30] on hybrid DNN/RNN models and CTC-based models [12,15], only the softmax layer is adapted. In our case, however, both the attention decoder and the CTC decoder have to be retrained on the target language.…”
Section: Stage 1 - Retraining Decoder Only
confidence: 99%
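The statement above contrasts which parameters are updated under each adaptation scheme: softmax-only adaptation for hybrid DNN/RNN and CTC models versus retraining both decoders for the seq2seq model. A minimal sketch of that distinction (layer names and the helper function are illustrative, not from the paper):

```python
# Hypothetical sketch: which parameter groups are updated during
# target-language adaptation under each scheme described in the text.
# Layer names are illustrative, not taken from any specific toolkit.

def trainable_params(model_layers, scheme):
    """Return the subset of layers updated during target-language adaptation."""
    if scheme == "softmax_only":
        # Hybrid DNN/RNN and CTC adaptation: only the output (softmax) layer.
        return [l for l in model_layers if l == "softmax"]
    if scheme == "decoder":
        # Seq2seq case: both the attention decoder and the CTC decoder.
        return [l for l in model_layers if l in ("attention_decoder", "ctc_decoder")]
    # Fallback: full fine-tuning updates everything.
    return list(model_layers)

layers = ["encoder", "attention_decoder", "ctc_decoder", "softmax"]
print(trainable_params(layers, "softmax_only"))  # ['softmax']
print(trainable_params(layers, "decoder"))       # ['attention_decoder', 'ctc_decoder']
```

In a real training loop this selection would translate into freezing the remaining layers (e.g. excluding them from the optimizer) while the listed ones continue to receive gradient updates.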
“…We hypothesize that biphone targets cover more variability than the corresponding monophones, especially when they are shared by multiple languages. As reported in [20], language-specific characteristics cannot be well modeled by an IPA-based universal network. Language adaptive training (LAT) could be a solution to better model these variabilities across languages.…”
Section: Comparison Between CTC and End-to-End LF-MMI
confidence: 98%
“…It has been demonstrated that the layers close to the output are more language-related, and that training the last layer in a language-dependent manner helps an IPA-based multilingual system better capture language specificity [20]. More specifically, the output of the neural network for language s, o_s^L, is calculated as…”
Section: Language Adaptive Training
confidence: 99%
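The language-dependent last layer described above pairs shared hidden layers with one output weight matrix per language. A minimal sketch of that forward pass, assuming the common form o_s = softmax(W_s · h) with illustrative weights and dimensions (none of the numbers below are from the paper):

```python
import math

# Sketch of a language-dependent output layer over shared hidden features.
# Hidden layers are shared across languages; each language s has its own
# last-layer weight matrix W[s]. All values here are illustrative.

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def language_output(h, W_s):
    """o_s = softmax(W_s h): apply language s's output weights to shared features h."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_s]
    return softmax(logits)

h = [0.2, -0.1, 0.4]  # shared hidden-layer (e.g. bottleneck) features
W = {
    "fr": [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],                  # 2 output classes
    "de": [[0.3, 0.1, 0.0], [0.2, 0.2, -0.1], [0.1, 0.0, 0.1]], # 3 output classes
}
for lang, W_s in W.items():
    o = language_output(h, W_s)
    # Each language gets its own posterior over its own senone set; sums to 1.
    print(lang, len(o), round(sum(o), 6))
```

The design point the citation makes is that only W_s is language-specific; the features h come from layers trained on all languages jointly.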
“…The outputs are monophone-based tied states (also known as pdfs in Kaldi [23]) for each language, as presented in Section 5.1. The training labels for these networks are generated using GMM-HMM based speech recognizers [24,25]. The numbers of classes for French, German, Portuguese, Spanish and Russian are 124, 133, 145, 130 and 151, respectively.…”
Section: Neural Network Training
confidence: 99%
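The per-language class counts quoted above fix the size of each language-specific output layer. A small sketch of how those counts determine the output-layer weight shapes (the hidden dimension of 1024 and the helper function are assumptions for illustration, not from the paper):

```python
# Per-language output dimensions quoted in the citation statement:
# monophone-based tied states (pdfs in Kaldi) per language.
NUM_PDFS = {
    "French": 124,
    "German": 133,
    "Portuguese": 145,
    "Spanish": 130,
    "Russian": 151,
}

def output_layer_shape(hidden_dim, language):
    """Weight-matrix shape (classes x hidden) of a language-specific softmax layer."""
    return (NUM_PDFS[language], hidden_dim)

# Hypothetical hidden dimension; the paper does not state 1024 here.
for lang in NUM_PDFS:
    print(lang, output_layer_shape(1024, lang))
```

Only these output matrices differ in size across languages; the shared hidden layers keep a single common shape.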