Interspeech 2018
DOI: 10.21437/Interspeech.2018-1241

Neural Language Codes for Multilingual Acoustic Models

Abstract: Multilingual speech recognition is one of the most costly AI problems, because each of the world's 7,000+ languages, and even different accents, requires its own acoustic model to obtain the best recognition performance. Even though they all use the same phoneme symbols, each language and accent imposes its own coloring or "twang". Many adaptive approaches have been proposed, but they require further training and additional data, and are generally inferior to monolingually trained models. In this paper, we propose a different appro…

Cited by 6 publications (6 citation statements)
References 16 publications
“…In the world of speech recognition, training a single recognizer for multiple languages is not a thematic stranger [3] from Hidden Markov Model (HMM) based models [17,18], hybrid models [19] to end-to-end neural based models with CTC [20,21] or sequence-to-sequence models [22,5,23,24,25,26], with the last approach being inspired by the success of multilingual machine translation [1,2]. The literature especially mentions the merits of disclosing the language identity (when the utterance is supposed to belong to a single language) to the model, whose architecture is designed to incorporate the language information.…”
Section: Related Work and Comparison
confidence: 99%
“…One of the manifestations is language gating from either language embeddings [21] or language codes [20,27] that aim at selecting a subset of the neurons in the network hidden layer. In our current approach, this effect can be achieved by factorizing further Equation 15 [15]:…”
Section: Related Work and Comparison
confidence: 99%
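The language-gating idea quoted above — a language embedding or language code selecting a subset of the neurons in a shared hidden layer — can be sketched with NumPy. This is a minimal illustration, not the cited papers' exact architecture; the dimensions, the sigmoid gate, and all parameter names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 languages, 8-dim language embedding, 16 hidden units.
n_langs, emb_dim, hidden_dim = 4, 8, 16

# Parameters that would be learned jointly with the acoustic model;
# randomly initialized here purely for illustration.
lang_embeddings = rng.normal(size=(n_langs, emb_dim))  # one embedding per language
W_gate = rng.normal(size=(emb_dim, hidden_dim))        # projects embedding to gate logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_hidden(hidden, lang_id):
    """Modulate a shared hidden layer with a language-dependent gate in (0, 1).

    A gate value near 0 effectively switches a neuron off for that language,
    so each language selects its own subset of the shared hidden units.
    """
    gate = sigmoid(lang_embeddings[lang_id] @ W_gate)  # shape: (hidden_dim,)
    return hidden * gate

hidden = rng.normal(size=(hidden_dim,))
out_lang0 = gated_hidden(hidden, lang_id=0)
out_lang1 = gated_hidden(hidden, lang_id=1)
# The same shared layer produces differently modulated activations per language.
```

The multiplicative gate is what makes this "gating" rather than plain feature concatenation: the language signal scales existing neurons instead of adding new inputs.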
“…In studies using popular end-to-end architectures such as RNN-T or Listen, Attend and Spell (LAS) [15], multilingual ASR performance is usually enhanced by providing auxiliary language inputs to the model [16][17][18][19][20][21]. Depending on whether or not the language spoken is known beforehand at runtime, language inputs can be provided in the form of constant one-hot vectors or on-the-fly prediction vectors (e.g., posteriors from a streaming LID model), respectively.…”
Section: Introduction
confidence: 99%
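The auxiliary language input described above — a constant one-hot vector when the language is known, or LID posteriors predicted on the fly when it is not — amounts to appending a small vector to each acoustic feature frame. A minimal sketch, assuming 4 languages and 40-dim features (both numbers, and the function name, are made up for illustration):

```python
import numpy as np

n_langs, feat_dim = 4, 40  # hypothetical: 4 candidate languages, 40-dim features

def language_input(lang_id=None, lid_posteriors=None, n_langs=n_langs):
    """Build the auxiliary language vector fed to the multilingual model.

    If the spoken language is known beforehand at runtime, use a constant
    one-hot vector; otherwise fall back to on-the-fly prediction vectors,
    e.g. posterior probabilities from a streaming LID model.
    """
    if lang_id is not None:
        vec = np.zeros(n_langs)
        vec[lang_id] = 1.0
        return vec
    assert lid_posteriors is not None, "need either lang_id or lid_posteriors"
    return np.asarray(lid_posteriors, dtype=float)

frame = np.random.randn(feat_dim)

# Known language: constant one-hot vector appended to every frame.
x_known = np.concatenate([frame, language_input(lang_id=2)])

# Unknown language: soft LID posteriors appended instead.
x_unknown = np.concatenate([frame, language_input(lid_posteriors=[0.7, 0.1, 0.1, 0.1])])
```

In practice the same trained network consumes both variants, since the one-hot case is just a degenerate (fully confident) posterior.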
“…Automatic speech recognition (ASR) systems are becoming increasingly ubiquitous in today's world as more and more mobile devices, home appliances and automobiles add ASR capabilities. Although many improvements have been made in multi-dialect [1,2], multi-accent [3,4] and even truly multilingual [5,6,7] ASR in recent years, they often only support a small subset of languages [8]. In order to get a satisfactory Word Error Rate (WER) for a larger range of languages, language identification (LID) models have been combined with monolingual ASR systems to allow utterance-level switching for a larger set of languages [9] with reasonable accuracy, even over a set of up to 8 candidate languages.…”
Section: Introduction
confidence: 99%
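The utterance-level switching scheme mentioned in that statement — an LID model picking which monolingual recognizer handles each utterance — reduces to a simple dispatch. The stub LID and recognizer functions below are placeholders standing in for real models, not any cited system's API:

```python
def transcribe(utterance, lid_model, recognizers):
    """Utterance-level switching: run LID once per utterance,
    then route the utterance to the matching monolingual ASR system."""
    lang = lid_model(utterance)          # e.g. returns a language tag like "en"
    return recognizers[lang](utterance)

# Toy stand-ins for a real LID model and monolingual recognizers.
lid_model = lambda utt: "de" if "ü" in utt else "en"
recognizers = {
    "en": lambda utt: f"[en] {utt}",
    "de": lambda utt: f"[de] {utt}",
}

print(transcribe("über", lid_model, recognizers))  # routed to the German recognizer
```

This design keeps each monolingual model's accuracy intact, but its overall WER is bounded by the LID model's accuracy over the candidate-language set, which is why it degrades as the set grows.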