Interspeech 2020
DOI: 10.21437/interspeech.2020-2164
Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning

Cited by 55 publications (50 citation statements)
References 0 publications
“…A spectrogram is one of the most widely used visual input representations of speech signals in speech analysis tasks, such as ASR [36] and SER [24], using deep learning (DL) models. It shows the signal's strength over time at the different frequencies present in a given waveform.…”
Section: Proposed Age and Gender Classification Methodology (mentioning; confidence: 99%)
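The spectrogram described in the quote above — signal strength over time at each frequency — can be sketched with a plain short-time Fourier transform. The frame length, hop size, and the 440 Hz test tone below are illustrative choices, not values from either paper.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT (illustrative sketch)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rFFT of each frame: time on axis 0, frequency bins on axis 1
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone sampled at 8 kHz
t = np.arange(8000) / 8000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
# the strongest average frequency bin should sit near 440 Hz
peak_bin = int(spec.mean(axis=0).argmax())
```

With a 256-sample frame at 8 kHz, each frequency bin spans 31.25 Hz, so the tone's energy concentrates near bin 14 (437.5 Hz).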
“…For every target language, a subword vocabulary of size 100 is generated using the SentencePiece [19] toolkit. We employ the aforementioned subword-based LID-42 model presented in [7] as the pre-trained multilingual ASR model; it consists of 12 encoder layers and 6 decoder layers with a model dimension of 256. The number of multi-head attention heads is 4, and the inner dimension of the feed-forward network is 2048.…”
Section: Implementation Details (mentioning; confidence: 99%)
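As a rough illustration of the quoted configuration (model dimension 256, feed-forward inner dimension 2048, 12 encoder layers), the per-layer weight count of a Transformer encoder layer can be estimated. The formula below counts only the attention and feed-forward projections and ignores layer norms, embeddings, and the decoder's cross-attention, so it is an approximation rather than the paper's exact model size.

```python
def transformer_layer_params(d_model=256, d_ff=2048):
    """Approximate trainable parameters in one Transformer encoder layer."""
    # self-attention: Q, K, V and output projections, weights plus biases
    attn = 4 * (d_model * d_model + d_model)
    # position-wise feed-forward: d_model -> d_ff -> d_model, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    return attn + ffn

per_layer = transformer_layer_params()  # ~1.3M with the quoted sizes
encoder_total = 12 * per_layer          # 12 encoder layers, as quoted
```

At these sizes the feed-forward block dominates: roughly 1.05M of the ~1.31M parameters per layer.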
“…Pratap et al. [6] introduced a massive single E2E model with up to 1 billion parameters trained on 50 languages. At nearly the same time, Hou et al. [7] reported a language-independent Transformer-based ASR model (LID-42) jointly trained on 6 million training utterances from 42 languages with hybrid CTC-attention multi-task learning [8]. Both achieved significant recognition accuracy improvements on low-resource ASR via transfer learning.…”
Section: Introduction (mentioning; confidence: 99%)
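The hybrid CTC-attention multi-task objective mentioned in the quote is typically an interpolation of the two losses. A minimal sketch follows; the interpolation weight of 0.3 is a commonly used value in hybrid systems, not a number taken from the cited paper.

```python
def hybrid_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Interpolated multi-task objective: lam * L_ctc + (1 - lam) * L_att.

    lam = 0.3 is a commonly used weight in hybrid CTC-attention
    training, not a value confirmed by the cited paper.
    """
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

During training, both branch losses are computed on the same encoder output and backpropagated jointly; the CTC branch encourages monotonic alignments while the attention branch models output dependencies.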
“…With the exception of a few recent works [5,6,7], most previous work on multilingual speech recognition focuses on the benefits of these models for lower-resource or related languages. Nevertheless, in order for these models to be utilized in real-world scenarios and replace their monolingual counterparts, they need to target a variety of languages, with large […]
• Introduction of an informed mixture-of-experts layer, used in the encoder of an RNN-T model, where each expert is assigned to one language, or set of related languages.…”
Section: Introduction (mentioning; confidence: 99%)
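The informed mixture-of-experts idea quoted above — one expert per language (or group of related languages), selected by a known language ID rather than a learned gate — can be sketched as follows. The linear experts, feature dimension, and language labels are illustrative stand-ins, not the cited paper's RNN-T implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # illustrative feature dimension

# one expert per language (or group of related languages);
# each expert is reduced to a single linear map for brevity
experts = {"en": rng.standard_normal((d_model, d_model)),
           "fr": rng.standard_normal((d_model, d_model))}

def informed_moe(x, lang_id):
    """Hard, language-informed routing: the utterance's known language
    ID selects the expert, so no learned gating network is needed."""
    return x @ experts[lang_id]

x = rng.standard_normal(d_model)
y = informed_moe(x, "en")
```

Because routing is determined by metadata instead of a softmax gate, only one expert runs per utterance, keeping the per-utterance compute cost close to that of a single dense layer.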