ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053057
Improving Language Identification for Multilingual Speakers

Abstract: Spoken language identification (LID) technologies have improved in recent years from discriminating largely distinct languages to discriminating highly similar languages or even dialects of the same language. One aspect that has been mostly neglected, however, is discrimination of languages for multilingual speakers, despite being a primary target audience of many systems that utilize LID technologies. As we show in this work, LID systems can have a high average accuracy for most combinations of languages whil…
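As a hypothetical numeric illustration of the abstract's claim (the figures below are invented, not taken from the paper), a high overall accuracy can coexist with much lower accuracy on the subset of utterances from multilingual speakers:

# Hypothetical illustration: overall accuracy can mask a weak subgroup.
def accuracy(results):
    return sum(results) / len(results)

# 1 = correct language prediction, 0 = error (invented counts)
monolingual = [1] * 950 + [0] * 50    # 95% accurate
multilingual = [1] * 70 + [0] * 30    # 70% accurate

overall = accuracy(monolingual + multilingual)
print(f"overall:      {overall:.3f}")                 # ~0.927
print(f"multilingual: {accuracy(multilingual):.3f}")  # 0.700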

Cited by 6 publications (4 citation statements)
References 15 publications
“…They work with 11 different models and use ReLU, dropout, Adam, batch normalization, and various other techniques to get good results. Literature [25] proposed a system with an acoustic model and a context-aware model. They built the model with 4 convolutional layers of 128 units, 4 fully connected layers of 1024 units, 1 fully connected layer of 512 units, 1 temporal pooling layer computing a mean and standard deviation, 1 fully connected layer of 1024 units, and finally a softmax output layer.…”
Section: Related Work
Mentioning confidence: 99%
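The layer stack quoted above reads as a frame-level acoustic model followed by utterance-level statistics pooling. The following is a minimal PyTorch sketch of that stack as described in the citation statement; the input feature dimension, convolution kernel size, number of target languages, and the final projection to the language classes before the softmax are assumptions for illustration, not values from the paper.

import torch
import torch.nn as nn

class AcousticLIDModel(nn.Module):
    """Sketch of the quoted acoustic-model stack. Layer widths follow the
    quote; n_feats, kernel_size, and n_langs are assumed."""
    def __init__(self, n_feats=40, n_langs=8):
        super().__init__()
        # 4 convolutional layers with 128 units each
        convs, in_ch = [], n_feats
        for _ in range(4):
            convs += [nn.Conv1d(in_ch, 128, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = 128
        self.convs = nn.Sequential(*convs)
        # 4 fully connected layers with 1024 units, then 1 with 512 units
        fcs, in_dim = [], 128
        for _ in range(4):
            fcs += [nn.Linear(in_dim, 1024), nn.ReLU()]
            in_dim = 1024
        fcs += [nn.Linear(in_dim, 512), nn.ReLU()]
        self.frame_fcs = nn.Sequential(*fcs)
        # after pooling, 1 fully connected layer with 1024 units, then a
        # softmax over the languages (output projection is an assumption)
        self.utt_fc = nn.Sequential(nn.Linear(2 * 512, 1024), nn.ReLU(),
                                    nn.Linear(1024, n_langs))

    def forward(self, x):                      # x: (batch, n_feats, frames)
        h = self.convs(x)                      # (batch, 128, frames)
        h = self.frame_fcs(h.transpose(1, 2))  # (batch, frames, 512)
        # temporal pooling: mean and standard deviation over frames
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)  # (batch, 1024)
        return self.utt_fc(stats).softmax(dim=-1)  # per-language posteriors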
“…Wan et al. [21] and Mazzawi et al. [22] also investigate LSTM-based architectures for this dataset. Titus et al. [23] explore the effect of accent on language identification performance and train models robust to accented speech.…”
Section: Related Work
Mentioning confidence: 99%
“…Although E2E multilingual ASR [2][3][4] and language identification [5][6][7][8][9][10][11][12][13][14] can be studied separately, there exists a large body of previous work on using LID to improve multilingual E2E ASR. One body of work shows that using oracle LID information helps multilingual models [15][16][17][18][19][20].…”
Section: Introduction
Mentioning confidence: 99%
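A common way to feed oracle LID information to a multilingual ASR model, as referenced in the statement above, is to condition the encoder on a one-hot language vector. The sketch below is a generic, assumed formulation rather than the specific method of any cited paper; the LSTM encoder and all sizes are placeholders.

import torch
import torch.nn as nn

class LIDConditionedEncoder(nn.Module):
    """Generic sketch: append a one-hot language ID to every acoustic frame
    before encoding. Encoder type and dimensions are assumptions."""
    def __init__(self, n_feats=80, n_langs=4, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(n_feats + n_langs, hidden, batch_first=True)

    def forward(self, feats, lang_id):
        # feats: (batch, frames, n_feats); lang_id: (batch,) integer labels
        n_langs = self.encoder.input_size - feats.size(-1)
        one_hot = nn.functional.one_hot(lang_id, n_langs).float()     # (batch, n_langs)
        one_hot = one_hot.unsqueeze(1).expand(-1, feats.size(1), -1)  # repeat per frame
        return self.encoder(torch.cat([feats, one_hot], dim=-1))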