ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683338
An Investigation of Multilingual ASR Using End-to-end LF-MMI

Abstract: The end-to-end lattice-free maximum mutual information (LF-MMI) approach has recently been shown to be beneficial for automatic speech recognition (ASR) in general. More specifically, its end-to-end nature and use of context independent phone labels make it attractive for multilingual ASR. We show that end-to-end LF-MMI is indeed competitive on a low-resourced multilingual task, comfortably outperforming a connectionist temporal classification (CTC) baseline. We further investigate the feasibility of biphone c…

Cited by 15 publications (13 citation statements) · References 20 publications
“…6 shows the framework of the chain model. It uses sequence-discriminative training, and the objective function used in training is LF-MMI (lattice-free maximum mutual information) [37], [38], which aims to maximize the probability of the target sequence while minimizing the probability of all other sequences:…”
Section: A. ASR Models
confidence: 99%
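The excerpt above ends at a colon where the cited equation was dropped during extraction. For context, the MMI criterion it describes is conventionally written as follows — a standard textbook form, not necessarily the exact notation of the cited paper:

```latex
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{u} \log
    \frac{p(\mathbf{O}_u \mid \mathcal{M}_{w_u})\, P(w_u)}
         {\sum_{w'} p(\mathbf{O}_u \mid \mathcal{M}_{w'})\, P(w')}
```

Here $\mathbf{O}_u$ is the observation sequence of utterance $u$, $w_u$ its reference word sequence, and $\mathcal{M}_w$ the model corresponding to word sequence $w$; the numerator rewards the target sequence while the denominator (computed lattice-free in LF-MMI) sums over all competing sequences.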
“…Earlier studies in multilingual and crosslingual recognition use context-dependent phone units, which leads to an explosion of units and also needs special care to handle context-dependent modeling across languages [26,27]. There are recent attempts to use end-to-end ASR models such as CTC with monophones [8,25] or end-to-end LF-MMI with biphones [28,27] for multilingual and crosslingual recognition. Remarkably, the end-to-end CTC-CRF model, which is defined by a CRF (conditional random field) with CTC topology, has been shown to perform significantly better than CTC [20,21].…”
Section: Related Work
confidence: 99%
“…The reduced target set modeling refers to employing fewer target labels than in the combined target set modeling based E2E code-switching ASR system. Recently, in the context of the multilingual ASR task [45], the authors successfully used the union of the phone sets of the underlying languages as targets for the E2E ASR system, instead of the combined character set. Motivated by that, in an earlier work [28], we had defined a common phone set of 62 labels that covers both the Hindi and English languages.…”
Section: B. Reduced Target Set Modeling
confidence: 99%
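The union-of-phone-sets idea in the excerpt above can be made concrete with a toy sketch. The phone inventories below are hypothetical (not the 62-label set from the cited work); the point is only that taking a set union counts shared phones once, keeping the target inventory smaller than concatenating per-language inventories:

```python
# Hypothetical per-language phone inventories (illustration only).
hindi_phones = {"a", "aa", "k", "kh", "t", "th", "n"}
english_phones = {"a", "k", "t", "n", "s", "z"}

# Union: phones common to both languages become a single shared target,
# unlike concatenation, which would keep duplicates per language.
shared_targets = sorted(hindi_phones | english_phones)
print(len(shared_targets))  # 9 distinct targets, vs. 13 if concatenated
```

In a real system each shared target would tie the output units of the acoustic model across languages, which is what makes the reduced target set attractive for code-switching ASR.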