Interspeech 2009
DOI: 10.21437/interspeech.2009-20

Cross-language bootstrapping for unsupervised acoustic model training: rapid development of a Polish speech recognition system

Abstract: This paper describes the rapid development of a Polish-language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence-based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untran…
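The porting step the abstract describes (a manually constructed phoneme mapping carries the Spanish acoustic model over to Polish) can be sketched roughly as below. This is a minimal illustration only: the identifiers (`PL_TO_ES`, `bootstrap_polish_models`) and the mapping entries are hypothetical, since the paper does not publish its mapping.

```python
import copy

# Hypothetical sketch of cross-language bootstrapping: seed each Polish
# phoneme's acoustic model from a manually mapped Spanish one. The
# mapping entries are illustrative guesses, not the paper's actual table.
PL_TO_ES = {
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",  # shared vowels
    "ń": "ñ",    # Polish /ɲ/ has a close Spanish counterpart
    "cz": "ch",  # Polish /tʂ/ approximated by Spanish /tʃ/
    "sz": "s",   # Polish /ʂ/ falls back to Spanish /s/
}

def bootstrap_polish_models(spanish_models: dict) -> dict:
    """Initialise Polish phoneme models by copying the mapped Spanish models."""
    return {pl: copy.deepcopy(spanish_models[es]) for pl, es in PL_TO_ES.items()}
```

Phonemes with no close Spanish counterpart need a fallback choice, which is presumably why the mapping was constructed by hand rather than derived automatically.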

Citations: cited by 34 publications (9 citation statements, spanning 2011–2022)
References: 12 publications
Citation statements, ordered by relevance:
“…Past research [5,6,7] addressed this problem, finding that existing resources for other languages can be leveraged to pretrain, or bootstrap, an acoustic model, and then adapt it to the target language, given a small quantity of adaptation data.…”
Section: Introduction (confidence: 99%)
“…Unsupervised training is commonly associated in the literature with transcribing speech in a language A, using an ASR system trained in a language B, as in Ragni et al. (2014); Lööf et al. (2009); Qian et al. (2013). However, this need not be the case.…”
Section: Unsupervised Training (confidence: 99%)
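The confidence-based unsupervised training this quote refers to amounts to a decode-filter-retrain loop over the untranscribed audio. A minimal sketch, assuming generic `decode` and `train` callables standing in for a real ASR toolkit, with an arbitrary threshold and iteration count:

```python
# Hypothetical decode-filter-retrain loop for confidence-based
# unsupervised acoustic model training. None of these names come from
# the paper; `decode` and `train` stand in for a real ASR toolkit.
def unsupervised_training(model, untranscribed_audio, decode, train,
                          threshold=0.7, iterations=3):
    """Iteratively transcribe audio with the current model, keep only
    confidently recognised utterances, and retrain on them."""
    for _ in range(iterations):
        selected = []
        for utterance in untranscribed_audio:
            hypothesis, confidence = decode(model, utterance)
            if confidence >= threshold:          # confidence filter
                selected.append((utterance, hypothesis))
        model = train(selected)                  # retrain on auto-labels
    return model
```

The threshold trades training-set size against transcript quality: set too low, recognition errors are fed back into the model; set too high, too little data survives to improve it.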
“…As an alternative to such training of an ASR system from IL speech, we opted for a transfer learning paradigm and started with models trained on one or more higher-resource language(s). Other previous approaches [5,6,7,8] have explored cross-language ASR transfer assuming shared phonemic representations, generally using the GlobalPhone corpus [9], while [10] examines multilingual training of deep neural networks. Unlike these approaches, which had on the order of hours of target-language speech, we are dealing with only minutes of adaptation speech.…”
[Displaced figure caption: Figure 1: Using the English SF-Type classifier to obtain adaptation/training data]
Section: A Small Amount of IL-English Parallel Text (confidence: 99%)