Towards Zero-Shot Learning for Automatic Phonemic Transcription

Li, Xinjian; Dalmia, Siddharth; Mortensen, David R.; Li, Juncheng; Black, Alan W.; Metze, Florian

doi:10.1609/aaai.v34i05.6341

Cited by 24 publications

(17 citation statements)

References 18 publications

“…More broadly there has been extensive work on domain adaptation and transfer learning in machine learning, reviewed by Kouw and Loog [46]. This includes work on few-shot learning [47]-[49] and normalizing flows [50], [51]. Normalizing flows, which provide a probabilistic framework for feature transformations, were first developed for speech recognition as Gaussianization [52], and more recently have been applied to speech synthesis [53] and voice transformation [54].…”

Section: Adaptation and Transfer Learning In Related Fields (mentioning)

confidence: 99%

Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

Bell, Fainberg, Klejch, et al. 2021

IEEE Open J. Signal Process.

“…One critical issue with most multilingual recognition models is that their phone coverage is hardly complete [19]. For example, our trained model could cover around 200 phones, whereas the Phoible inventory has around 2000 distinct phones.…”

Section: Phone Recognition (mentioning)

confidence: 99%

Phone Distribution Estimation for Low Resource Languages

Yao et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Phones are critical components in various computational linguistic fields; for example, phone distributions could be helpful in speech recognition and speech synthesis. Traditional approaches to estimating phone distributions typically involve G2P systems, which are either manually designed by linguists or trained on large datasets. These prohibitive requirements make research on low-resource languages extremely challenging. In this work, we propose a novel approach that estimates phone distributions from raw audio datasets alone. We first estimate the phone ranks by combining language-independent recognition results and Learning to Rank results. Next, we approximate the distribution with Expectation-Maximization by fitting a Yule distribution. The results on 7 languages show that the joint model performs better in both the ranking estimation and distribution estimation tasks.

“…Zero-shot transfer learning addresses this by training a single multilingual model on the labeled data of several languages to enable zero-shot transcription of unseen languages [16,17,18,19,17,20]. Models usually have a common encoder that extracts acoustic information from speech audio and then predict either a shared phoneme vocabulary [17,16] or language-specific phonemes [1,20,21]. The former requires either phonological units that are agnostic to any particular language such as articulatory features [20] or global phones [22,17].…”

Section: Introduction (mentioning)

confidence: 99%

Simple and Effective Zero-shot Cross-lingual Phoneme Recognition

Xu, Baevski, Auli 2021

Preprint

Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
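
The abstract above describes mapping phonemes between training and target languages using articulatory features. A minimal sketch of that general idea follows, assuming a toy binary feature table and a nearest-neighbour rule under Hamming distance. The feature names, the feature values, and the function `map_to_training_inventory` are hypothetical and chosen only for illustration; they are not taken from the paper, which may use a different feature set and distance.

```python
# Hypothetical binary articulatory features, in a fixed order.
FEATURES = ("voiced", "nasal", "labial", "coronal", "dorsal", "continuant")

# Toy feature table (illustrative values only, not from the paper).
PHONE_FEATURES = {
    "p": (0, 0, 1, 0, 0, 0),
    "b": (1, 0, 1, 0, 0, 0),
    "t": (0, 0, 0, 1, 0, 0),
    "n": (1, 1, 0, 1, 0, 0),
    "k": (0, 0, 0, 0, 1, 0),
    "s": (0, 0, 0, 1, 0, 1),
    "z": (1, 0, 0, 1, 0, 1),
    "x": (0, 0, 0, 0, 1, 1),
}


def hamming(a, b):
    """Count the feature positions where two phonemes differ."""
    return sum(x != y for x, y in zip(a, b))


def map_to_training_inventory(target_phones, training_phones):
    """Map each target phoneme to its closest training phoneme.

    Ties are broken alphabetically so the result is deterministic.
    """
    mapping = {}
    for target in target_phones:
        mapping[target] = min(
            training_phones,
            key=lambda source: (
                hamming(PHONE_FEATURES[target], PHONE_FEATURES[source]),
                source,
            ),
        )
    return mapping


if __name__ == "__main__":
    seen = ["p", "t", "k", "s", "n"]   # phonemes covered by training data
    unseen = ["b", "z", "x"]           # phonemes of an unseen language
    print(map_to_training_inventory(unseen, seen))
```

With this toy table, each unseen phoneme is paired with the seen phoneme that differs in the fewest features, so a model that only outputs training-language phonemes could still be scored against a target-language inventory.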
