A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery

Yusuf, Bolaji; Ondel, Lucas; Burget, Lukáš; Černocký, Jan; Saraçlar, Murat

doi:10.1109/icassp39728.2021.9414899

Cited by 7 publications

(15 citation statements)

References 18 publications


“…Finally, several recent studies have attempted a combination of Kempton's approach with the unsupervised clustering approach: cross-lingual ASR is used to annotate the articulatory features of an unknown language, which are then clustered to form unsupervised phonelike units [56,57]. To our knowledge, only two of these papers [71,72] directly evaluated phone inventory NMI or F1; using oracle cluster combination strategies that are standard in the field of unsupervised phone discovery, [72] achieved F1=64.14% for cross-lingual automatic phone inventory estimation.…”

Section: Related Work (mentioning)

confidence: 99%

Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

Żelasko, Feng, Moro-Velázquez, et al. 2022

Preprint


The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not, in order to understand the limitations of, and areas for further improvement in, automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that


“…In [21], a multilingual AUD system is constructed which defines a subspace of AUs which is learned in a supervised way from multilingual data in an attempt to capture the commonalities on what an AU is across different languages. They aim at providing a better prior for the AU learning, while we are concerned with removing speaker dependence.…”

Section: Related Work (mentioning)

confidence: 99%

“…In the case of TIMIT, the proper phone-level transcriptions are used. For Yoruba and Mboshi, forced alignments were provided by the authors of [21]. The databases for AUD are deliberately chosen as in [21] and [2] to provide comparability.…”

Section: Speech Databases (mentioning)

confidence: 99%

“…For each metric, higher values are better. Firstly, the (symmetric) normalized mutual information (NMI) is used as defined in [21]: $\mathrm{NMI} = 200\,\% \cdot \frac{I(U;P)}{H(U)+H(P)}$, where $U$ are the extracted AUs, $P$ are the reference phones, $I(\cdot\,;\cdot)$ is the mutual information between the label sets and $H(\cdot)$ is the entropy. The calculation is based on a frame-wise comparison of all proposed and reference transcriptions, calculating a confusion matrix and estimating the joint probability distribution between the two label sets.…”

Section: Performance Metrics (mentioning)

confidence: 99%
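The frame-wise NMI described in the statement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited authors' implementation: the function names `entropy` and `nmi_percent` are hypothetical, labels are assumed to be given as two equal-length per-frame sequences, and returning 0 when both entropies are zero is a convention chosen here, not something the source specifies.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (in nats) of a label distribution given raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def nmi_percent(units, phones):
    """Symmetric NMI in percent: 200 * I(U;P) / (H(U) + H(P)).

    `units` and `phones` are frame-wise label sequences of equal length
    (hypothetical input format chosen for this sketch).
    """
    if len(units) != len(phones):
        raise ValueError("label sequences must have the same number of frames")
    n = len(units)
    joint = Counter(zip(units, phones))  # frame-wise confusion matrix
    unit_counts = Counter(units)
    phone_counts = Counter(phones)

    # I(U;P) = sum_{u,p} p(u,p) * log( p(u,p) / (p(u) p(p)) )
    mutual_info = sum(
        (c / n) * math.log((c * n) / (unit_counts[u] * phone_counts[p]))
        for (u, p), c in joint.items()
    )
    denom = entropy(unit_counts.values(), n) + entropy(phone_counts.values(), n)
    # Assumption: report 0 when both label sets are constant (denominator is 0).
    return 200.0 * mutual_info / denom if denom > 0 else 0.0


# Units that are a one-to-one relabeling of the phones give the maximum score.
perfect = nmi_percent(["a", "a", "b", "b"], ["x", "x", "y", "y"])
# Units that are statistically independent of the phones give zero.
independent = nmi_percent(["a", "b", "a", "b"], ["x", "x", "y", "y"])
```

Because the ratio of mutual information to entropy is unit-free, the choice of logarithm base does not affect the result.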

“…As the third measure, we calculate the phone boundary F-score, which assesses the agreement of the discovered AU boundaries with the phone boundaries provided by the database (TIMIT) or by forced alignment (Yoruba, Mboshi), using a collar of ±20 ms as in [21]. This metric measures the segmentation performance.…”

Section: Performance Metrics (mentioning)

confidence: 99%
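A boundary F-score with a ±20 ms collar, as mentioned in the statement above, might be computed roughly as follows. The statement does not say how hypothesized and reference boundaries are paired, so this sketch assumes a greedy one-to-one matching over sorted boundary times; the function name `boundary_f_score` and the use of milliseconds as the time unit are likewise assumptions made for illustration.

```python
def boundary_f_score(hyp_ms, ref_ms, collar_ms=20):
    """Precision, recall and F-score of hypothesized boundaries.

    A hypothesized boundary counts as a hit if it lies within `collar_ms`
    of a reference boundary that has not been matched yet (assumed greedy
    one-to-one matching; the source does not specify the pairing rule).
    """
    hyp = sorted(hyp_ms)
    ref = sorted(ref_ms)
    i = j = hits = 0
    while i < len(hyp) and j < len(ref):
        diff = hyp[i] - ref[j]
        if abs(diff) <= collar_ms:
            hits += 1  # matched pair: consume both boundaries
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # hypothesis too early to match any remaining reference
        else:
            j += 1  # reference too early to match any remaining hypothesis
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: two of four hypothesized boundaries fall inside the collar
# of two of the three reference boundaries.
precision, recall, f_score = boundary_f_score(
    hyp_ms=[110, 300, 395, 500], ref_ms=[100, 250, 400]
)
```

Times are kept in integer milliseconds so that the collar comparison is not affected by floating-point rounding at exactly 20 ms.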


Voice Conversion Based Speaker Normalization for Acoustic Unit Discovery

Glarner, Ebbers, Häb-Umbach 2021

Preprint


Discovering speaker-independent acoustic units purely from spoken input is known to be a hard problem. In this work we propose an unsupervised speaker normalization technique prior to unit discovery. It is based on separating speaker-related from content-induced variations in a speech signal with an adversarial contrastive predictive coding approach. This technique requires neither transcribed speech nor speaker labels and, furthermore, can be trained in a multilingual fashion, thus achieving speaker normalization even if only little unlabeled data is available from the target language. The speaker normalization is done by mapping all utterances to a medoid style which is representative of the whole database. We demonstrate the effectiveness of the approach by conducting acoustic unit discovery with a hidden Markov model variational autoencoder, noting, however, that the proposed speaker normalization can serve as a front end to any unit discovery system. Experiments on English, Yoruba and Mboshi show improvements compared to using non-normalized input.


Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

Dunbar, Hamilakis, Dupoux 2022

IEEE J. Sel. Top. Signal Process.


Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks (Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling) and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.
