The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units

Dunbar, Ewan; Karadayi, Julien; Bernard, Mathieu; Cao, Xuan-Nga; Algayres, Robin; Ondel, Lucas; Besacier, Laurent; Sakti, Sakriani; Dupoux, Emmanuel

doi:10.21437/interspeech.2020-2743

Cited by 47 publications

(44 citation statements)

References 9 publications

(19 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…• VQ-VAE [38]: a variational auto-encoder with a quantization layer; variations of this model were successfully used for AUD by several teams in recent iterations of the Zero Resource Challenge [12], [7], [39], [40]. Keeping with our theme of using English as a development language, we tuned the VQ-VAE hyper-parameters to maximize the NMI on English and transferred them to the other languages • constrained VQ-VAE [41]: a recently proposed postprocessing method for VQ-VAE which encourages temporally consecutive frames to be quantized to the same class; this was shown to provide a significant improvement over the the plain VQ-VAE [41] • ResDAVEnet-VQ [14]: neural network with quantization layers trained to correlate images with their associated audio captions; we choose this baseline to compare our method against an AUD system with a weak supervision signal • VQ-WAV2VEC [13]: a convolutional neural network with a quantization layer trained with a contrastive prediction objective on the 960 hour Librispeech corpus [42].…”

Section: F Comparison With Other Methodsmentioning

confidence: 99%

Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

Ondel¹

2021

Preprint

Self Cite

View full text Add to dashboard Cite

This work investigates subspace non-parametric models for the task of learning a set of acoustic units from unlabeled speech recordings. We constrain the base-measure of a Dirichlet-Process mixture with a phonetic subspace---estimated from other source languages---to build an \emph{educated prior}, thereby forcing the learned acoustic units to resemble phones of known source languages. Two types of models are proposed: (i) the Subspace HMM (SHMM) which assumes that the phonetic subspace is the same for every language, (ii) the Hierarchical-Subspace HMM (H-SHMM) which relaxes this assumption and allows to have a language-specific subspace estimated on the unlabeled target data. These models are applied on 3 languages: English, Yoruba and Mboshi and they are compared with various competitive acoustic units discovery baselines. Experimental results show that both subspace models outperform other systems in terms of clustering quality and segmentation accuracy. Moreover, we observe that the H-SHMM provides results superior to the SHMM supporting the idea that language-specific priors are preferable to language-agnostic priors for acoustic unit discovery.

show abstract

Section: F Comparison With Other Methodsmentioning

confidence: 99%

Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

Ondel¹

2021

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…Evaluation of unsupervised features Unsupervised features can be evaluated with two kinds of methods, depending of the end goal of these features. In the zero resource setting [26,27], the aim is to build speech representations without any labels. Distance-based methods like ABX [28,29] or Mean Average Precision [30] evaluate the intrinsic quality of the features without having to retrain the system on any label.…”

Section: Related Workmentioning

confidence: 99%

Towards Unsupervised Learning of Speech Features in the Wild

Rivière

Dupoux

2021

2021 IEEE Spoken Language Technology Workshop (SLT)

Self Cite

View full text Add to dashboard Cite

Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.

show abstract

“…In self-supervised learning for zero-resource speech modeling [15], [23], [24], [29], [38], [39], targets that a model is trained to predict are computed from the data itself [40]. A typical self-supervised representation learning model is the vector-quantized variational autoencoder (VQ-VAE) [15], which achieved a fairly good performance in ZeroSpeech 2017 [41] and 2019 [9], and has become more widely adopted [42]- [44] in the latest ZeroSpeech 2020 challenge [45]. Other selfsupervised learning algorithms such as factorized hierarchical VAE (FHVAE) [46], contrastive predictive coding (CPC) [23] and APC [29] were also extensively investigated in unsupervised subword modeling [30], [42], [47], [48] as well as in a relevant zero-resource word discrimination task [49].…”

Section: A Unsupervised Learning Techniquesmentioning

confidence: 99%

“…The z 1 representation from a well-trained FHVAE is extracted as the desired speaker-invariant phonetic representation for unsupervised subword modeling. The FHVAE model was applied in [10] and achieved good performance in the ZeroSpeech 2019 Challenge [60], which is why we compare the APC model against FHVAE in this study. Details of the FHVAE model description is provided in supplementary material (see Section S1-A).…”

Section: Comparative Approaches 1) Fhvaementioning

confidence: 99%

See 1 more Smart Citation

The Effectiveness of Unsupervised Subword Modeling With Autoregressive and Cross-Lingual Phone-Aware Networks

Feng

Scharenborg

2021

IEEE Open J. Signal Process.

View full text Add to dashboard Cite

This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme-and articulatory feature (AF)-level showed that our approach was better at capturing diphthong than monophthong vowel information, while also differences in the amount of information captured for different types of consonants were observed. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to the phoneme. The AF-level analysis together with t-SNE visualization results showed that the proposed approach is better than MFCC and APC features in capturing manner and place of articulation information, vowel height, and backness information. Taken together, the analyses showed that the two stages in our approach are both effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving capturing monophthong vowel information.

show abstract

The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units

Cited by 47 publications

References 9 publications

Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

Non-Parametric Bayesian Subspace Models for Acoustic Unit Discovery

Towards Unsupervised Learning of Speech Features in the Wild

The Effectiveness of Unsupervised Subword Modeling With Autoregressive and Cross-Lingual Phone-Aware Networks

Contact Info

Product

Resources

About