Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling

Feng, Siyuan; Scharenborg, Odette

doi:10.21437/interspeech.2020-1170

Cited by 6 publications

(35 citation statements)

References 55 publications


“…This demonstrates the effectiveness of the front-end APC pretraining in our proposed two-stage system framework. This observation confirms our earlier findings in recent work [30], in which the training set for APC was unlab-600 (526 hours). The present study shows that when the amount of training material is scaled up to unlab-6K (5,273 hours), APC pretraining brings even greater relative ABX error rate reduction than when trained on unlab-600: the across- and within-speaker relative error rate reductions from M-BNF-Du to A-BNF-Du are 9.5% and 5.5%, respectively, when trained with unlab-6K, while they are 7.6% and 4.8% when trained with unlab-600.…”

Section: B Effectiveness Of the Proposed Approach (supporting)

confidence: 93%
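
The relative error rate reductions quoted in the statement above follow the usual definition: the drop in error divided by the baseline error. A minimal sketch of that arithmetic is given here. The function name `relative_reduction` and the input error rates are hypothetical illustrations; the statement reports only the resulting percentages, not the underlying ABX error rates.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error rate reduction, in percent.

    baseline_error: error rate of the reference system (e.g. M-BNF-Du).
    new_error: error rate of the compared system (e.g. A-BNF-Du).
    Both must be expressed in the same unit (for example, percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical ABX error rates in percent, NOT taken from the paper.
    baseline = 10.0
    improved = 9.05
    print(f"{relative_reduction(baseline, improved):.1f}% relative reduction")
```

A relative reduction is scale-free, which is why the statement can compare the gains obtained with unlab-600 and unlab-6K even though the absolute error rates of the two setups differ.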

“…A typical self-supervised representation learning model is the vector-quantized variational autoencoder (VQ-VAE) [15], which achieved a fairly good performance in ZeroSpeech 2017 [41] and 2019 [9], and has become more widely adopted [42]-[44] in the latest ZeroSpeech 2020 challenge [45]. Other self-supervised learning algorithms such as factorized hierarchical VAE (FHVAE) [46], contrastive predictive coding (CPC) [23] and APC [29] were also extensively investigated in unsupervised subword modeling [30], [42], [47], [48] as well as in a relevant zero-resource word discrimination task [49].…”

Section: A Unsupervised Learning Techniques (mentioning)

confidence: 99%

“…At the second stage, the back-end, a cross-lingual, OOD DNN model with a bottleneck layer (DNN-BNF) is trained using the APC pretrained features as the input features to create the missing (due to the zero-resource assumption) frame labels. This system framework was proposed in our recent study [30], and showed state-of-the-art performances on the subword discriminability task on two databases in UAM: ZeroSpeech 2017 [17] and Libri-light [21].…”

Section: Introduction (mentioning)

confidence: 99%

“…In this work, we expand and extend the work in [30]. Specifically, we (1) compare the proposed approach to a supervised topline system that is trained on transcribed data of the target language; (2) compare the proposed approach with another cross-lingual knowledge transfer method [27];…”

Section: Introduction (mentioning)

confidence: 99%

“…(3) investigate the potential of our approach in relation to the amount of unlabeled training material by varying the data between 500 hours (as used in [30]) and 5,000 hours, and compare the models' performance to the topline model. Throughout our experiments, English is chosen as the target low-resource language.…”

Section: Introduction (mentioning)

confidence: 99%


The Effectiveness of Unsupervised Subword Modeling With Autoregressive and Cross-Lingual Phone-Aware Networks

Feng

Scharenborg

2021

IEEE Open J. Signal Process.

Self Cite


This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive with or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme- and articulatory feature (AF)-level showed that our approach was better at capturing diphthong than monophthong vowel information, and that the amount of information captured also differed across types of consonants. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to the phoneme. The AF-level analysis together with t-SNE visualization results showed that the proposed approach is better than MFCC and APC features in capturing manner and place of articulation information, vowel height, and backness information. Taken together, the analyses showed that the two stages in our approach are both effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving the capture of monophthong vowel information.


Similarity Analysis of Self-Supervised Speech Representations

Chung

Belinkov

Glass

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Self-supervised speech representation learning has recently been a prosperous research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work, we aim to provide a comparative study of some of the most representative self-supervised algorithms. Specifically, we quantify the similarities between different self-supervised representations using existing similarity measures. We also design probing tasks to study the correlation between the models' pretraining loss and the amount of specific speech information contained in their learned representations. In addition to showing how various self-supervised models behave differently given the same input, our study also finds that the training objective has a higher impact on representation similarity than architectural choices such as building blocks (RNN/Transformer/CNN) and directionality (uni/bidirectional). Our results also suggest that there exists a strong correlation between pre-training loss and downstream performance for some self-supervised algorithms.


The effectiveness of self-supervised representation learning in zero-resource subword modeling

Feng

Scharenborg

2021

2021 55th Asilomar Conference on Signals, Systems, and Computers

Self Cite


For a language with no transcribed speech available (the zero-resource scenario), conventional acoustic modeling algorithms are not applicable. Recently, zero-resource acoustic modeling has gained much interest. One research problem is unsupervised subword modeling (USM), i.e., learning a feature representation that can distinguish subword units and is robust to speaker variation. Previous studies showed that self-supervised learning (SSL) has the potential to separate speaker and phonetic information in speech in an unsupervised manner, which is highly desired in USM. This paper compares two representative SSL algorithms, namely, contrastive predictive coding (CPC) and autoregressive predictive coding (APC), as a front-end method of a recently proposed, state-of-the-art two-stage approach, to learn a representation as input to a back-end cross-lingual DNN. Experiments show that the bottleneck features extracted by the back-end achieved state-of-the-art performance in a subword ABX task on the Libri-light and ZeroSpeech databases. In general, CPC is more effective than APC as the front-end in our approach, and this holds regardless of the choice of the out-domain language identity in the back-end cross-lingual DNN and of the training data amount. With very limited training data, APC is found to be similar to or more effective than CPC when test data consists of long utterances.
