Interspeech 2020
DOI: 10.21437/interspeech.2020-1170

Abstract: This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroS…

Citations: cited by 6 publications (35 citation statements)
References: 55 publications (202 reference statements)
“…This demonstrates the effectiveness of the front-end APC pretraining in our proposed two-stage system framework. This observation confirms our earlier findings in recent work [30], in which the training set for APC was unlab-600 (526 hours). The present study shows that when the amount of training material is scaled up to unlab-6K (5,273 hours), APC pretraining brings even greater relative ABX error rate reduction than when trained on unlab-600: the across- and within-speaker relative error rate reductions from M-BNF-Du to A-BNF-Du are 9.5% and 5.5%, respectively, when trained with unlab-6K, while they are 7.6% and 4.8% when trained with unlab-600.…”
Section: B. Effectiveness of the Proposed Approach
Citation type: supporting
confidence: 93%
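The statement above compares systems by relative ABX error rate reduction. As a rough illustration of that arithmetic only, the following minimal Python sketch computes a relative reduction from two error rates. The function name `relative_reduction` and the input values are hypothetical and are not taken from the cited paper, which quotes only the resulting percentages.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error rate reduction, in percent.

    Computed as 100 * (baseline - new) / baseline, so a positive value
    means the new system has a lower error rate than the baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical ABX error rates in percent (illustrative values only,
# NOT results reported in the cited paper).
baseline_abx = 20.0  # e.g. a baseline feature representation
improved_abx = 18.0  # e.g. the same pipeline with pretraining added

reduction = relative_reduction(baseline_abx, improved_abx)
print(f"relative ABX error rate reduction: {reduction:.1f}%")
```

Under this definition, a figure such as the quoted 9.5% would mean that the A-BNF-Du error rate is 9.5% lower than the M-BNF-Du error rate in relative terms, not that the absolute error rate dropped by 9.5 percentage points.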
“…A typical self-supervised representation learning model is the vector-quantized variational autoencoder (VQ-VAE) [15], which achieved a fairly good performance in ZeroSpeech 2017 [41] and 2019 [9], and has become more widely adopted [42]-[44] in the latest ZeroSpeech 2020 challenge [45]. Other self-supervised learning algorithms such as factorized hierarchical VAE (FHVAE) [46], contrastive predictive coding (CPC) [23] and APC [29] were also extensively investigated in unsupervised subword modeling [30], [42], [47], [48] as well as in a relevant zero-resource word discrimination task [49].…”
Section: A. Unsupervised Learning Techniques
Citation type: mentioning
confidence: 99%