Interspeech 2019
DOI: 10.21437/interspeech.2019-1337

Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling

Abstract: This study addresses the problem of unsupervised subword unit discovery from untranscribed speech. It forms the basis of the ultimate goal of ZeroSpeech 2019, building text-to-speech systems without text labels. In this work, unit discovery is formulated as a pipeline of phonetically discriminative feature learning and unit inference. One major difficulty in robust unsupervised feature learning is dealing with speaker variation. Here the robustness towards speaker variation is achieved by applying adversarial …
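The abstract's key mechanism, adversarial training for speaker invariance, is commonly realized with a gradient-reversal layer: the encoder is trained to predict subword targets while being penalized whenever a speaker classifier succeeds on its features. The sketch below (PyTorch) is a generic illustration under that assumption; the layer sizes, the DPGMM-like pseudo-label targets, and the single-layer heads are hypothetical, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales the gradient by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialFeatureExtractor(nn.Module):
    """Encoder whose features predict subword units (hypothetical DPGMM-style
    cluster labels) while an adversarial head tries to predict the speaker."""
    def __init__(self, feat_dim, hid_dim, n_units, n_speakers, lam=1.0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
        )
        self.unit_head = nn.Linear(hid_dim, n_units)    # main task
        self.spk_head = nn.Linear(hid_dim, n_speakers)  # adversary
        self.lam = lam

    def forward(self, x):
        h = self.encoder(x)
        unit_logits = self.unit_head(h)
        # gradient reversal: the encoder is pushed AWAY from speaker cues
        spk_logits = self.spk_head(GradReverse.apply(h, self.lam))
        return unit_logits, spk_logits

# toy usage: one training step on random data (all shapes are illustrative)
model = AdversarialFeatureExtractor(feat_dim=39, hid_dim=256,
                                    n_units=50, n_speakers=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
x = torch.randn(32, 39)               # e.g. MFCC frames
unit_y = torch.randint(0, 50, (32,))  # pseudo subword-unit labels
spk_y = torch.randint(0, 10, (32,))   # speaker labels
unit_logits, spk_logits = model(x)
loss = ce(unit_logits, unit_y) + ce(spk_logits, spk_y)
opt.zero_grad()
loss.backward()
opt.step()
```

Minimizing the speaker loss through the reversed gradient is what makes the shared features speaker-invariant: the speaker head improves, but the encoder is updated in the direction that makes speaker prediction harder.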


Cited by 17 publications (21 citation statements)
References 25 publications (38 reference statements)
“…The z1 representation from a well-trained FHVAE is extracted as the desired speaker-invariant phonetic representation for unsupervised subword modeling. The FHVAE model was applied in [10] and achieved good performance in the ZeroSpeech 2019 Challenge [49], which is why we compare the APC model against FHVAE in this study. Details of the FHVAE model are provided in the supplementary material (see Section S1-A).…”
Section: Comparative Approaches, 1) FHVAE (mentioning)
confidence: 99%
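The FHVAE referenced above disentangles segment-level (phonetic) factors from sequence-level (speaker) factors, and its z1 latent is taken as the speaker-invariant feature. Below is a minimal PyTorch sketch of the inference side only; the layer sizes, latent dimensions, and LSTM choices are assumptions, and the actual model in [10] also includes a decoder, hierarchical priors, and a discriminative segment objective.

```python
import torch
import torch.nn as nn

class FHVAEStyleEncoder(nn.Module):
    """Sketch of an FHVAE-style inference network: z2 captures sequence-level
    (speaker) factors, z1 captures segment-level (phonetic) factors inferred
    conditioned on z2. Dimensions are illustrative, not the cited model's."""
    def __init__(self, feat_dim=80, z1_dim=32, z2_dim=32, hid=256):
        super().__init__()
        self.seq_rnn = nn.LSTM(feat_dim, hid, batch_first=True)
        self.z2_mu = nn.Linear(hid, z2_dim)   # one latent per sequence
        self.z1_rnn = nn.LSTM(feat_dim + z2_dim, hid, batch_first=True)
        self.z1_mu = nn.Linear(hid, z1_dim)   # one latent per frame/segment

    def forward(self, x):                     # x: (batch, time, feat)
        _, (h, _) = self.seq_rnn(x)
        z2 = self.z2_mu(h[-1])                # (batch, z2_dim)
        z2_tiled = z2.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.z1_rnn(torch.cat([x, z2_tiled], dim=-1))
        z1 = self.z1_mu(out)                  # (batch, time, z1_dim)
        return z1, z2

# extract z1 means as speaker-invariant features for a toy batch
enc = FHVAEStyleEncoder()
feats = torch.randn(4, 100, 80)               # e.g. log-Mel frames
z1, z2 = enc(feats)
print(z1.shape)                               # torch.Size([4, 100, 32])
```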
“…The phoneme-level analysis uses the 39 English phonemes in the CMU Dictionary [61]: these are 10 monophthongs, 5 diphthongs, and 24 consonants. Calculation of p_co(ω_i) (see Equation (10)) depends on the ground-truth English phoneme labels and the cross-lingual phone labels. The true English phoneme labels for dev-clean are obtained by carrying out a forced alignment using the English TDNN AM that is described in Section V-B.…”
Section: A. Setup (mentioning)
confidence: 99%
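Equation (10) itself is not quoted here, so the following is only one plausible reading: treat p_co(ω_i) as, for each cross-lingual phone label ω_i, the co-occurrence rate of its most frequent ground-truth English phoneme over frame-aligned label sequences. Everything in this sketch (the function name, the label format, the max-co-occurrence definition) is a hypothetical illustration, not the cited paper's formula.

```python
from collections import Counter

def co_occurrence_probs(phone_labels, phoneme_labels):
    """For each cross-lingual phone label w, estimate the fraction of its
    frames that co-occur with its single most frequent English phoneme.
    A hypothetical reading of p_co; Eq. (10) is not given in the quote."""
    assert len(phone_labels) == len(phoneme_labels)
    pair_counts = Counter(zip(phone_labels, phoneme_labels))
    phone_counts = Counter(phone_labels)
    p_co = {}
    for w in phone_counts:
        best = max(cnt for (p, _), cnt in pair_counts.items() if p == w)
        p_co[w] = best / phone_counts[w]
    return p_co

# toy frame-aligned label sequences
phones = ["a", "a", "b", "b", "b", "a"]
phonemes = ["AA", "AA", "IY", "IY", "EH", "AE"]
print(co_occurrence_probs(phones, phonemes))  # {'a': 0.667, 'b': 0.667}
```

Under this reading, a p_co(ω_i) near 1 would indicate that a cross-lingual phone consistently maps to a single English phoneme, i.e. a phonetically pure label.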
“…Those clusters and speaker IDs are trained with adversarial multi-task learning to obtain a final representation. The system reports a primary representation (FHVAE(b)) and an alternative (FHVAE(a)) [58].…”
Section: E. DPGMM-RNN Hybrid Model in ZeroSpeech 2019 (mentioning)
confidence: 99%
“…Even though our scenario has the same objective as the challenge, our approach is somewhat different. Participants in the challenge are encouraged to develop intra-language unsupervised unit discovery methods, which are more difficult [32,33,34]. Our framework is bootstrapped from a resource-rich language, which is a more practical approach [35,36].…”
Section: Using Voice Conversion System in an Unseen Language (mentioning)
confidence: 99%