Interspeech 2020
DOI: 10.21437/interspeech.2020-2743

The Zero Resource Speech Challenge 2020: Discovering Discrete Subword and Word Units

Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of t…


Cited by 47 publications (44 citation statements). References 9 publications.
“…
• VQ-VAE [38]: a variational auto-encoder with a quantization layer; variations of this model were successfully used for AUD by several teams in recent iterations of the Zero Resource Challenge [12], [7], [39], [40]. Keeping with our theme of using English as a development language, we tuned the VQ-VAE hyper-parameters to maximize the NMI on English and transferred them to the other languages.
• constrained VQ-VAE [41]: a recently proposed post-processing method for VQ-VAE which encourages temporally consecutive frames to be quantized to the same class; this was shown to provide a significant improvement over the plain VQ-VAE [41].
• ResDAVEnet-VQ [14]: a neural network with quantization layers trained to correlate images with their associated audio captions; we chose this baseline to compare our method against an AUD system with a weak supervision signal.
• VQ-WAV2VEC [13]: a convolutional neural network with a quantization layer trained with a contrastive prediction objective on the 960-hour Librispeech corpus [42].…”
Section: F. Comparison With Other Methods (mentioning)
confidence: 99%
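The quantization layer referred to in the statement above is, at its core, a nearest-neighbour codebook lookup that maps each continuous encoder frame to a discrete unit. The following is a minimal sketch of that step only, with assumed shapes and names; it is not the challenge baseline nor any cited system's implementation.

```python
import numpy as np

def quantize(frames, codebook):
    """Nearest-neighbour codebook lookup, as in a VQ-VAE bottleneck.

    frames:   (T, D) continuous encoder outputs for T speech frames
    codebook: (K, D) learned code vectors
    returns:  (T,) discrete unit indices and (T, D) quantized frames
    """
    # squared Euclidean distance from every frame to every code vector
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)  # discrete "subword" labels per frame
    return codes, codebook[codes]

# toy usage with made-up dimensions: 100 frames of 64-dim features, 32 codes
rng = np.random.default_rng(0)
codes, quantized = quantize(rng.normal(size=(100, 64)),
                            rng.normal(size=(32, 64)))
```

Roughly speaking, the constrained variant mentioned in the same statement additionally smooths the resulting label sequence so that temporally consecutive frames tend to share a code.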
“…Evaluation of unsupervised features: Unsupervised features can be evaluated with two kinds of methods, depending on the end goal of these features. In the zero resource setting [26,27], the aim is to build speech representations without any labels. Distance-based methods like ABX [28,29] or Mean Average Precision [30] evaluate the intrinsic quality of the features without having to retrain the system on any labels.…”
Section: Related Work (mentioning)
confidence: 99%
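As a rough illustration of the distance-based evaluation described in that statement, the sketch below scores ABX-style triplets: features for X should lie closer to A (same category, e.g. the same phone in context) than to B (a different category). Real ABX tooling aligns variable-length sequences with DTW and controls for speaker and context; the mean-pooled cosine distance used here is an assumed simplification for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """Distance between two (T, D) feature matrices, mean-pooled over time."""
    a, b = a.mean(axis=0), b.mean(axis=0)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def abx_error_rate(triplets):
    """triplets: iterable of (A, B, X) feature matrices, where A and X
    belong to the same category and B to a different one."""
    errors = [cosine_distance(x, a) >= cosine_distance(x, b)
              for a, b, x in triplets]
    return float(np.mean(errors))  # lower is better; 0.5 is chance level
```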
“…In self-supervised learning for zero-resource speech modeling [15], [23], [24], [29], [38], [39], targets that a model is trained to predict are computed from the data itself [40]. A typical self-supervised representation learning model is the vector-quantized variational autoencoder (VQ-VAE) [15], which achieved fairly good performance in ZeroSpeech 2017 [41] and 2019 [9], and has become more widely adopted [42]-[44] in the latest ZeroSpeech 2020 challenge [45]. Other self-supervised learning algorithms such as the factorized hierarchical VAE (FHVAE) [46], contrastive predictive coding (CPC) [23] and APC [29] were also extensively investigated in unsupervised subword modeling [30], [42], [47], [48] as well as in a relevant zero-resource word discrimination task [49].…”
Section: A. Unsupervised Learning Techniques (mentioning)
confidence: 99%
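For concreteness, the sketch below shows the kind of contrastive objective behind the CPC approach referenced above: a prediction made from past context should score higher against the true future representation than against negatives drawn from other positions. The single-step form, shapes, and names are assumptions for illustration; the cited systems use learned encoders and autoregressive context networks.

```python
import numpy as np

def info_nce_loss(pred, positive, negatives):
    """Single-step InfoNCE-style loss.

    pred:      (N, D) predictions made from past context
    positive:  (N, D) true future representations
    negatives: (N, M, D) distractors sampled from other positions
    """
    pos = (pred * positive).sum(axis=-1, keepdims=True)   # (N, 1)
    neg = np.einsum('nd,nmd->nm', pred, negatives)        # (N, M)
    logits = np.concatenate([pos, neg], axis=1)           # (N, 1 + M)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[:, 0].mean())  # the positive sits at index 0
```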
“…The z1 representation from a well-trained FHVAE is extracted as the desired speaker-invariant phonetic representation for unsupervised subword modeling. The FHVAE model was applied in [10] and achieved good performance in the ZeroSpeech 2019 Challenge [60], which is why we compare the APC model against FHVAE in this study. Details of the FHVAE model are provided in the supplementary material (see Section S1-A).…”
Section: Comparative Approaches, 1) FHVAE (mentioning)
confidence: 99%