2019
DOI: 10.48550/arxiv.1910.12729
Preprint

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Cited by 3 publications (11 citation statements: 0 supporting, 11 mentioning, 0 contrasting) | References 0 publications

“…To perform representation discretization, a learnable codebook E = (e_1, e_2, ..., e_V) of size V is maintained, where each e_i ∈ R^D is called a codeword. For an encoded frame-level representation sequence H, the closest codeword e_v is used as a substitute for each representation h_t; this operation is called phonetic clustering [21]. The gradient of this non-differentiable operation is approximated by the straight-through (ST) gradient estimator [22].…”
Section: Phonetic Encoder (mentioning)
confidence: 99%
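
The phonetic clustering described in this excerpt amounts to nearest-codeword vector quantization with a straight-through gradient. A minimal PyTorch sketch follows; the class name PhoneticQuantizer and the tensor shapes are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn


class PhoneticQuantizer(nn.Module):
    """Nearest-codeword quantization with a straight-through (ST) gradient.

    A hypothetical sketch of the excerpt's phonetic clustering step.
    """

    def __init__(self, num_codewords: int, dim: int):
        super().__init__()
        # Learnable codebook E = (e_1, ..., e_V), each codeword e_i in R^D.
        self.codebook = nn.Embedding(num_codewords, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, D) -- the frame-level representation sequence H.
        # Euclidean distance from every frame h_t to every codeword.
        dist = torch.cdist(h, self.codebook.weight.unsqueeze(0))
        indices = dist.argmin(dim=-1)   # index of the closest codeword
        e = self.codebook(indices)      # substitute e_v for each h_t
        # ST estimator: the forward pass outputs e, while the backward pass
        # copies the gradient of e onto h as if quantization were identity.
        return h + (e - h).detach()
```

Note that the ST pass alone gives the codebook itself no gradient; VQ-VAE-style setups usually add a codebook/commitment loss for that, though the excerpt does not say whether the cited work does.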
“…where the first term is the reconstruction loss of unpaired speech X_unpair, the second term is the CTC loss for Y_pair, the last term is the TTS loss for the target audio X_pair, and λ is fixed to 10 throughout the end-to-end training process. For more details, please refer to the prior work [21].…”
Section: Speech Synthesizer (mentioning)
confidence: 99%
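
The three-term objective quoted above (the equation itself is elided in the excerpt) might be combined as in the sketch below. The function name total_loss, the choice of L1 for the reconstruction and TTS terms, and the placement of the weight λ are all assumptions of this sketch, not details given in the excerpt.

```python
import torch
import torch.nn.functional as F


def total_loss(x_hat_unpair, x_unpair,               # reconstruction pair
               log_probs, y_pair, in_lens, tgt_lens, # CTC inputs
               tts_out, x_pair,                      # TTS pair
               lam: float = 10.0) -> torch.Tensor:
    # Reconstruction loss on unpaired speech X_unpair (L1 is an assumption).
    l_recon = F.l1_loss(x_hat_unpair, x_unpair)
    # CTC loss on paired transcriptions Y_pair; log_probs is (T, N, C).
    l_ctc = F.ctc_loss(log_probs, y_pair, in_lens, tgt_lens)
    # TTS loss against the paired target audio X_pair (L1 assumed again).
    l_tts = F.l1_loss(tts_out, x_pair)
    # The excerpt fixes lambda = 10 but does not say which term(s) it scales;
    # weighting the paired-data terms is an assumption of this sketch.
    return l_recon + lam * (l_ctc + l_tts)
```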