Interspeech 2017
DOI: 10.21437/interspeech.2017-349
Learning Latent Representations for Speech Generation and Transformation

Abstract: An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representation…

Cited by 99 publications (76 citation statements)
References 17 publications (22 reference statements)
“…It can be clearly seen that the proposed CycleVAE-based VC generates latent features with a higher degree of correlation than the conventional VAE. As studied in [32], higher cosine similarities are produced by latent attributes that represent either the same phonetic space or the same speaker identity. Hence, CycleVAE is more likely to give latent representations that are closer to the phonetic domain even across different speaker identities.…”
Section: Objective Evaluation
confidence: 99%
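The cosine-similarity comparison this statement refers to can be sketched in a few lines; the latent vectors below are made-up illustrations, not values from either paper:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two latent vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical latent vectors for the same phone spoken by two speakers:
z_speaker1 = [0.9, 0.1, 0.3]
z_speaker2 = [0.8, 0.2, 0.35]
print(cosine_similarity(z_speaker1, z_speaker2))  # close to 1 when latents align
```

A similarity near 1 would indicate the two latents encode largely the same (e.g. phonetic) attribute, which is the signal the quoted evaluation looks for.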
“…It can be clearly seen that the proposed CycleVAE-based VC generates latent features with higher correlation degree compared to conventional VAE. As studied in [32], higher cosine similarities would be produced by latent attributes that represent either equal phonetic space or equal speaker identities. Hence, Cycle-VAE is more likely to give latent representations that are closer to phonetic domain due to different speaker identities.…”
Section: Objective Evaluationmentioning
confidence: 99%
“…In the third layer we also stride along the frequency axis, but do so only once so as not to lose too much frequency information. The fifth convolutional layer has a filter of size 48 along the frequency axis, which captures frequency relationships over a larger range of the CQT, an approach shown to be successful in [17] and [22]. The error function is the mean-squared error (MSE) between the pitch-shift estimate and the ground truth over the full sequence of notes in a performance.…”
Section: Neural Network Structure
confidence: 99%
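The layer sizing described above can be checked with the standard convolution output-size formula. The CQT height of 192 frequency bins and the kernel/padding values below are hypothetical, chosen only to illustrate the arithmetic of striding once along frequency and then applying a 48-tap frequency filter:

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Output length of a 1-D convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

freq = 192  # hypothetical number of CQT frequency bins
freq = conv_output_size(freq, kernel=5, stride=2, padding=2)
print(freq)  # 96: a single strided layer halves the frequency axis

freq = conv_output_size(freq, kernel=48, stride=1)
print(freq)  # 49: the wide 48-tap filter spans a large frequency range
```

Striding only once keeps most frequency resolution intact, while the wide filter in the later layer still sees harmonic relationships across a large portion of the spectrum.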
“…This work compares different models, including VAEs. However, it works with raw audio instead of MIDI files, since its objective is to model sound rather than music generation. VAEs have also been used for processing speech signals with the aim of modifying certain attributes of the speakers [16] [17]. Variational Autoencoders arose from the evolution of autoencoders [18] [5] [19]. Both techniques aim to encode a set of data into a smaller vector and then reconstruct the original data from that vector.…”
Section: State of the Art
confidence: 99%
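What distinguishes a VAE from the plain autoencoder mentioned above is that the encoder outputs a distribution (mean and log-variance) and the latent vector is sampled via the reparameterization trick. A minimal stdlib-only sketch of that sampling step (illustrative, not the cited papers' implementation):

```python
import math
import random

def reparameterize(mu, log_var, rng=random.Random(0)):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise.

    mu and log_var are the encoder's outputs; sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# With near-zero variance the sample collapses to the mean:
z = reparameterize([0.5, -1.0], [-100.0, -100.0])
print(z)
```

Writing the sample as a deterministic function of (mu, log_var) plus independent noise is what lets gradients flow through the sampling step during training.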