Interspeech 2016 2016
DOI: 10.21437/interspeech.2016-1183
|View full text |Cite
|
Sign up to set email alerts
|

Modeling and Transforming Speech Using Variational Autoencoders

Abstract: Latent generative models can learn higher-level underlying factors from complex data in an unsupervised manner. Such models can be used in a wide range of speech processing applications, including synthesis, transformation and classification. While there have been many advances in this field in recent years, the application of the resulting models to speech processing tasks is generally not explicitly considered. In this paper we apply the variational autoencoder (VAE) to the task of modeling frame-wise spectr… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

1
36
0
1

Year Published

2017
2017
2022
2022

Publication Types

Select...
4
4
2

Relationship

0
10

Authors

Journals

citations
Cited by 51 publications
(38 citation statements)
references
References 12 publications
1
36
0
1
Order By: Relevance
“…Recent works have proven the viability of speech modeling with VAEs [3,4]. A C-VAE that realizes the PGM in Fig.…”
Section: Modeling Speech With a C-vaementioning
confidence: 99%
“…Recent works have proven the viability of speech modeling with VAEs [3,4]. A C-VAE that realizes the PGM in Fig.…”
Section: Modeling Speech With a C-vaementioning
confidence: 99%
“…In the speech and audio domain, VAEs have mainly been used for speech generation and transformation [8]. They have also been used to learn phonetic content or speaker identity in speech segments without supervisory data [7,8]. Moreover, a framework based on VAE was used in [15] to learn both frame-level and utterance-level robust representations.…”
Section: Related Workmentioning
confidence: 99%
“…A VAEbased framework was used in [18] to extract both frame-level and utterance-level features that were used in combination with other features for robust speech recognition. A fully-connected VAE was used in [14] to learn a frame-level latent representation, and evaluated using a Gaussian diffusion process to generate and concatenate multiple samples that varied smoothly in time.…”
Section: Related Workmentioning
confidence: 99%