The Speaker and Language Recognition Workshop (Odyssey 2020) 2020
DOI: 10.21437/odyssey.2020-31
Many-to-Many Voice Conversion Using Cycle-Consistent Variational Autoencoder with Multiple Decoders

Cited by 11 publications (8 citation statements) · References 0 publications
“…Compared with the presented models, the approach developed in this paper is a deterministic uneven autoencoder using a single encoder to create a latent representation of the ADS-B data linked to several decoders, each receiving different data selected by a discriminating feature. This idea was used by Yook et al (2020) to separate the sound received by speakers placed differently but, to the best of our knowledge, was never used in the anomaly detection field. As a result, the latent space created from the single encoder represents the ADS-B data well, while the different specialized decoders capture the information well, addressing the variability of the time series over certain periods of time and resulting in better detection.…”
Section: Machine Learning Based Anomaly Detection Techniquesmentioning
confidence: 99%
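The single-encoder, multiple-decoder idea in the statement above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: linear layers, a shared latent space, and an integer group id as the discriminating feature that routes each sample to its specialized decoder; the layer sizes and names are illustrative, not the cited paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, LATENT_DIM, N_DECODERS = 8, 3, 2

# One shared encoder, several specialized decoders (toy linear maps).
W_enc = rng.normal(size=(LATENT_DIM, IN_DIM)) * 0.1
W_dec = [rng.normal(size=(IN_DIM, LATENT_DIM)) * 0.1 for _ in range(N_DECODERS)]

def encode(x):
    return W_enc @ x                # shared latent representation

def decode(z, group):
    return W_dec[group] @ z         # decoder selected by the discriminating feature

def reconstruct(x, group):
    return decode(encode(x), group)

x = rng.normal(size=IN_DIM)
x_hat = reconstruct(x, group=1)
print(x_hat.shape)                  # (8,)
```

In training, each decoder would only ever see (and be updated on) samples with its own group id, which is how the decoders specialize while the encoder stays common.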
“…Compared with the presented models, the approach developed in this paper is a deterministic uneven autoencoder using a single encoder to create a latent representation of the ADS-B data linked to several decoders, each getting different data chosen thanks to a discriminating feature. This idea was used by Yook et al (2020) to separate the sound received by speakers placed differently but to the best of our knowledge, was never used in the anomaly detection field. As a result, the latent space created from the single encoder well represents the ADS-B data while the different specialized decoders well capture the information, addressing the variability of the time series over certain period of time, resulting into better detection.…”
Section: Machine Learning Based Anomaly Detection Techniquesmentioning
confidence: 99%
“…VAEs can be combined with generative adversarial networks (GANs) [16] to enhance the quality of the converted speech, where the decoder of the VAE is shared with the generator of the GAN [17]. The VAE-GAN can be extended to include the cycle-consistency loss [18], [19] to further improve the voice quality, especially for non-parallel training data. This is known as a cycle-consistent variational autoencoding generative adversarial network (CycleVAE-GAN) [19].…”
Section: B Voice Conversionmentioning
confidence: 99%
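The cycle-consistency loss mentioned above can be illustrated with a toy round trip: convert features from a source speaker to a target speaker and back, and penalize the deviation from the input. The converters here are stand-in invertible linear maps, an assumption for illustration; in CycleVAE-GAN the conversion is performed by the shared VAE decoder / GAN generator.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A_to_B = rng.normal(size=(D, D)) * 0.5
B_to_A = np.linalg.inv(A_to_B)      # toy inverse so the cycle is near-exact

def cycle_loss(x):
    x_ab = A_to_B @ x               # convert source -> target
    x_aba = B_to_A @ x_ab           # convert back target -> source
    return float(np.mean(np.abs(x - x_aba)))  # L1 cycle-consistency term

x = rng.normal(size=D)
loss = cycle_loss(x)
print(loss)                         # ~0 for this invertible toy map
```

With non-parallel training data there is no ground-truth converted target, so this round-trip penalty is what anchors the converted speech to the source content.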
“…CycleGAN and its variations [67,51] have seen considerable success in performing cross-modal unsupervised domain transfer, for medical imaging [19,56] and audio-to-visual translation [17], but often encode information as high-frequency signals that are invisible to the human eye and susceptible to adversarial attacks [6]. An alternative approach involves training a VAE subject to a cycle-consistency condition [26,62], but these works were restricted to domain transfers within a single modality. Most similar is the joint audio and video model proposed by Tian et al [55], which uses a VAE to map between two incompatible latent spaces using supervised alignment of attributes; however, it operates at the word level rather than the phoneme level, and has no mechanism to ensure either temporal smoothness or information throughput.…”
Section: Related Workmentioning
confidence: 99%
“…However, the learned encoders are only approximations to the true distribution and not all points in Z are modelled equally well, leading to information loss. To bridge different latent space structures, we introduce an additional cycle constraint that works without image correspondence and ensures that samples from the audio latent space posterior are reconstructed well, similar in spirit to [26,62]. Figure 4 shows the cyclic chaining from audio latent code to video and back.…”
Section: Linking Audio and Visual Spacesmentioning
confidence: 99%
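The latent-space cycle constraint in the last statement can be sketched as a round trip between two latent spaces: a sample from the audio latent space is chained into the video latent space and back, and the penalty asks that the round trip reconstruct the original audio code without any paired correspondence. The two mappings below are illustrative linear assumptions, not the cited model.

```python
import numpy as np

rng = np.random.default_rng(2)
D_AUDIO, D_VIDEO = 3, 5

M_av = rng.normal(size=(D_VIDEO, D_AUDIO))   # audio latent -> video latent
M_va = np.linalg.pinv(M_av)                  # toy inverse mapping back

def cycle_penalty(z_audio):
    z_video = M_av @ z_audio                 # chain into the video space
    z_back = M_va @ z_video                  # and back to the audio space
    return float(np.sum((z_audio - z_back) ** 2))  # squared cycle error

z = rng.normal(size=D_AUDIO)
penalty = cycle_penalty(z)
```

Because the constraint is expressed purely on latent codes, it regularizes which regions of the audio posterior the video mapping must model well, without needing aligned audio-video pairs.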