Interspeech 2021
DOI: 10.21437/interspeech.2021-562

Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows

Abstract: Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim to learn shared representat…
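The abstract describes learning a shared latent space between phonetic and acoustic representations with a CLIP-style objective. As an illustrative sketch only (the paper's actual encoders and loss details are not given here), a symmetric contrastive loss over a batch of paired embeddings could look as follows; phonetic_emb and acoustic_emb stand in for hypothetical encoder outputs:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(phonetic_emb, acoustic_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired phonetic/acoustic embeddings.

    phonetic_emb, acoustic_emb: (batch, dim) tensors from two separate encoders.
    Matched pairs share the same row index; all other rows act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    p = F.normalize(phonetic_emb, dim=-1)
    a = F.normalize(acoustic_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = p @ a.t() / temperature
    targets = torch.arange(p.size(0), device=p.device)

    # Cross-entropy in both directions: phonetic->acoustic and acoustic->phonetic
    loss_p2a = F.cross_entropy(logits, targets)
    loss_a2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2a + loss_a2p)
```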

Cited by 5 publications (3 citation statements)
References 25 publications (7 reference statements)
“…Notably, MARTA-S' approach registers a modest improvement in BACC, escalating from 89% to 91%, with the additional advantage of using a single integrated model (instead of one per MC). However, a direct comparison of the unsupervised performance of MARTA with that of SCRAPS [31] was not feasible, due to the inaccessibility of SCRAPS as an open-source model. A recent work published in [51] evaluated some speech-based self-supervised embedding methods, such as Wav2Vec [52] or HuBERT [53], in the context of parkinsonian speech.…”
Section: Discussion (mentioning)
confidence: 99%
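The statement above compares classifiers built on self-supervised speech embeddings such as Wav2Vec and HuBERT, scored with balanced accuracy (BACC). As an illustration only (not the cited papers' exact pipeline), a minimal sketch of extracting utterance-level HuBERT embeddings with torchaudio, assuming 16 kHz mono input:

```python
import torch
import torchaudio

# Pretrained HuBERT bundle from torchaudio; expects mono audio at bundle.sample_rate (16 kHz)
bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

@torch.no_grad()
def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool frame-level HuBERT features into one utterance-level vector.

    waveform: tensor of shape (1, num_samples) sampled at bundle.sample_rate.
    """
    features, _ = hubert.extract_features(waveform)  # list of per-layer (1, frames, dim)
    return features[-1].mean(dim=1).squeeze(0)

# A downstream classifier would then be trained on these vectors and scored with
# balanced accuracy (e.g. sklearn.metrics.balanced_accuracy_score); the datasets
# and classifiers used in the cited works are not reproduced here.
```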
“…In a broader context, not dedicated to the analysis of the speech of PD patients, a recent work [31] proposed a CLIP-like [32] model architecture called SCRAPS. This model codifies phonetic and acoustic information into a unified latent space.…”
Section: Introduction (mentioning)
confidence: 99%
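Once such a unified latent space is trained, the agreement between an utterance and a phonetic sequence can be scored directly in that space. A minimal sketch, assuming hypothetical embedding vectors from the two encoders (this is not SCRAPS' published scoring code):

```python
import torch.nn.functional as F

def phonetic_acoustic_match(phonetic_emb, acoustic_emb):
    """Cosine similarity between a phonetic and an acoustic embedding.

    Both arguments are 1-D tensors from the (hypothetical) text and speech
    encoders of a CLIP-like model; a higher value means the acoustics agree
    more closely with the phonetic sequence.
    """
    return F.cosine_similarity(phonetic_emb.unsqueeze(0),
                               acoustic_emb.unsqueeze(0)).item()
```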
“…To reflect such acoustic features in generated speech, there have been many attempts to synthesize speech with rich and diverse prosodic patterns. One of the widely used approaches for prosody modeling is to exploit generative models like VAEs and flow models (Hsu et al., 2019; Lee et al., 2021; Ren et al., 2021b; Valle et al., 2021; Vallés-Pérez et al., 2021). These generative TTS models control the extent of variation in speech by sampling the prior distribution with adequate temperatures.…”
Section: Introduction (mentioning)
confidence: 99%
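The statement above refers to controlling prosodic variation by sampling the latent prior with a temperature. A minimal sketch of that sampling step, assuming a standard-normal prior as in typical flow-based or VAE prosody models (the downstream model call is omitted):

```python
import torch

def sample_prosody_latent(batch_size: int, latent_dim: int,
                          temperature: float = 0.7) -> torch.Tensor:
    """Sample a prosody latent z ~ N(0, temperature^2 * I).

    Lower temperatures keep samples close to the prior mean (more neutral,
    less varied prosody); higher temperatures yield more diverse prosody.
    In a flow-based TTS model z would be passed through the inverse flow;
    in a VAE it would condition the decoder (both calls hypothetical here).
    """
    return torch.randn(batch_size, latent_dim) * temperature
```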