Interspeech 2021
DOI: 10.21437/interspeech.2021-562

Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows

Abstract: Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim to learn shared representat…
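The abstract describes learning a shared latent space between phonetic and acoustic representations with a CLIP-style objective. As an illustrative sketch only (the paper's actual encoders and loss details are not given here), a symmetric contrastive loss over a batch of paired embeddings could look as follows; phonetic_emb and acoustic_emb stand in for hypothetical encoder outputs:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(phonetic_emb, acoustic_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired phonetic/acoustic embeddings.

    phonetic_emb, acoustic_emb: (batch, dim) tensors from two separate encoders.
    Matched pairs share the same row index; all other rows act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    p = F.normalize(phonetic_emb, dim=-1)
    a = F.normalize(acoustic_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = p @ a.t() / temperature
    targets = torch.arange(p.size(0), device=p.device)

    # Cross-entropy in both directions: phonetic->acoustic and acoustic->phonetic
    loss_p2a = F.cross_entropy(logits, targets)
    loss_a2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2a + loss_a2p)
```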

Cited by 5 publications (3 citation statements)
References 25 publications (7 reference statements)
“…Notably, MARTA-S' approach registers a modest improvement in BACC, escalating from 89% to 91%, with the additional advantage of using a single integrated model (instead of one per MC). However, a direct comparison of the unsupervised performance of MARTA with that of SCRAPS [31] was not feasible, due to the inaccessibility of SCRAPS as an open-source model. A recent work published in [51] evaluated some speech-based self-supervised embedding methods, such as Wav2Vec [52] or HuBERT [53], in the context of parkinsonian speech.…”
Section: Discussion (mentioning)
confidence: 99%
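The statement above compares classifiers built on self-supervised speech embeddings such as Wav2Vec and HuBERT, scored with balanced accuracy (BACC). As an illustration only (not the cited papers' exact pipeline), a minimal sketch of extracting utterance-level HuBERT embeddings with torchaudio, assuming 16 kHz mono input:

```python
import torch
import torchaudio

# Pretrained HuBERT bundle from torchaudio; expects mono audio at bundle.sample_rate (16 kHz)
bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

@torch.no_grad()
def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool frame-level HuBERT features into one utterance-level vector.

    waveform: tensor of shape (1, num_samples) sampled at bundle.sample_rate.
    """
    features, _ = hubert.extract_features(waveform)  # list of per-layer (1, frames, dim)
    return features[-1].mean(dim=1).squeeze(0)

# A downstream classifier would then be trained on these vectors and scored with
# balanced accuracy (e.g. sklearn.metrics.balanced_accuracy_score); the datasets
# and classifiers used in the cited works are not reproduced here.
```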
“…In a broader context, not dedicated to the analysis of the speech of PD patients, a recent work [31] proposed a CLIP-like [32] model architecture called SCRAPS. This model codifies phonetic and acoustic information into a unified latent space.…”
Section: Introduction (mentioning)
confidence: 99%
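Once such a unified latent space is trained, the agreement between an utterance and a phonetic sequence can be scored directly in that space. A minimal sketch, assuming hypothetical embedding vectors from the two encoders (this is not SCRAPS' published scoring code):

```python
import torch.nn.functional as F

def phonetic_acoustic_match(phonetic_emb, acoustic_emb):
    """Cosine similarity between a phonetic and an acoustic embedding.

    Both arguments are 1-D tensors from the (hypothetical) text and speech
    encoders of a CLIP-like model; a higher value means the acoustics agree
    more closely with the phonetic sequence.
    """
    return F.cosine_similarity(phonetic_emb.unsqueeze(0),
                               acoustic_emb.unsqueeze(0)).item()
```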
“…To reflect such acoustic features in generated speech, there have been many attempts to synthesize speech with rich and diverse prosodic patterns. One of the widely used approaches for prosody modeling is to exploit generative models like VAEs and flow models (Hsu et al., 2019; Lee et al., 2021; Ren et al., 2021b; Valle et al., 2021; Vallés-Pérez et al., 2021). These generative TTS models control the extent of variation in speech by sampling the prior distribution with adequate temperatures.…”
Section: Introduction (mentioning)
confidence: 99%
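The statement above refers to controlling prosodic variation by sampling the latent prior with a temperature. A minimal sketch of that sampling step, assuming a standard-normal prior as in typical flow-based or VAE prosody models (the downstream model call is omitted):

```python
import torch

def sample_prosody_latent(batch_size: int, latent_dim: int,
                          temperature: float = 0.7) -> torch.Tensor:
    """Sample a prosody latent z ~ N(0, temperature^2 * I).

    Lower temperatures keep samples close to the prior mean (more neutral,
    less varied prosody); higher temperatures yield more diverse prosody.
    In a flow-based TTS model z would be passed through the inverse flow;
    in a VAE it would condition the decoder (both calls hypothetical here).
    """
    return torch.randn(batch_size, latent_dim) * temperature
```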