2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9688088
TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Abstract: Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling speaker identity and speech content using information-constraining bottlenecks. However, due to the pure autoencoder training method, it is difficult to evaluate how well content and speaker identity are actually separated. In this paper, a novel voice conversion framework, named Text Guided Au…
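The bottleneck-based disentanglement the abstract describes can be illustrated with a minimal sketch: a content encoder whose narrow output dimension squeezes speaker information out of the code, and a decoder conditioned on a separately supplied speaker embedding. This is an assumption-level illustration only, not the AutoVC or TGAVC architecture; the `BottleneckAutoencoderVC` class, layer choices, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class BottleneckAutoencoderVC(nn.Module):
    """Hypothetical sketch of bottleneck-based voice conversion.

    The narrow content code forces speaker information out of the
    encoder output; the decoder recovers it from the speaker embedding.
    """
    def __init__(self, n_mels=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.content_encoder = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),          # information-constraining bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck + spk_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, mel, spk_emb):             # mel: (batch, frames, n_mels)
        content = self.content_encoder(mel)      # speaker-independent content code
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.decoder(torch.cat([content, spk], dim=-1))

# Training reconstructs with the source speaker's embedding; at conversion
# time the target speaker's embedding is supplied instead.
model = BottleneckAutoencoderVC()
mel = torch.randn(4, 100, 80)                    # stand-in mel-spectrograms
src_spk, tgt_spk = torch.randn(4, 256), torch.randn(4, 256)
recon = model(mel, src_spk)                      # training path
converted = model(mel, tgt_spk)                  # conversion path
```

The bottleneck width is the critical hyperparameter in this scheme: too wide and source-speaker information leaks through the content code, too narrow and linguistic content is lost, which is exactly why evaluating the separation under pure autoencoder training is hard.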

Cited by 17 publications (4 citation statements).
References 16 publications.
“…(Zhou et al., 2022b) is one of the first works to introduce the ability to simulate emotion intensity and secondary emotions through a rank-based emotion attribute vector. (Tang et al., 2023) represents emotion as a vector embedding extracted from a pretrained speech emotion recognizer, which also allows both characteristics to be simulated by combining the hidden states of the embedding.…”
Section: Emotion Modelling in Text-to-Speech (mentioning)
confidence: 99%
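The embedding-based emotion representation described in the quotation above can be sketched as follows. This is an illustrative assumption, not the model of (Tang et al., 2023); the `SpeechEmotionEncoder` class, its layers, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechEmotionEncoder(nn.Module):
    """Hypothetical pretrained SER backbone; its pooled hidden state
    doubles as the emotion embedding."""
    def __init__(self, n_mels=80, hidden=256, n_emotions=5):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        states, _ = self.rnn(mel)
        embedding = states.mean(dim=1)           # pooled hidden state -> emotion embedding
        return self.classifier(embedding), embedding

# Mixing embeddings simulates intensity and secondary emotions:
enc = SpeechEmotionEncoder()
_, e_primary = enc(torch.randn(1, 120, 80))     # stand-ins for real utterances
_, e_secondary = enc(torch.randn(1, 120, 80))
alpha = 0.3                                      # illustrative intensity weight
mixed = alpha * e_secondary + (1 - alpha) * e_primary
```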
“…As discussed in Section 2.3, there are mainly two candidates for our baselines: (Zhou et al., 2022b) and (Tang et al., 2023). We can only use (Zhou et al., 2022b) as our baseline, since the latter is not open-sourced.…”
Section: Baseline Setup (mentioning)
confidence: 99%
“…In the HiFiSinger method proposed by Chen et al. [3], multi-scale adversarial training was introduced in both the acoustic model and the vocoder to tackle the difficulty of singing modeling caused by the high sampling rate. One difference between singing voice synthesis and speech synthesis is that the prosody information in songs is more complex [9]-[12]. The vocal mechanisms of singing and speech differ, and the pitch is relatively stable in singing.…”
Section: Introduction (mentioning)
confidence: 99%
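The multi-scale adversarial training mentioned in this citation can be sketched with a set of discriminators that judge the same waveform at several temporal resolutions, in the style popularized by MelGAN. This is an illustrative assumption, not HiFiSinger's exact architecture; the class names and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """One discriminator operating at a single temporal resolution."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, stride=1, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=41, stride=4, padding=20),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),   # real/fake score per frame
        )

    def forward(self, wav):                      # wav: (batch, 1, samples)
        return self.layers(wav)

class MultiScaleDiscriminator(nn.Module):
    """Judges the waveform at several scales; the generator must fool all
    of them, which helps with high-sampling-rate synthesis."""
    def __init__(self, n_scales=3):
        super().__init__()
        self.discriminators = nn.ModuleList(ScaleDiscriminator() for _ in range(n_scales))
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

    def forward(self, wav):
        scores = []
        for d in self.discriminators:
            scores.append(d(wav))
            wav = self.pool(wav)                 # halve the temporal resolution
        return scores

scores = MultiScaleDiscriminator()(torch.randn(2, 1, 16000))  # one score map per scale
```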