2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9688088
TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Abstract: Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling speaker identity and speech content using information-constraining bottlenecks. However, due to the pure autoencoder training method, it is difficult to evaluate how well content and speaker identity are actually separated. In this paper, a novel voice conversion framework, named Text Guided Au…
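The bottleneck-based disentanglement the abstract describes can be illustrated with a minimal sketch: a content encoder whose narrow output dimension squeezes speaker information out of the code, and a decoder conditioned on a separately supplied speaker embedding. This is an assumption-level illustration only, not the AutoVC or TGAVC architecture; the `BottleneckAutoencoderVC` class, layer choices, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class BottleneckAutoencoderVC(nn.Module):
    """Hypothetical sketch of bottleneck-based voice conversion.

    The narrow content code forces speaker information out of the
    encoder output; the decoder recovers it from the speaker embedding.
    """
    def __init__(self, n_mels=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.content_encoder = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),          # information-constraining bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck + spk_dim, 512), nn.ReLU(),
            nn.Linear(512, n_mels),
        )

    def forward(self, mel, spk_emb):             # mel: (batch, frames, n_mels)
        content = self.content_encoder(mel)      # speaker-independent content code
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.decoder(torch.cat([content, spk], dim=-1))

# Training reconstructs with the source speaker's embedding; at conversion
# time the target speaker's embedding is supplied instead.
model = BottleneckAutoencoderVC()
mel = torch.randn(4, 100, 80)                    # stand-in mel-spectrograms
src_spk, tgt_spk = torch.randn(4, 256), torch.randn(4, 256)
recon = model(mel, src_spk)                      # training path
converted = model(mel, tgt_spk)                  # conversion path
```

The bottleneck width is the critical hyperparameter in this scheme: too wide and source-speaker information leaks through the content code, too narrow and linguistic content is lost, which is exactly why evaluating the separation under pure autoencoder training is hard.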

Cited by 17 publications (4 citation statements).
References 16 publications.
“…(Zhou et al., 2022b) is one of the first works to introduce the ability to simulate emotion intensity and secondary emotions through a rank-based emotion attribute vector. (Tang et al., 2023) represents emotion as a vector embedding extracted from a pretrained speech emotion recognizer, which also allows both characteristics to be simulated by combining the hidden states of the embedding.…”
Section: Emotion Modelling in Text-to-Speech (mentioning)
confidence: 99%
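The embedding-based emotion representation described in the quotation above can be sketched as follows. This is an illustrative assumption, not the model of (Tang et al., 2023); the `SpeechEmotionEncoder` class, its layers, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechEmotionEncoder(nn.Module):
    """Hypothetical pretrained SER backbone; its pooled hidden state
    doubles as the emotion embedding."""
    def __init__(self, n_mels=80, hidden=256, n_emotions=5):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        states, _ = self.rnn(mel)
        embedding = states.mean(dim=1)           # pooled hidden state -> emotion embedding
        return self.classifier(embedding), embedding

# Mixing embeddings simulates intensity and secondary emotions:
enc = SpeechEmotionEncoder()
_, e_primary = enc(torch.randn(1, 120, 80))     # stand-ins for real utterances
_, e_secondary = enc(torch.randn(1, 120, 80))
alpha = 0.3                                      # illustrative intensity weight
mixed = alpha * e_secondary + (1 - alpha) * e_primary
```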
“…As discussed in Section 2.3, there are mainly two candidates for our baselines: (Zhou et al., 2022b) and (Tang et al., 2023). We can only use (Zhou et al., 2022b) as our baseline, since the latter is not open-sourced.…”
Section: Baseline Setup (mentioning)
confidence: 99%
“…In the HiFiSinger method proposed by Chen et al. [3], multi-scale adversarial training was introduced in both the acoustic model and the vocoder to tackle the difficulty of singing modeling caused by the high sampling rate. One difference between singing voice synthesis and speech synthesis is that the prosody information in songs is more complex [9]-[12]. The vocal mechanisms of singing and speech differ, and the pitch is relatively stable in singing.…”
Section: Introduction (mentioning)
confidence: 99%
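The multi-scale adversarial training mentioned in this citation can be sketched with a set of discriminators that judge the same waveform at several temporal resolutions, in the style popularized by MelGAN. This is an illustrative assumption, not HiFiSinger's exact architecture; the class names and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """One discriminator operating at a single temporal resolution."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, stride=1, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=41, stride=4, padding=20),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),   # real/fake score per frame
        )

    def forward(self, wav):                      # wav: (batch, 1, samples)
        return self.layers(wav)

class MultiScaleDiscriminator(nn.Module):
    """Judges the waveform at several scales; the generator must fool all
    of them, which helps with high-sampling-rate synthesis."""
    def __init__(self, n_scales=3):
        super().__init__()
        self.discriminators = nn.ModuleList(ScaleDiscriminator() for _ in range(n_scales))
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

    def forward(self, wav):
        scores = []
        for d in self.discriminators:
            scores.append(d(wav))
            wav = self.pool(wav)                 # halve the temporal resolution
        return scores

scores = MultiScaleDiscriminator()(torch.randn(2, 1, 16000))  # one score map per scale
```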