2022
DOI: 10.1109/jstsp.2022.3193761

A Comparative Study of Self-Supervised Speech Representation Based Voice Conversion

Abstract: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: in…

Cited by 10 publications
(2 citation statements)
References 68 publications
“…As a follow up to our prior effort, as presented in [17], this work proposes a novel strategy for anonymization via voice conversion, which, instead of manipulating the xvectors, leverages the approach of ContentVec [36] to obtain speaker-independent speech representations and starts from pre-trained models within the S3PRL toolkit [37]. The proposed strategy is evaluated on a public dataset and compared against a variety of neural and signal-processing-based voice conversion methods.…”
Section: Introduction
mentioning
confidence: 99%
“…Specifically, it allows us to evaluate the generative capabilities of pre-trained models, as well as the generalizability of the resulting conversion model. The resulting anonymization task was mainly derived from the setup proposed in [37] for voice conversion. The speech embeddings were computed using the introduced disentanglement mechanism on the WavLM features in the present work.…”
mentioning
confidence: 99%