2019
DOI: 10.48550/arxiv.1905.00590
Preprint

High quality, lightweight and adaptable TTS using LPCNet

Cited by 4 publications (5 citation statements) | References 0 publications
“…The EER measure was computed by employing the speaker verification (SV) network described in [28]. This network was trained on 5994 speakers from the VoxCeleb dataset [29] and reports an EER of 2.21% for the best performing model. In the EER evaluation we paired the 12 synthesised test utterances from each speaker with natural counterparts from the same speaker and also from the other speakers.…”
Section: Evaluation Results (mentioning, confidence: 99%)
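The paired-trial EER evaluation quoted above can be sketched as follows. The `compute_eer` helper and the similarity scores are illustrative assumptions for this sketch, not taken from the cited paper or its SV network; EER is simply the operating point where the false acceptance rate equals the false rejection rate.

```python
# Minimal EER sketch, assuming we already have similarity scores for
# "same speaker" (genuine) and "different speaker" (impostor) trial
# pairs. Score values below are made up for illustration.

def compute_eer(genuine_scores, impostor_scores):
    """Return the equal error rate: the threshold sweep point where
    the false acceptance rate (FAR) equals the false rejection rate
    (FRR), approximated as their mean at the closest crossing."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FAR: impostor pairs wrongly accepted at this threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # FRR: genuine pairs wrongly rejected at this threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

genuine = [0.9, 0.8, 0.85, 0.7, 0.95]   # synthesised vs. same speaker
impostor = [0.2, 0.3, 0.4, 0.75, 0.1]   # synthesised vs. other speakers
print(round(compute_eer(genuine, impostor), 2))  # prints 0.2
```

In practice the scores would come from comparing SV embeddings of each synthesised utterance against natural recordings, as the excerpt describes.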
“…This methodology requires separate acoustic models, and the quality of the output speech is rather poor [3], even though there are methods that aim to mitigate the quality degradation in HMM-based speech synthesis [4]. The development of systems supporting multiple voice identities in deep neural synthesis was first approached by adapting the neural network architecture fully or partially to new target speakers [5]. Other studies have proposed a speaker encoder network trained jointly with the TTS model [6].…”
Section: Introduction (mentioning, confidence: 99%)
“…While previous works in TTS adaptation have well considered the few-adaptation-data setting in custom voice, they have not fully addressed the above challenges. They fine-tune the whole model (Kons et al., 2019) or the decoder part (Moss et al., 2020), achieving good quality but requiring too many adaptation parameters. Reducing the amount of adaptation parameters is necessary for the deployment of commercialized custom voice.…”
Section: Introduction (mentioning, confidence: 99%)
“…We hypothesize this is due to three issues: (1) limitations in the pitch representation used in LPCNet, (2) insufficient disentanglement between pitch and acoustic features, and (3) a lack of training data for very high- and low-pitched speech. Kons et al. [15] sidestep these limitations by generating the input parameters using a separate neural network. However, their approach necessitates training multiple neural networks and does not generalize to unseen speakers without speaker adaptation.…”
Section: Introduction (mentioning, confidence: 99%)