2019
DOI: 10.48550/arxiv.1905.00590
Preprint

High quality, lightweight and adaptable TTS using LPCNet

Cited by 4 publications (5 citation statements) | References 0 publications
“…The EER measure was computed by employing the speaker verification (SV) network described in [28]. This network was trained on 5994 speakers from the VoxCeleb dataset [29] and reports an EER of 2.21% for the best performing model. In the EER evaluation we paired the 12 synthesised test utterances from each speaker with natural counterparts from the same speaker and also from the other speakers.…”
Section: Evaluation Results (mentioning, confidence: 99%)
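The paired-trial EER evaluation quoted above can be sketched as follows. The `compute_eer` helper and the similarity scores are illustrative assumptions for this sketch, not taken from the cited paper or its SV network; EER is simply the operating point where the false acceptance rate equals the false rejection rate.

```python
# Minimal EER sketch, assuming we already have similarity scores for
# "same speaker" (genuine) and "different speaker" (impostor) trial
# pairs. Score values below are made up for illustration.

def compute_eer(genuine_scores, impostor_scores):
    """Return the equal error rate: the threshold sweep point where
    the false acceptance rate (FAR) equals the false rejection rate
    (FRR), approximated as their mean at the closest crossing."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # FAR: impostor pairs wrongly accepted at this threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # FRR: genuine pairs wrongly rejected at this threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

genuine = [0.9, 0.8, 0.85, 0.7, 0.95]   # synthesised vs. same speaker
impostor = [0.2, 0.3, 0.4, 0.75, 0.1]   # synthesised vs. other speakers
print(round(compute_eer(genuine, impostor), 2))  # prints 0.2
```

In practice the scores would come from comparing SV embeddings of each synthesised utterance against natural recordings, as the excerpt describes.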
“…This methodology requires separate acoustic models, and the quality of the output speech is rather poor [3], even though there are methods that aim to mitigate the quality degradation in HMM-based speech synthesis [4]. The development of systems supporting multiple voice identities in deep neural synthesis was first approached by adapting the neural network architecture fully or partially to new target speakers [5]. Other studies have proposed a speaker encoder network trained jointly with the TTS model [6].…”
Section: Introduction (mentioning, confidence: 99%)
“…While previous works in TTS adaptation have well considered the few-adaptation-data setting in custom voice, they have not fully addressed the above challenges. They fine-tune the whole model (Kons et al., 2019) or the decoder part (Moss et al., 2020), achieving good quality but requiring too many adaptation parameters. Reducing the amount of adaptation parameters is necessary for the deployment of commercialized custom voice.…”
Section: Introduction (mentioning, confidence: 99%)
“…We hypothesize this is due to three issues: (1) limitations in the pitch representation used in LPCNet, (2) insufficient disentanglement between pitch and acoustic features, and (3) a lack of training data for very high- and low-pitched speech. Kons et al. [15] sidestep these limitations by generating the input parameters using a separate neural network. However, their approach necessitates training multiple neural networks and does not generalize to unseen speakers without speaker adaptation.…”
Section: Introduction (mentioning, confidence: 99%)