“…Specifically, conventional vocoders [1], [2] based on source-filter models [3], [4], [5] can flexibly control speech characteristics, but the quality of the generated speech is low because of their over-simplified speech production process. Recent high-fidelity neural vocoders [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] lack robustness to unseen data because their training is purely data-driven. For example, the state-of-the-art neural vocoder HiFi-GAN [12] fails to generate high-fidelity speech when the input features include F0 values outside the F0 range of the training data.…”