Introduction

Text-to-speech synthesis (TTS) is an important technique in multilingual spoken language communication. Due to its flexibility and small footprint, statistical parametric speech synthesis (SPSS) has become the mainstream approach in TTS. This paper investigates improving the quality of vocoded speech, which remains a problem in SPSS.

The vocoder in SPSS is the module that converts acoustic features, estimated from linguistic information by acoustic models, into speech waveforms. Vocoders ranging from a simple mel-log spectrum approximation (MLSA) filter with a simple pulse excitation and mel-cepstrum [1] to high-quality ones such as STRAIGHT [2] and WORLD [3] have been investigated. However, these high-quality vocoders are designed to analyze high-quality speech and to resynthesize it, from the full set of acoustic parameters, at the same quality as the original; they are not designed for TTS. To apply these high-quality vocoders to SPSS, the number of parameters must be reduced due to modeling constraints [4]. This reduction deteriorates synthesis quality even if the acoustic model estimates the acoustic parameters perfectly. In other words, speech quality in TTS reaches a ceiling set by the vocoder performance. Herein, a method to raise this upper limit is investigated.

In SPSS, an acoustic model is trained from speech corpora and the maximum likelihood model parameters are estimated. Recently, deep neural networks (DNNs) have been introduced for acoustic model training in SPSS; DNNs improve synthesis accuracy compared to the conventional hidden Markov model (HMM) [5,6]. Additionally, corpus-dependent high-quality vocoders with DNNs have been investigated [7,8], whereas the conventional high-quality vocoders described above [2,3] are corpus-independent.
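To make the source-filter idea behind such vocoders concrete, the following is a minimal sketch: a pulse-train excitation at a fixed F0 is shaped by a spectral envelope to produce a waveform. This is only an illustration of the principle; the actual MLSA filter operates on mel-cepstral coefficients in the time domain, whereas here a flat frequency-domain envelope stands in for it, and all function names and parameter values below are hypothetical, not taken from the paper.

```python
import numpy as np

def pulse_excitation(f0, sr, n_samples):
    """Generate a pulse-train excitation for a constant F0 (Hz)."""
    exc = np.zeros(n_samples)
    period = int(sr / f0)      # samples per glottal period
    exc[::period] = 1.0        # unit pulse at the start of each period
    return exc

def synthesize(exc, envelope):
    """Shape the excitation spectrum with a spectral envelope (stand-in
    for the MLSA/mel-cepstral filter) and return the waveform."""
    spec = np.fft.rfft(exc)
    return np.fft.irfft(spec * envelope, n=len(exc))

sr, f0, n = 16000, 100, 1024       # illustrative sample rate, pitch, length
exc = pulse_excitation(f0, sr, n)
env = np.ones(n // 2 + 1)          # flat envelope: output equals excitation
y = synthesize(exc, env)
```

In a real vocoder the envelope varies frame by frame (derived from mel-cepstrum or a STRAIGHT/WORLD spectrum), and unvoiced frames use noise rather than pulses; the quality limits discussed above stem from how coarsely these components must be parameterized for SPSS.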
Although corpus-dependent high-quality vocoders with DNNs improve speech quality compared to the conventional STRAIGHT vocoder in both HMM- and DNN-based speech synthesis [7], their synthesis quality depends greatly on the estimation accuracy of the glottal closure instants [9]. Neural network-based vocoders such as WaveNet and SampleRNN [8] require a large amount of speech data for high-quality synthesis.