Neural speech-rate conversion with multispeaker WaveNet vocoder (2022)
DOI: 10.1016/j.specom.2022.01.003

Cited by 6 publications (10 citation statements)
References 59 publications
“…Future works may investigate the combined use of phoneme and syllable rates for speaking-rate conversion, which were shown to better correlate with perceived tempo [25]. Also, alternative approaches using neural networks [36,37] may replace the WSOLA algorithm to reduce artifacts.…”
Section: Discussion (mentioning)
confidence: 99%
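The WSOLA (waveform similarity overlap-add) algorithm referenced above changes speaking rate by overlap-adding short input frames, searching near each frame's nominal position for the segment that best continues what has already been written to the output. A minimal NumPy sketch of this idea follows; the frame length, hop, and search tolerance are illustrative defaults, not parameters taken from the cited works.

```python
import numpy as np

def wsola(x, speed, frame_len=1024, synth_hop=512, tolerance=512):
    """Minimal WSOLA time-scale modification of a mono signal x.

    speed > 1 shortens the signal (faster speech), speed < 1 lengthens it.
    """
    window = np.hanning(frame_len)
    ana_hop = max(1, int(round(synth_hop * speed)))      # hop on the input side
    n_frames = (len(x) - frame_len - tolerance) // ana_hop
    if n_frames < 1:
        return x.copy()

    y = np.zeros(n_frames * synth_hop + frame_len)
    norm = np.zeros_like(y)

    # Copy the first frame directly; later frames are chosen so that they
    # best match the "natural continuation" of what was already written.
    y[:frame_len] += window * x[:frame_len]
    norm[:frame_len] += window
    nat_start = min(synth_hop, len(x) - frame_len)

    for m in range(1, n_frames):
        nominal = m * ana_hop
        lo = max(0, nominal - tolerance)
        hi = min(len(x) - frame_len, nominal + tolerance)
        target = x[nat_start:nat_start + frame_len]
        # Cross-correlate the candidate segments against the desired continuation.
        corr = np.correlate(x[lo:hi + frame_len], target, mode="valid")
        best = lo + int(np.argmax(corr))
        out = m * synth_hop
        y[out:out + frame_len] += window * x[best:best + frame_len]
        norm[out:out + frame_len] += window
        nat_start = min(best + synth_hop, len(x) - frame_len)

    norm[norm < 1e-8] = 1.0          # avoid dividing by ~0 at the signal edges
    return y / norm
```

For example, wsola(x, speed=1.5) makes an utterance roughly 1.5 times faster while preserving the local waveform shape; the neural approaches cited above aim to replace this step to reduce its artifacts.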
“…Therefore, a powerful and efficient speaking-rate control method that can be seamlessly integrated into DNN-based speech synthesis models becomes necessary. A DNN-based speaking-rate control method with a multi-speaker WaveNet vocoder [30] was initially proposed, and it outperformed the conventional TSM-based method and source-filter vocoder [31]. However, the inference speed of that method was quite slow due to the auto-regressive structure and the large size of the WaveNet model [23].…”
Section: Introduction (mentioning)
confidence: 99%
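The slowness noted above is inherent to autoregressive vocoding: each waveform sample is conditioned on all previously generated samples, so synthesis proceeds one sample at a time. The schematic PyTorch loop below illustrates the bottleneck; the wavenet module and cond conditioning tensor are assumed interfaces for illustration, not the actual model from [30].

```python
import torch

@torch.no_grad()
def ar_generate(wavenet, cond, n_samples):
    """Sample-by-sample generation with an autoregressive WaveNet-style vocoder.

    The loop cannot be parallelized over time: sample t must be generated
    before sample t+1 can be predicted. At 16-24 kHz this means tens of
    thousands of sequential network evaluations per second of audio.
    """
    samples = torch.zeros(1, 1)                      # seed the waveform with one zero sample
    for _ in range(n_samples):
        logits = wavenet(samples, cond)              # assumed output shape: (1, T, n_quantization_bins)
        nxt = torch.distributions.Categorical(logits=logits[:, -1]).sample()
        # In practice the sampled index would be mu-law decoded; kept symbolic here.
        samples = torch.cat([samples, nxt[:, None].float()], dim=1)
    return samples[:, 1:]
```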
“…The goal of the control is to synthesize speech as if the speaker had uttered it at the specified speaking rate. However, since past studies using existing corpora [31,32] always compared speaking-rate-controlled speech with the original speech, we cannot state how far those control methods are from this goal.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, the synthesis quality of these models is not high. To improve synthesis quality for SR conversion, a neural-network-based approach with the multi-speaker AR WaveNet vocoder [48], in which SR conversion is realized by time-compressing or stretching the acoustic features via sinc-interpolation-based resampling [49], outperforms conventional signal-processing-based models [50]. However, the AR WaveNet vocoder, even using a GPU, cannot realize real-time synthesis.…”
(mentioning)
confidence: 99%
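The sinc-interpolation-based resampling mentioned here time-compresses or stretches the frame-level acoustic features before the vocoder renders them, which is what changes the speaking rate. Below is a minimal NumPy sketch of Whittaker-Shannon (sinc) interpolation along the frame axis; the function name is hypothetical, and a practical resampler would also low-pass filter before compression (rate > 1) to avoid aliasing.

```python
import numpy as np

def sinc_resample_features(feats, rate):
    """Time-compress (rate > 1) or stretch (rate < 1) a frame-level feature
    sequence, e.g. a mel-spectrogram of shape (T, D), by sinc interpolation.

    Minimal sketch: the anti-aliasing low-pass filter that a practical
    resampler applies before compression is omitted.
    """
    T, _ = feats.shape
    T_new = int(round(T / rate))
    t_new = np.arange(T_new) * rate                # fractional positions on the original frame axis
    t_orig = np.arange(T)
    # Whittaker-Shannon interpolation: each output frame is a sinc-weighted
    # combination of all input frames (np.sinc is the normalized sinc).
    weights = np.sinc(t_new[:, None] - t_orig[None, :])   # (T_new, T)
    return weights @ feats                                 # (T_new, D)
```

For example, rate = 1.25 shortens the feature sequence to 80% of its original number of frames, which the vocoder then synthesizes as correspondingly faster speech.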