“…Inspired by the neural excitation generation of differentiable digital signal processing (DDSP) [52] and the neural spectral filtering of NSF, completely differentiable source-filter vocoders with a GAN structure, such as the neural homomorphic vocoder (NHV) [20] and HooliGAN [21], have also been proposed. Furthermore, the authors of HiNet [22] also adopt a deep NN (DNN) model and an NSF model with GAN structures to predict the amplitude spectrum and phase, respectively, for hierarchical speech generation.…”
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model whose generation time is much faster than real time because of its compact model and its non-autoregressive (non-AR), non-causal mechanisms. Although PWG achieves high-fidelity speech generation, its generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency (F0) feature such as a scaled F0. To improve pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information into the network by dynamically changing the network architecture according to the auxiliary F0 feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary F0 feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show its better tractability and interpretability: it models spectral and excitation-like signals with the cascaded fixed and adaptive blocks of the QP structure, respectively.

Index Terms—Neural vocoder, parallel WaveGAN, quasi-periodic WaveNet, pitch-dependent dilated convolution

I. INTRODUCTION

Speech generation is a technique to generate specific speech according to given inputs such as text (text-to-speech, TTS), the speech of a source speaker (voice conversion, VC), and noisy speech (speech enhancement, SE). The core of speech generation is the controllability of speech components, and the fundamental technique is called a vocoder [1]–[3]. A vocoder encodes speech into acoustic …

Manuscript received xxx xx, 2020; revised xxx xx, 2020.
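The key mechanism described above is the pitch-dependent dilated convolution: the dilation at each time step is derived from the auxiliary F0 so that the receptive field tracks the pitch period. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the names (`fs`, `dense_factor`) and the 2-tap causal convolution are illustrative assumptions.

```python
import numpy as np

def pitch_dependent_dilation(f0, fs=24000, dense_factor=4):
    """Per-sample dilation: one pitch period divided by a dense factor."""
    f0 = np.maximum(f0, 1.0)  # guard against F0 = 0 in unvoiced frames
    return np.maximum(np.round(fs / (f0 * dense_factor)).astype(int), 1)

def pdcnn_1d(x, f0, taps, fs=24000, dense_factor=4):
    """Causal 2-tap dilated conv whose dilation varies with F0 per sample."""
    d = pitch_dependent_dilation(f0, fs, dense_factor)
    y = np.zeros_like(x)
    for t in range(len(x)):
        past = x[t - d[t]] if t - d[t] >= 0 else 0.0
        y[t] = taps[0] * x[t] + taps[1] * past
    return y

x = np.random.randn(100).astype(np.float32)
f0 = np.full(100, 200.0)  # flat 200 Hz pitch track
y = pdcnn_1d(x, f0, taps=(0.6, 0.4))
print(y.shape)  # (100,)
```

For a 200 Hz pitch at 24 kHz with a dense factor of 4, each dilated tap looks back 30 samples, i.e. a quarter of a pitch cycle; scaling the input F0 rescales the lookback accordingly, which is what gives the QP structure its pitch controllability.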
“…Next, subjective experiments show that our bandwidth extension method consistently offers significant perceptual quality improvements to the outputs of speech denoising systems including HiFi-GAN [5], DEMUCS [7], and DeepMMSE [8]. It also improves the quality of vocoders including WaveNet [1], WaveRNN [12], and HiNet [13], and could potentially be applied to TTS as well.…”
Section: Introduction
“…We use the same trained model as in the denoising task in Section 4.2 and apply it to the outputs of three vocoding algorithms: WaveNet [1], WaveRNN [12], and HiNet [13]. We obtained their audio samples from HiNet's project website.…”
Section: Bandwidth Extension For Waveform Generation
Speech generation and enhancement have seen recent breakthroughs in quality thanks to deep learning. These methods typically operate at a limited sampling rate of 16–22 kHz due to computational complexity and available datasets. This limitation imposes a gap between the output of such methods and that of high-fidelity (≥44 kHz) real-world audio applications. This paper proposes a new bandwidth extension (BWE) method that expands 8–16 kHz speech signals to 48 kHz. The method is based on a feed-forward WaveNet architecture trained with a GAN-based deep feature loss. A mean-opinion-score (MOS) experiment shows significant improvement in quality over state-of-the-art BWE methods. An AB test reveals that our 16-to-48 kHz BWE achieves fidelity that is typically indistinguishable from real high-fidelity recordings. We use our method to enhance the output of recent speech generation and denoising methods, and experiments demonstrate significant improvement in sound quality over these baselines. We propose this as a general approach to narrowing the gap between generated speech and recorded speech, without the need to adapt such methods to higher sampling rates.
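The "GAN-based deep feature loss" mentioned above compares generated and reference signals through the intermediate activations of a discriminator-like network rather than sample by sample. The following is a small self-contained sketch of that loss under stated assumptions: the three-layer strided conv "discriminator" here is randomly initialised and purely illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def features(x, weights):
    """Activations of each 1-D conv layer (stride 2, ReLU), collected per layer."""
    feats = []
    for w in weights:
        x = relu(np.convolve(x, w, mode="same")[::2])
        feats.append(x)
    return feats

def deep_feature_loss(x_fake, x_real, weights):
    """Sum of mean absolute differences between feature maps at every layer."""
    fs_fake = features(x_fake, weights)
    fs_real = features(x_real, weights)
    return sum(np.mean(np.abs(a - b)) for a, b in zip(fs_fake, fs_real))

weights = [rng.standard_normal(9) for _ in range(3)]  # 3 illustrative conv kernels
x_real = rng.standard_normal(1024)
x_fake = x_real + 0.1 * rng.standard_normal(1024)
print(deep_feature_loss(x_fake, x_real, weights))  # positive for differing inputs
print(deep_feature_loss(x_real, x_real, weights))  # exactly 0.0 for identical inputs
```

Matching activations at several depths penalises perceptually salient structure (envelopes, harmonics) more than raw waveform distance does, which is the motivation for using it in BWE training.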
“…Eunwoo et al. [24] proposed a long short-term memory (LSTM) based recurrent neural network for TTS. Furthermore, many researchers have used other neural networks for TTS [25], [26]. These autoregressive models directly generate raw audio, which makes them expensive and slow.…”
The generative power of generative adversarial networks (GANs) has shown great promise for learning representations from unlabelled data while guided by a small amount of labelled data. We aim to utilise this generative power to learn audio representations. Most existing studies, however, focus on images. Some studies use GANs for speech generation, but they are conditioned on text or acoustic features, limiting their use for other audio, such as musical instruments, and even for speech where transcripts are limited. This paper proposes a novel GAN-based model that we name the Guided Generative Adversarial Neural Network (GGAN), which can learn powerful representations and generate good-quality samples using a small amount of labelled data as guidance. Experimental results on a speech dataset [Speech Commands dataset (S09)] and a non-speech dataset [musical instrument sounds (NSynth)] demonstrate that, using only 5% of the labelled data as guidance, GGAN learns significantly better representations than state-of-the-art models.