“…Inspired by the neural excitation generation of differentiable digital signal processing (DDSP) [52] and the neural spectral filtering of NSF, completely differentiable source-filter vocoders with a GAN structure, such as the neural homomorphic vocoder (NHV) [20] and HooliGAN [21], have also been proposed. Furthermore, the authors of HiNet [22] also adopt a deep NN (DNN) model and an NSF model with GAN structures to predict the amplitude spectrum and phase, respectively, for hierarchical speech generation.…”
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model whose generation time is much faster than real time because of its compact model and its non-autoregressive (non-AR), non-causal mechanisms. Although PWG achieves high-fidelity speech generation, its generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency (F0) feature such as a scaled F0. To improve pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information into the network by dynamically changing the network architecture according to the auxiliary F0 feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary F0 feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show its better tractability and interpretability: it models spectral and excitation-like signals with the cascaded fixed and adaptive blocks of the QP structure, respectively.

Index Terms—Neural vocoder, parallel WaveGAN, quasi-periodic WaveNet, pitch-dependent dilated convolution

I. INTRODUCTION

Speech generation is a technique to generate specific speech according to given inputs such as text (text-to-speech, TTS), the speech of a source speaker (voice conversion, VC), and noisy speech (speech enhancement, SE). The core of speech generation is the controllability of speech components, and the fundamental technique is called a vocoder [1]–[3]. A vocoder encodes speech into acoustic …

Manuscript received xxx xx, 2020; revised xxx xx, 2020.
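The key mechanism described above is the pitch-dependent dilated convolution: the dilation at each time step is derived from the auxiliary F0 so that the receptive field tracks the pitch period. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the names (`fs`, `dense_factor`) and the 2-tap causal convolution are illustrative assumptions.

```python
import numpy as np

def pitch_dependent_dilation(f0, fs=24000, dense_factor=4):
    """Per-sample dilation: one pitch period divided by a dense factor."""
    f0 = np.maximum(f0, 1.0)  # guard against F0 = 0 in unvoiced frames
    return np.maximum(np.round(fs / (f0 * dense_factor)).astype(int), 1)

def pdcnn_1d(x, f0, taps, fs=24000, dense_factor=4):
    """Causal 2-tap dilated conv whose dilation varies with F0 per sample."""
    d = pitch_dependent_dilation(f0, fs, dense_factor)
    y = np.zeros_like(x)
    for t in range(len(x)):
        past = x[t - d[t]] if t - d[t] >= 0 else 0.0
        y[t] = taps[0] * x[t] + taps[1] * past
    return y

x = np.random.randn(100).astype(np.float32)
f0 = np.full(100, 200.0)  # flat 200 Hz pitch track
y = pdcnn_1d(x, f0, taps=(0.6, 0.4))
print(y.shape)  # (100,)
```

For a 200 Hz pitch at 24 kHz with a dense factor of 4, each dilated tap looks back 30 samples, i.e. a quarter of a pitch cycle; scaling the input F0 rescales the lookback accordingly, which is what gives the QP structure its pitch controllability.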
“…Next, subjective experiments show that our bandwidth extension method consistently offers significant perceptual quality improvements to the outputs of speech denoising systems including HiFi-GAN [5], DEMUCS [7], and DeepMMSE [8]. It also improves the quality of vocoders including WaveNet [1], WaveRNN [12], and HiNet [13], and could potentially be applied to TTS as well.…”
Section: Introduction
“…We use the same trained model as in the denoising task in Section 4.2 and apply it to the outputs of three vocoding algorithms: WaveNet [1], WaveRNN [12], and HiNet [13]. We obtained their audio samples from HiNet's project website.…”
Section: Bandwidth Extension For Waveform Generation
Speech generation and enhancement have seen recent breakthroughs in quality thanks to deep learning. These methods typically operate at a limited sampling rate of 16–22 kHz due to computational complexity and available datasets. This limitation imposes a gap between the output of such methods and that of high-fidelity (≥44 kHz) real-world audio applications. This paper proposes a new bandwidth extension (BWE) method that expands 8–16 kHz speech signals to 48 kHz. The method is based on a feed-forward WaveNet architecture trained with a GAN-based deep feature loss. A mean-opinion-score (MOS) experiment shows significant improvement in quality over state-of-the-art BWE methods. An AB test reveals that our 16-to-48 kHz BWE achieves fidelity that is typically indistinguishable from real high-fidelity recordings. We use our method to enhance the output of recent speech generation and denoising methods, and experiments demonstrate significant improvement in sound quality over these baselines. We propose this as a general approach to narrowing the gap between generated speech and recorded speech, without the need to adapt such methods to higher sampling rates.
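The "GAN-based deep feature loss" mentioned above compares generated and reference signals through the intermediate activations of a discriminator-like network rather than sample by sample. The following is a small self-contained sketch of that loss under stated assumptions: the three-layer strided conv "discriminator" here is randomly initialised and purely illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def features(x, weights):
    """Activations of each 1-D conv layer (stride 2, ReLU), collected per layer."""
    feats = []
    for w in weights:
        x = relu(np.convolve(x, w, mode="same")[::2])
        feats.append(x)
    return feats

def deep_feature_loss(x_fake, x_real, weights):
    """Sum of mean absolute differences between feature maps at every layer."""
    fs_fake = features(x_fake, weights)
    fs_real = features(x_real, weights)
    return sum(np.mean(np.abs(a - b)) for a, b in zip(fs_fake, fs_real))

weights = [rng.standard_normal(9) for _ in range(3)]  # 3 illustrative conv kernels
x_real = rng.standard_normal(1024)
x_fake = x_real + 0.1 * rng.standard_normal(1024)
print(deep_feature_loss(x_fake, x_real, weights))  # positive for differing inputs
print(deep_feature_loss(x_real, x_real, weights))  # exactly 0.0 for identical inputs
```

Matching activations at several depths penalises perceptually salient structure (envelopes, harmonics) more than raw waveform distance does, which is the motivation for using it in BWE training.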
“…Eunwoo et al. [24] proposed a long short-term memory (LSTM) based recurrent neural network for TTS. Furthermore, many researchers have used other neural networks for TTS [25], [26]. These autoregressive models directly generate raw audio, which makes them expensive and slow.…”
The generative power of generative adversarial networks (GANs) has shown great promise for learning representations from unlabelled data while guided by a small amount of labelled data. We aim to utilise this generative power to learn audio representations. Most existing studies, however, focus on images. Some studies use GANs for speech generation, but they are conditioned on text or acoustic features, limiting their use for other audio, such as musical instruments, and even for speech where transcripts are limited. This paper proposes a novel GAN-based model that we name the Guided Generative Adversarial Neural Network (GGAN), which can learn powerful representations and generate good-quality samples using a small amount of labelled data as guidance. Experimental results on a speech dataset [Speech Commands dataset (S09)] and a non-speech dataset [musical instrument sounds (NSynth)] demonstrate that, using only 5% of the labelled data as guidance, GGAN learns significantly better representations than state-of-the-art models.