A fixed dimension and perceptually based dynamic sinusoidal model of speech

Hu, Qiong; Stylianou, Yannis; Richmond, Korin; Maia, Ranniery; Yamagishi, Junichi; Latorre, Javier

doi:10.1109/icassp.2014.6854810

Cited by 5 publications (10 citation statements)

References 11 publications (16 reference statements)


“…For the DIR method, we model log|A| and log|B| explicitly. For this, we proposed in [11] synthesis, HDM is used for generating speech, where amplitudes at each harmonic (|A_HDM|, |B_HDM|) are assigned the amplitude of the centre frequency of the critical band in which they lie. Figure 1 gives an overview of both methods for integrating the DSM into DNN-based speech synthesis (see [12] for more detail).…”

Section: Methods For DSM Parameterisation (mentioning)

confidence: 99%
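The statement above describes assigning each harmonic the amplitude found at the centre of the critical band in which it lies. A minimal sketch of that lookup follows, assuming the band edges and per-band centre amplitudes are already available; the function name `assign_band_amplitudes`, the band edges and all numeric values are hypothetical and are not taken from the cited papers.

```python
import numpy as np

def assign_band_amplitudes(harmonic_freqs, band_edges, band_centre_amps):
    """Give each harmonic the amplitude of the band it falls in.

    band_edges holds len(band_centre_amps) + 1 ascending frequencies in Hz,
    so band k covers [band_edges[k], band_edges[k + 1]).
    """
    harmonic_freqs = np.asarray(harmonic_freqs, dtype=float)
    band_edges = np.asarray(band_edges, dtype=float)
    band_centre_amps = np.asarray(band_centre_amps)
    # Index of the band whose interval contains each harmonic frequency.
    idx = np.searchsorted(band_edges, harmonic_freqs, side="right") - 1
    # Harmonics outside the covered range are clamped to the nearest band
    # (an assumption made for this sketch).
    idx = np.clip(idx, 0, len(band_centre_amps) - 1)
    return band_centre_amps[idx]

# Hypothetical example: three bands and a 150 Hz fundamental.
band_edges = [0.0, 200.0, 400.0, 800.0]
band_centre_amps = np.array([1.0, 0.5, 0.25])
harmonics = 150.0 * np.arange(1, 6)  # 150, 300, 450, 600, 750 Hz
amps = assign_band_amplitudes(harmonics, band_edges, band_centre_amps)
```

The same lookup works for complex amplitudes, since indexing does not depend on the array's dtype.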

Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning

Hu, Wu, Richmond et al. 2015

Interspeech 2015

Self Cite


It has recently been shown that deep neural networks (DNN) can improve the quality of statistical parametric speech synthesis (SPSS) when using a source-filter vocoder. Our own previous work has furthermore shown that a dynamic sinusoidal model (DSM) is also highly suited to DNN-based SPSS, whereby sinusoids may either be used themselves as a "direct parameterisation" (DIR), or they may be encoded using an "intermediate spectral parameterisation" (INT). The approach in that work was effectively to replace a decision tree with a neural network. However, waveform parameterisation and synthesis steps that have been developed to suit HMMs may not fully exploit DNN capabilities. Here, in contrast, we investigate ways to combine INT and DIR at the levels of both DNN modelling and waveform generation. For DNN training, we propose to use multi-task learning to model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. Our results show combining these improves modelling accuracy for both tasks. Next, during synthesis, instead of discarding parameters from the second task, a fusion method using harmonic amplitudes derived from both tasks is applied. Preference tests show the proposed method gives improved performance, and that this applies to synthesising both with and without global variance parameters.
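The multi-task setup in this abstract, with cepstra as the primary task and log amplitudes as the secondary task, can be illustrated with a small sketch. It assumes a single shared hidden layer, two linear output heads and a weighted sum of mean squared errors; the layer sizes, the `secondary_weight` parameter and every name below are illustrative assumptions rather than the configuration used in the paper.

```python
import numpy as np

def multitask_forward(x, params):
    """Shared tanh hidden layer followed by two linear output heads."""
    h = np.tanh(x @ params["W_shared"] + params["b_shared"])
    cepstra = h @ params["W_primary"] + params["b_primary"]        # primary task
    log_amps = h @ params["W_secondary"] + params["b_secondary"]   # secondary task
    return cepstra, log_amps

def multitask_loss(pred_primary, tgt_primary, pred_secondary, tgt_secondary,
                   secondary_weight=1.0):
    """Primary MSE plus a weighted secondary MSE (the weighting is an assumption)."""
    primary = np.mean((pred_primary - tgt_primary) ** 2)
    secondary = np.mean((pred_secondary - tgt_secondary) ** 2)
    return primary + secondary_weight * secondary

# Hypothetical sizes: 8 frames, 20 input features, 16 hidden units,
# 5 cepstral coefficients and 6 log amplitudes per frame.
rng = np.random.default_rng(0)
params = {
    "W_shared": rng.normal(scale=0.1, size=(20, 16)), "b_shared": np.zeros(16),
    "W_primary": rng.normal(scale=0.1, size=(16, 5)), "b_primary": np.zeros(5),
    "W_secondary": rng.normal(scale=0.1, size=(16, 6)), "b_secondary": np.zeros(6),
}
x = rng.normal(size=(8, 20))
tgt_cep = rng.normal(size=(8, 5))
tgt_amp = rng.normal(size=(8, 6))
pred_cep, pred_amp = multitask_forward(x, params)
loss = multitask_loss(pred_cep, tgt_cep, pred_amp, tgt_amp, secondary_weight=0.5)
```

Because both heads read the same hidden representation, gradients from the secondary loss also shape the shared weights, which is the mechanism multi-task learning relies on.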


“…number of sinusoids) is higher than typical source-filter ones and varies from frame to frame. To address this problem, a perceptual dynamic sinusoidal model (PDM) [17] has been proposed to generate high quality speech with a fixed and low number of parameters.…”

Section: Introduction (mentioning)

confidence: 99%

“…In addition, [17] has shown that incorporating the dynamic slope of sinusoids can greatly improve quality in copy synthesis. It is natural, therefore, to consider including this dynamic feature for statistical modelling too.…”

Section: Introduction (mentioning)

confidence: 99%

An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis

Stylianou, Maia et al. 2014

Interspeech 2014

Self Cite


This paper applies a dynamic sinusoidal synthesis model to statistical parametric speech synthesis (HTS). For this, we utilise regularised cepstral coefficients to represent both the static amplitude and dynamic slope of selected sinusoids for statistical modelling. During synthesis, a dynamic sinusoidal model is used to reconstruct speech. A preference test is conducted to compare the selection of different sinusoids for cepstral representation. Our results show that when integrated with HTS, a relatively small number of sinusoids selected according to a perceptual criterion can produce quality comparable to using all harmonics. A Mean Opinion Score (MOS) test shows that our proposed statistical system is preferred to one using mel-cepstra from pitch synchronous spectral analysis.


“…For spectral features, either i) 50 regularized discrete cepstra (RDC) extracted from the amplitudes of the harmonic dynamic model (HDM) [24] or ii) 50 highly correlated log amplitudes from the perceptual dynamic sinusoidal model (PDM) [25] are used as real-valued spectral output. 50 complex amplitudes with minimum phase extracted from PDM [19] are applied as complex-valued spectral output. Continuous log F0 and a voiced/unvoiced (vuv) binary value together with either type of these spectral features are used to represent output features (total dimensions: 52).…”

Section: System Configuration (mentioning)

confidence: 99%

“…This is motivated by the fact that for real-valued classification tasks, a CVNN has the same performance as a real-valued NN with a larger number of neurons [18]. Note that speech synthesis is a regression task, which is different from tasks reported in the literature; iii) Complex amplitudes extracted from [19] can be used as complex-valued outputs where phase is composed of linear phase, minimum phase and disperse phase. Here, linear phase should be omitted in the calculation of the amplitude-phase objective function since analysis window position is unrelated to linguistic input.…”

Section: Introduction (mentioning)

confidence: 99%

Initial investigation of speech synthesis based on complex-valued neural networks

Yamagishi, Richmond et al. 2016

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Although frequency analysis often leads us to a speech signal in the complex domain, the acoustic models we frequently use are designed for real-valued data. Phase is usually ignored or modelled separately from spectral amplitude. Here, we propose a complex-valued neural network (CVNN) for directly modelling the results of the frequency analysis in the complex domain (such as the complex amplitude). We also introduce a phase encoding technique to map real-valued data (e.g. cepstra or log amplitudes) into the complex domain so we can use the same CVNN processing seamlessly. In this paper, a fully complex-valued neural network, namely a neural network where all of the weight matrices, activation functions and learning algorithms are in the complex domain, is applied for speech synthesis. Results show its ability to model both complex-valued and real-valued data.

