2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7178814

Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis

Abstract: Deep neural networks (DNNs) use a cascade of hidden representations to enable the learning of complex mappings from input to output features. They are able to learn the complex mapping from text-based linguistic features to speech acoustic features, and so perform text-to-speech synthesis. Recent results suggest that DNNs can produce more natural synthetic speech than conventional HMM-based statistical parametric systems. In this paper, we show that the hidden representation used within a DNN can be improved th…



Cited by 225 publications (164 citation statements)
References 19 publications
“…The use of bottleneck features for modelling these dependencies seems reasonable. Bottleneck features have been widely used in speech recognition [15,16] and text-to-speech systems [17]. They can be used in a similar way for speech-driven head motion synthesis.…”
Section: Bottleneck Features
confidence: 99%
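The bottleneck idea discussed in the excerpt above can be sketched as a stacked network whose activations are read out at a deliberately narrow hidden layer. This is a minimal illustration only; the layer sizes, weights, and helper name are hypothetical and not taken from the cited systems.

```python
import numpy as np

def bottleneck_features(x, layers):
    """Forward an input vector through stacked hidden layers and return
    the activations of the final, narrow (bottleneck) layer.

    x      : (d_in,) input feature vector
    layers : list of (W, b) pairs; the last pair projects into the
             low-dimensional bottleneck layer.
    """
    h = x
    for W, b in layers:
        h = np.tanh(W @ h + b)  # hidden non-linearity
    return h                     # compact learned representation

# Toy example: a 10-dim input squeezed through an 8-unit hidden layer
# into a 3-dim bottleneck (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 10)), np.zeros(8)),
          (rng.standard_normal((3, 8)), np.zeros(3))]
feat = bottleneck_features(rng.standard_normal(10), layers)
```

In practice such features are extracted from a network trained on an auxiliary task (e.g. phone classification) and then reused as compact inputs for the downstream synthesis model.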
“…Therefore, the SGD training can be done with the weighted sum of the task-specific objectives for each training sample, and the language model objective can be thought of as a regularization term. Similar settings of multitask learning for neural network models are employed in phoneme recognition for speech (Seltzer and Droppo, 2013) and speech synthesis (Wu et al., 2015) as well, but both of them use equal weights for all tasks.…”
Section: Related Work
confidence: 99%
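The weighted sum of task-specific objectives described in this excerpt can be written as a one-line combination of per-task losses; equal weights recover the setting attributed to the cited speech systems. The function name and the squared-error losses are illustrative assumptions, not the papers' exact objectives.

```python
import numpy as np

def multitask_loss(preds, targets, weights):
    """Weighted sum of per-task mean squared errors.

    preds, targets : lists of per-task arrays
    weights        : per-task scalar weights (equal weights give the
                     setting described in the excerpt)
    """
    return sum(w * np.mean((p - t) ** 2)
               for w, p, t in zip(weights, preds, targets))

preds   = [np.array([1.0, 2.0]), np.array([0.5])]
targets = [np.array([1.0, 1.0]), np.array([0.0])]
equal   = multitask_loss(preds, targets, [1.0, 1.0])   # 0.5 + 0.25
unequal = multitask_loss(preds, targets, [1.0, 0.1])   # 0.5 + 0.025
```

Down-weighting an auxiliary task (as in `unequal`) is one way to treat its objective as a regularizer rather than an equal training goal.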
“…The database consists of 2400 utterances for training and 70 for testing, recorded at a sample rate of 16 kHz. The input features consist of 160 bottleneck features [23] as a compact, learned linguistic representation. For spectral features, either i) 50 regularized discrete cepstra (RDC) extracted from the amplitudes of the harmonic dynamic model (HDM) [24] or ii) 50 highly correlated log amplitudes from the perceptual dynamic sinusoidal model (PDM) [25] are used as real-valued spectral output.…”
Section: System Configuration
confidence: 99%
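The input/output dimensionalities quoted in the configuration above (160 bottleneck inputs, 50 spectral outputs) can be sanity-checked with a toy sketch. The linear map here is purely illustrative; the actual systems use deep networks, and the random values stand in for learned weights.

```python
import numpy as np

# Dimensions taken from the excerpt: 160 learned linguistic (bottleneck)
# input features per frame, 50 real-valued spectral outputs (RDC or PDM
# log amplitudes).
D_IN, D_OUT = 160, 50

rng = np.random.default_rng(1)
W = rng.standard_normal((D_OUT, D_IN)) * 0.01  # placeholder linear map
x = rng.standard_normal(D_IN)                  # one frame of inputs
y = W @ x                                      # predicted spectral frame
```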