Interspeech 2017
DOI: 10.21437/interspeech.2017-1144
Discrete Duration Model for Speech Synthesis

Abstract: The acoustic model and the duration model are the two major components of statistical parametric speech synthesis (SPSS) systems. The neural network based acoustic model makes it possible to model phoneme duration at the phone level instead of the state level used in conventional hidden Markov model (HMM) based SPSS systems. Since the duration of a phoneme is a countable value, the distribution of phone-level duration given the linguistic features is discrete, which means the Gaussian hypothesis is no longer necessary. Th…
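The idea of treating phone-level duration as a discrete rather than Gaussian variable can be illustrated with a minimal sketch: a softmax over duration classes (frame counts), conditioned on the phone's linguistic features. All names, dimensions, and the linear scoring layer below are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch: phone duration as a discrete class (hypothetical setup).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
max_frames = 50                  # assumed cap on phone duration in frames
linguistic_dim = 8               # assumed linguistic-feature dimensionality

# A random linear scoring layer stands in for a trained duration network.
W = rng.normal(scale=0.1, size=(max_frames, linguistic_dim))
x = rng.normal(size=linguistic_dim)   # linguistic features for one phone

p = softmax(W @ x)                    # P(duration = d | linguistic features)
predicted_frames = int(np.argmax(p)) + 1
```

Because the output is a full distribution over frame counts, prediction can use the mode (as here), the mean, or any other statistic, with no Gaussian assumption imposed.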

Cited by 14 publications (18 citation statements). References 14 publications.
“…Technically, with advances in learning a continuous space vector representation of words in an unsupervised manner [10], large amounts of unannotated data can be used to improve TTS components [11][12][13]. Moreover, bidirectional recurrent neural networks (BRNN) [14][15] and bidirectional long short-term memory are well known to be powerful for sequence modeling [7][16][17]. This leads us to consider a basic approach for predicting duration using phonemes only but with advanced technology for sequence modeling and unsupervised learning.…”
Section: (Unsupervised)
confidence: 99%
“…Recently, the deep neural network technique has grown so fast that it has become the core of most data-driven systems, including TTS systems [3]. Neural network approaches have also been widely adopted to model duration [6][7][4]. However, most approaches [8][3][6][7][4] predict phoneme duration using full-context labels that represent phonemes in context, including linguistic features, such as stress, and positional features, such as the relative positions of different segment levels (phoneme, syllable, and word) inside higher-level segments.…”
Section: Introduction
confidence: 99%
“…Furthermore, recurrent network architectures, such as long short-term memory (LSTM) and bidirectional LSTM (BLSTM), are powerful models for sequential modeling. Therefore, deep LSTM and BLSTM networks are investigated to model the relationship between linguistic features and phoneme duration [24]. The input features could be the same as those used in HTS, since they cover most of the phonological, linguistic and contextual data, with the addition of Arabic-specific features regarding vowel quantity and consonant gemination.…”
Section: Duration Modeling Based on DNN
confidence: 99%
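The BLSTM duration modeling described above maps a sequence of per-phoneme linguistic features to one duration per phoneme by combining forward and backward passes over the sequence. The sketch below uses a tiny untrained vanilla bidirectional RNN (not an actual LSTM) purely to show the shape of that computation; all dimensions and weights are illustrative assumptions.

```python
# Hedged sketch: a tiny bidirectional RNN mapping per-phoneme linguistic
# features to one predicted duration per phoneme (untrained, illustrative).
import numpy as np

def rnn_pass(X, Wx, Wh, reverse=False):
    """One vanilla-RNN pass over the sequence; `reverse` runs backward."""
    T, H = X.shape[0], Wh.shape[0]
    h, out = np.zeros(H), []
    for t in (reversed(range(T)) if reverse else range(T)):
        h = np.tanh(X[t] @ Wx + h @ Wh)
        out.append(h)
    # Re-align backward outputs with forward time order.
    return np.stack(out[::-1] if reverse else out)

rng = np.random.default_rng(1)
T, D, H = 5, 8, 16                    # phonemes, feature dim, hidden size (assumed)
X = rng.normal(size=(T, D))           # linguistic features, one row per phoneme
Wx = rng.normal(scale=0.1, size=(D, H))
Wh = rng.normal(scale=0.1, size=(H, H))
Wo = rng.normal(scale=0.1, size=(2 * H,))

fwd = rnn_pass(X, Wx, Wh)             # forward hidden states, (T, H)
bwd = rnn_pass(X, Wx, Wh, reverse=True)
durations = np.concatenate([fwd, bwd], axis=1) @ Wo  # one value per phoneme
```

The bidirectional structure lets each phoneme's prediction depend on both past and future context, which is why BLSTMs are a natural fit for duration modeling.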