Duration modeling using DNN for Arabic speech synthesis

Zangar, Imene; Mnasri, Zied; Colotte, Vincent; Jouvet, Denis; Houidhek, Amal

doi:10.21437/speechprosody.2018-121

Cited by 12 publications

(7 citation statements)

References 14 publications

“…So far, this work has been confined to American English. We might speculate that duration information will be particularly useful for ASR in languages such as Japanese, Finnish, Estonian and Arabic [21] that have phonemic length.…”

Section: Discussion (mentioning)

confidence: 99%

Neural Network-Based Modeling of Phonetic Durations

2019

A deep neural network (DNN)-based model has been developed to predict non-parametric distributions of durations of phonemes in specified phonetic contexts and used to explore which factors influence durations most. Major factors in US English are pre-pausal lengthening, lexical stress, and speaking rate. The model can be used to check that text-to-speech (TTS) training speech follows the script and words are pronounced as expected. Duration prediction is poorer with training speech for automatic speech recognition (ASR) because the training corpus typically consists of single utterances from many speakers and is often noisy or casually spoken. Low probability durations in ASR training material nevertheless mostly correspond to non-standard speech, with some having disfluencies. Children's speech is disproportionately present in these utterances, since children show much more variation in timing.

“…The prosodic parameters determine speech rhythm and accentuation. For some languages, duration also plays a role in distinguishing the meaning of speech sounds [4]. Therefore, the accurate modeling and prediction of speech-sound duration is important for ensuring that synthetic speech is well perceived.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, the deep neural network technique has grown so fast that it has become the core in most data-driven systems, including TTS systems [3]. Neural network approaches have also been widely adopted to model duration [6] [7] [4]. However, most approaches [8][3] [6] [7] [4] predict phoneme duration using the full context labels that represent phonemes in context, including linguistic features, such as stress, and positional features, such as the relative positions of different segment levels (phoneme, syllable, and word) inside higher-level segments.…”

Section: Introduction (mentioning)

confidence: 99%

“…Neural network approaches have also been widely adopted to model duration [6] [7] [4]. However, most approaches [8][3] [6] [7] [4] predict phoneme duration using the full context labels that represent phonemes in context, including linguistic features, such as stress, and positional features, such as the relative positions of different segment levels (phoneme, syllable, and word) inside higher-level segments. A front-end tool is used to extract the contextual features from text and an embedding layer to represent the linguistic features along with duration model training [7].…”

Section: Introduction (mentioning)

confidence: 99%

Duration Modeling with Global Phoneme-Duration Vectors

Shiga

Kawai

2019

Interspeech 2019

A duration model is a major component in every parametric speech synthesis system. Conventional methods use full contextual labels as features to predict phoneme durations that require morphological analysis of text. By contrast, advances in bidirectional recurrent neural networks (BRNN) and global space vector models make it possible to perform grapheme-to-phoneme (G2P) conversion from plain text. In this paper, we investigate duration prediction from plain phonemes instead of using their full contextual labels. We propose a new approach that relies on both BRNN and global space vector representations of phonemes (GPV) and durations (GDV). GPVs represent the statistics of phonemes used in a language, whereas GDVs capture duration variations beyond linguistic features. They are essentially learned from a large-scale text corpus in an unsupervised manner where phonemes are converted by G2P. We conducted experiments on two speech corpora in Korean and Chinese to train BRNN-based models in a supervised manner. An objective evaluation conducted on a set of test sentences demonstrated that the proposed method leads to more accurate modeling of phoneme durations than the baselines.

“…To cope with these issues, previous works suggested replacing decision trees by DNN [22] or using external models for duration [21]. Results showed that DNN outperformed HMM in terms of speech quality and naturalness of produced speech for English language [23,19].…”

Section: Introduction (mentioning)

confidence: 99%

DNN-Based Speech Synthesis for Arabic: Modelling and Evaluation

Houidhek

Colotte

Mnasri

et al. 2018

Statistical Language and Speech Processing

Self Cite

This paper investigates the use of deep neural networks (DNN) for Arabic speech synthesis. In parametric speech synthesis, whether HMM-based or DNN-based, each speech segment is described with a set of contextual features. These contextual features correspond to linguistic, phonetic and prosodic information that may affect the pronunciation of the segments. Gemination and vowel quantity (short vowel vs. long vowel) are two particular and important phenomena in Arabic language. Hence, it is worth investigating if those phenomena must be handled by using specific speech units, or if their specification in the contextual features is enough. Consequently four modelling approaches are evaluated by considering geminated consonants (respectively long vowels) either as fully-fledged phoneme units or as the same phoneme as their simple (respectively short) counterparts. Although no significant difference has been observed in previous studies relying on HMM-based modelling, this paper examines these modelling variants in the framework of DNN-based speech synthesis. Listening tests are conducted to evaluate the four modelling approaches, and to assess the performance of DNN-based Arabic speech synthesis with respect to previous HMM-based approach.
