Creating synthetic voices for children by adapting adult average voice using stacked transformations and VTLN

Karhila, Reima; Sanand, D. Rama; Kurimo, Mikko; Smit, Peter

doi:10.1109/icassp.2012.6288918

Cited by 5 publications

(10 citation statements)

References 9 publications

“…However, it was refreshing to discover studies focusing on other variants of English such as Irish English [41] and Indian English [19,42]. Although less common, researchers also considered other languages when experimenting with child-speech synthesis, including Norwegian [5], Spanish [18], Punjabi [43], Finnish [44], German [45,46], Czech and Slovak [47], Mandarin [27,31] and quite often, Italian [21,29,30,48–52].…”

Section: Language (mentioning)

confidence: 99%

“…Interestingly, aside from creating a child voice by adapting an average adult voice or an average child voice, Karhila et al's study [44] compared two additional adaptation methods using stacked transformations: StA and StVA. In the first method, StA, an average voice trained from adult data was adapted using training data of the average child voice.…”

Section: Speech-synthesis Systems (mentioning)

confidence: 99%

“…This model was then further adapted to resemble a specific target child speaker. In the second method, StVA, an average voice trained from adult data was adapted using training data of the average child voice, and then VTLN occurred [44]. It was found that stacked transformation systems (StA and StVA) were preferred by listeners and resulted in better adapted voices than directly adapting the average adult voice or the child voice [44].…”

Section: Speech-synthesis Systems (mentioning)

confidence: 99%

“…In addition, many researchers have attempted to build child-speech models by adapting adult-speech models. This has been proven to be a viable method when there are limited speech data available [5,18,23–26,37,44,48]. In a study by Hagen, Pellom and Hacioglu [18], a synthetic children's model was derived without child-speech data by using adult-speech data.…”

Section: Child-speech Data (mentioning)

confidence: 99%

A Situational Analysis of Current Speech-Synthesis Systems for Child Voices: A Scoping Review of Qualitative and Quantitative Evidence

et al. 2022

(1) Background: Speech synthesis has customarily focused on adult speech, but with the rapid development of speech-synthesis technology, it is now possible to create child voices with a limited amount of child-speech data. This scoping review summarises the evidence base related to developing synthesised speech for children. (2) Method: The included studies were those that were (1) published between 2006 and 2021 and (2) included child participants or voices of children aged between 2–16 years old. (3) Results: 58 studies were identified. They were discussed based on the languages used, the speech-synthesis systems and/or methods used, the speech data used, the intelligibility of the speech and the ages of the voices. Based on the reviewed studies, relative to adult-speech synthesis, developing child-speech synthesis is notably more challenging. Child speech often presents with acoustic variability and articulatory errors. To account for this, researchers have most often attempted to adapt adult-speech models, using a variety of different adaptation techniques. (4) Conclusions: Adapting adult speech has proven successful in child-speech synthesis. It appears that the resulting quality can be improved by training a large amount of pre-selected speech data, aided by a neural-network classifier, to better match the children’s speech. We encourage future research surrounding individualised synthetic speech for children with CCN, with special attention to children who make use of low-resource languages.

“…Another technique is adaptive voice conversion, which can be used to dub children's voices in children's movie applications. Techniques for creating or generating children's voices have been proposed in the study [12], [13], [14], [15]. Watts et al [14], [15] proposed the Hidden Markov Model (HMM) as a basis for a method of synthesizing children's voices.…”

Section: Introduction (mentioning)

confidence: 99%

Voice Conversion for Dubbing Using Linear Predictive Coding and Hidden Markov Model

Mukhneri, Wijayanto, Hadiyoso

2020

Journal of Southwest Jiaotong University

Dubbing is a term used to describe filling in the sound on film or video. Voice conversion can be done to support dubbing, for purposes such as obtaining a child’s voice for dubbing on children’s films. However, problems frequently occur with this process, including difficulty finding children’s voice resources and difficulty getting children to express the desired tone and mood while recording. Therefore, in this study, we propose a method for creating a cross-gender and age voice conversion from adult voices to children’s voices. The feature extraction method that is used is Linear Predictive Coding, and the modeling method is the Hidden Markov Model. The parts synthesized are fundamental frequency (F0) and spectral content. From the simulation test, the best results for the voice conversion are achieved by Linear Predictive Coding order 19. The best state of Hidden Markov Model modeling is the 5th state. F0 Root Mean Square Error of adult men to children after the conversion increased by 57.7%, while the F0 Root Mean Square Error of adult women to children after the conversion increased by 15.29%. Root Mean Square Error Cepstral after conversion increased by 43.69%. A subjective test was also performed in terms of the mean opinion score. In terms of similarities, mean opinion score testing for Hidden Markov Model has an average value of 2.64, and in terms of quality, testing mean opinion score for Hidden Markov Model has an average value of 3.23. It is hoped that this proposed method can be used in real terms for dubbing in the film industry, especially for Indonesian dialogue.

Combining Vocal Tract Length Normalization With Hierarchical Linear Transformations

Saheer, Yamagishi, Garner et al. 2014

IEEE J. Sel. Top. Signal Process.
