2011
DOI: 10.1109/tasl.2010.2052246

Emotional Audio-Visual Speech Synthesis Based on PAD

Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while g…

Cited by 52 publications (41 citation statements)
References 24 publications
“…Results show that even though the SVR provides the best performance in the validation of each SSRM for both arousal and valence, the PLS algorithm is more robust to overfitting and thus produces significantly improved performance. Our conclusion is that weak predictors are indeed more suitable to perform boosting than more sophisticated algorithms [50].…”
Section: Comparison Between PLS and SVR (mentioning)
confidence: 75%
“…Results confirm that the prediction of arousal from acoustic features provides significantly better results than for valence. The combination of weak predictors (PLS) in the CRM, which is similar to a boosting strategy [50], provides a performance that is comparable with the one obtained with more complex machine learning methods that are trained on a full set of speakers [26], [37].…”
Section: Overall Performance of the CRM (mentioning)
confidence: 99%
“…However, the method is presented in accordance with the project about the analysis of the net-mediated public sentiment, so the universality of the method is insufficient, and it is not recommended to apply the method to extract information from non-news webpages. We will set out to extract news videos and pictures from news webpages [6] in the future work.…”
Section: Results (mentioning)
confidence: 99%
“…The PAD model essentially shades the modeling of head and facial gestures from the high-level text semantics, so that we can focus on mapping the PAD descriptors to visual motion features. Toward the PAD parameterization for input text, we adopt the heuristics that are proposed in the PAD based expressive text-to-speech synthesis [24,38]. To extend our approach to talking avatar in other languages, similar PAD parameterizations need to be devised according to the specific language.…”
Section: Discussion (mentioning)
confidence: 99%