Proceedings of the 18th International Conference on Intelligent Virtual Agents 2018
DOI: 10.1145/3267851.3267898

Investigating the use of recurrent motion modelling for speech gesture generation

Abstract: The growing use of virtual humans demands generating increasingly realistic behavior for them while minimizing cost and time. Gestures are a key ingredient for realistic and engaging virtual agents, and consequently automated gesture generation has been a popular area of research. So far, good gesture generation has relied on explicit formulation of if-then rules and probabilistic modelling of annotated features. Machine learning approaches have yielded only marginal success, indicating a high complexity of t…

Cited by 95 publications (79 citation statements)
References 36 publications (43 reference statements)

“…Yoon et al. [48] include a velocity penalty in training that discourages jerky motion. The recurrent connections used in several models [13,19,48] can also act as a pose memory that may help the model to produce smooth output motion. Autoregressive motion models have recently demonstrated promising results in probabilistic audio-driven gesture generation [2].…”
Section: Regarding Motion Continuity (mentioning)
confidence: 99%
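
To make the excerpt concrete, here is a minimal sketch of how a velocity penalty of the kind attributed to Yoon et al. [48] could be written, assuming PyTorch and pose tensors of shape (batch, time, pose_dims); the weighting and exact form are illustrative, not the cited implementation.

```python
import torch

def gesture_loss(pred, target, vel_weight=0.1):
    """Pose reconstruction loss with a velocity penalty against jerky motion.

    pred, target: (batch, time, pose_dims) joint tensors.
    vel_weight is an assumed value; the cited works tune this term differently.
    """
    # Standard per-frame pose reconstruction term.
    pose_loss = torch.mean((pred - target) ** 2)

    # Frame-to-frame differences approximate joint velocities.
    pred_vel = pred[:, 1:] - pred[:, :-1]
    target_vel = target[:, 1:] - target[:, :-1]

    # Penalising the velocity error discourages abrupt frame-to-frame jumps.
    vel_loss = torch.mean((pred_vel - target_vel) ** 2)
    return pose_loss + vel_weight * vel_loss
```

The same frame-differencing idea is what makes recurrent connections useful as a "pose memory": the network carries the previous frames' state forward, which tends to smooth the output.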
“…(1) the first data-driven model that maps speech acoustic and semantic features into continuous 3D gestures; (2) a comparison contrasting the effects of different architectures and important modelling choices; (3) objective and subjective evaluations of the effect of the two speech modalities, audio and semantics, on the resulting gestures. We additionally extend a publicly available corpus of 3D co-speech gestures, the Trinity College dataset [13], with manual text transcriptions. Video samples from our evaluations are provided at vimeo.com/showcase/6737868.…”
(mentioning)
confidence: 99%
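
The mapping this excerpt describes, from per-frame acoustic and semantic features to continuous 3D poses, could look roughly like the sketch below; all feature dimensions, the GRU decoder, and the upsampling of word embeddings to the audio frame rate are assumptions for illustration, not the cited model's architecture.

```python
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    """Illustrative encoder-decoder from per-frame speech features to 3D poses.

    All dimensions are placeholders, not the cited model's configuration.
    """
    def __init__(self, audio_dim=26, text_dim=300, hidden_dim=256, pose_dim=45):
        super().__init__()
        self.encoder = nn.Linear(audio_dim + text_dim, hidden_dim)
        # A recurrent decoder lets each output pose depend on preceding frames,
        # which also helps the motion continuity discussed above.
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, audio, text):
        # audio: (batch, time, audio_dim); text: (batch, time, text_dim),
        # e.g. word embeddings upsampled to the audio frame rate.
        x = torch.relu(self.encoder(torch.cat([audio, text], dim=-1)))
        h, _ = self.decoder(x)
        return self.out(h)
```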
“…With the rise of machine learning, numerous network types have been investigated, including variations of hidden Markov models [3,23], conditional random fields [7,22], and restricted Boltzmann machines [6]. In recent work, recurrent neural networks have proven popular; a classic training loss has been employed for English [12,21] and Japanese speech-to-gesture generation [18,20]. To combat the problem of mean pose regression in a standard training paradigm, an adversarial training paradigm has been proposed in [14] (similarly for a convolutional network setup in [15]), and recently, probabilistic generative modelling has shown promise [1].…”
Section: Related Work (mentioning)
confidence: 99%
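
As a rough illustration of the adversarial idea this excerpt mentions for combating mean-pose regression: a standard reconstruction loss alone pulls the generator toward the average pose of the data, while an added adversarial term rewards motion a discriminator judges realistic. The hypothetical frame-wise discriminator, weighting, and loss form below are all assumptions, not the setup of [14] or [15].

```python
import torch
import torch.nn as nn

# Hypothetical frame-wise discriminator scoring poses as real vs. generated.
disc = nn.Sequential(
    nn.Linear(45, 128),
    nn.LeakyReLU(),
    nn.Linear(128, 1),
)

def generator_loss(pred, target, adv_weight=0.05):
    """Reconstruction plus an adversarial term: the generator is rewarded for
    motion the discriminator finds realistic, instead of being pushed toward
    the dataset's mean pose by the reconstruction term alone."""
    recon = torch.mean((pred - target) ** 2)
    # Non-saturating GAN objective on generated frames.
    adv = -torch.mean(torch.log(torch.sigmoid(disc(pred)) + 1e-8))
    return recon + adv_weight * adv
```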
“…During network training (Sec. 4), we include dataset B, the open-source Trinity Speech-Gesture dataset [12], a similar corpus of 4 hours of speech and motion data of a different male English speaker (also right-handed). We find that including this dataset improves performance.…”
Section: Dataset and Processing (mentioning)
confidence: 99%
“…For instance, in the Kopp, Bergmann, and Wachsmuth system [64], gesture form was not fixed, but produced from multimodal representations of objects, locations and their spatial relations. More recently, Ferstl and McDonnell [35] used a recurrent neural network to produce gesture motion directly from prosodic speech features. Interestingly, the network was first pre-trained with a motion modelling task before training the final speech-to-gesture model.…”
Section: Can Embodied Agents Produce Multimodal Cues? (mentioning)
confidence: 99%
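
The two-stage recipe this excerpt summarises, pre-training on motion alone and then conditioning on prosodic speech features, might look like the outline below; the dimensions, the GRU choice, and the zeroed-speech pre-training trick are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class MotionRNN(nn.Module):
    """Autoregressive motion model that can optionally be conditioned on speech."""
    def __init__(self, pose_dim=45, speech_dim=4, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(pose_dim + speech_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, prev_poses, speech):
        # prev_poses: (batch, time, pose_dim) preceding frames;
        # speech: (batch, time, speech_dim) prosodic features (e.g. F0, intensity).
        h, _ = self.rnn(torch.cat([prev_poses, speech], dim=-1))
        return self.out(h)

model = MotionRNN()
# Stage 1: pre-train as a pure motion model with the speech input zeroed out,
# so the network first learns plausible pose dynamics on their own.
# Stage 2: fine-tune with real prosodic features, so the learned dynamics
# become conditioned on speech.
```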