Proceedings of the 2020 International Conference on Multimodal Interaction
DOI: 10.1145/3382507.3418815

Gesticulator: A framework for semantically-aware speech-driven gesture generation

Abstract: During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they ca…

Cited by 119 publications (121 citation statements)
References 37 publications
“…Previous studies suggest that motion quality (human-likeness) may influence gesture appropriateness ratings in subjective evaluations [31,61]. Our experiments only partly managed to separate these two aspects of gesture perception.…”
Section: Discussion of the Challenge Results (mentioning)
confidence: 75%
“…The distance between speed histograms has also been used to evaluate gesture quality [29,31], since well-trained models should produce motion with similar properties to that of the actor it was trained on. In particular, it should have a similar motion-speed profile for any given joint.…”
Section: Comparing Speed Histograms (mentioning)
confidence: 99%
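The statement above describes evaluating generated motion by comparing per-joint speed histograms against those of the original actor. Below is a minimal sketch of that idea, assuming joint positions are given at a fixed frame rate; the bin count and the use of the Hellinger distance are illustrative choices, not the exact metric of the cited works.

# Sketch of a per-joint speed-histogram comparison for gesture motion.
# Assumptions (not taken from the cited papers): positions have shape
# (frames, joints, 3), speeds are binned with shared edges, and the
# histograms are compared with the Hellinger distance.
import numpy as np

def joint_speeds(positions: np.ndarray, fps: float) -> np.ndarray:
    """Per-frame speed of every joint, in position units per second."""
    velocity = np.diff(positions, axis=0) * fps        # (frames-1, joints, 3)
    return np.linalg.norm(velocity, axis=-1)           # (frames-1, joints)

def speed_histogram(speeds: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Normalised speed histogram for a single joint."""
    counts, _ = np.histogram(speeds, bins=bins)
    return counts / max(counts.sum(), 1)

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def compare_joint(real_pos, generated_pos, joint: int, fps: float = 30.0) -> float:
    """Distance between the speed profiles of one joint in real vs. generated motion."""
    real_speed = joint_speeds(real_pos, fps)[:, joint]
    gen_speed = joint_speeds(generated_pos, fps)[:, joint]
    # Shared bin edges so the two histograms are directly comparable.
    bins = np.linspace(0.0, max(real_speed.max(), gen_speed.max()), 51)
    return hellinger(speed_histogram(real_speed, bins),
                     speed_histogram(gen_speed, bins))

A well-trained model should yield a small distance for every joint, i.e. a motion-speed profile close to that of the actor it was trained on.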
“…Moving forward, neural networks were employed to predict a sequence of frames for gestures (Hasegawa et al, 2018), head motions (Sadoughi and Busso, 2018) and body motions (Shlizerman et al, 2018;Ahuja et al, 2019;Ginosar et al, 2019;Ferstl et al, 2019) conditioned on a speech input while Yoon et al (2019) uses only a text input. Unlike these approaches, Kucherenko et al (2020) rely on both speech and language for gesture generation. But their choice of early fusion to combine the modalities ignores multi-scale correlations (Tsai et al, 2019) between speech and language.…”
Section: Related Work (mentioning)
confidence: 99%
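The statement above contrasts early fusion with models of multi-scale cross-modal correlations. The sketch below illustrates what frame-level early fusion of audio and text features looks like in general; the feature dimensions, the MLP decoder, and all names are illustrative assumptions rather than the actual Gesticulator architecture.

# Sketch of frame-level early fusion of speech audio and text features.
# All dimensions and module choices here are hypothetical placeholders.
import torch
import torch.nn as nn

class EarlyFusionGestureModel(nn.Module):
    def __init__(self, audio_dim=26, text_dim=768, hidden_dim=256, pose_dim=45):
        super().__init__()
        # Early fusion: both modalities are concatenated per frame before any
        # further processing, so cross-modal interactions are only modelled at
        # this single (frame-level) time scale.
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, pose_dim),
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. acoustic features
        # text_feats:  (batch, frames, text_dim), word embeddings upsampled
        #              to the audio frame rate
        fused = torch.cat([audio_feats, text_feats], dim=-1)
        return self.decoder(fused)   # (batch, frames, pose_dim)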
“…
Model                                   Expressivity   Naturalness   Relevance    Timing
S2G (Ginosar et al, 2019)               24.6 ± 3.1     22.1 ± 1.8    22.4 ± 1.7   27.6 ± 1.7
Gesticulator (Kucherenko et al, 2020)   31.9 ± 2.0     32.1 ± 1.7    31.4 ± 1.8   31.1 ± 1.7
Ours w/o G attn                         35.0 ± 2.3     29.2 ± 1.7    30.9 ± 1.8   30.8 ± 1.7
Ours w/o AISLe                          35.8 ± 2.9     35.7 ± 1.7    33.7 ± 1.7   32.1 ± 1.7
Ours                                    38.9 ± 1.7     36.7 ± 1.6    37.1 ± 1.7   35.3 ± 1.7
…”
Section: Models (mentioning)
confidence: 99%