Style‐Controllable Speech‐Driven Gesture Synthesis Using Normalising Flows

Alexanderson, Simon; Henter, Gustav Eje; Kucherenko, Taras; Beskow, Jonas

doi:10.1111/cgf.13946

Cited by 118 publications

(126 citation statements)

References 37 publications

Supporting

Mentioning

126

Contrasting


“…Future work also involves making the model stochastic (as in [2]), using larger datasets (such as [29]) and further improving the semantic coherence of the gestures, for instance by treating different gesture types separately.…”

Section: Discussion

mentioning

confidence: 99%

“…The recurrent connections used in several models [13,19,48] can also act as a pose memory that may help the model to produce smooth output motion. Autoregressive motion models have recently demonstrated promising results in probabilistic audio-driven gesture generation [2]. In this paper, we similarly investigate autoregressive connections for improving motion quality, which explicitly provide the most recent poses as input to the model when generating the next pose.…”

Section: Regarding Motion Continuity

mentioning

confidence: 99%

“…Alongside recent advances in deep learning, data-driven approaches have increasingly gained interest for gesture generation [1,27,48]. While early work has considered gesture generation as a classification task which aims to deduce a specified gesture class [9,37], more recent work has considered it as a regression task which aims to produce continuous motion [2,48]. We focus on the latter task: continuous gesture generation.…”

mentioning

confidence: 99%


Gesticulator: A framework for semantically-aware speech-driven gesture generation

Kucherenko, Jonell, Waveren et al. 2020

Proceedings of the 2020 International Conference on Multimodal Interaction

Self Cite


During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticulator.


“…In recent work, recurrent neural networks have proven popular; a classic training loss has been employed for English [12,21] and Japanese speech-to-gesture generation [18,20]. To combat the problem of mean pose regression in a standard training paradigm, an adversarial training paradigm has been proposed in [14] (similarly for a convolutional network setup in [15]), and recently, probabilistic generative modelling has shown promise [1]. However, due to the highly indeterministic input-to-output relation, modelling plausible gestures remains a difficult problem.…”

Section: Related Work

mentioning

confidence: 99%

“…(3.1) path length (3.2) major axis length (4) arm swivel (5) hand opening Velocity and initial acceleration both describe the kinematics of the gesture, represented by the maximum stroke velocity (1), and by the mean acceleration to the first major velocity peak (2). Velocity captures a character's tempo and relates to the amount of energy they are using.…”

Section: Gesture Processing

mentioning

confidence: 99%

Understanding the Predictability of Gesture Parameters from Speech and their Perceptual Importance

Ferstl, Neff, McDonnell 2020

Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents


Gesture behavior is a natural part of human conversation. Much work has focused on removing the need for tedious hand-animation to create embodied conversational agents by designing speech-driven gesture generators. However, these generators often work in a black-box manner, assuming a general relationship between input speech and output motion. As their success remains limited, we investigate in more detail how speech may relate to different aspects of gesture motion. We determine a number of parameters characterizing gesture, such as speed and gesture size, and explore their relationship to the speech signal in a twofold manner. First, we train multiple recurrent networks to predict the gesture parameters from speech to understand how well gesture attributes can be modeled from speech alone. We find that gesture parameters can be partially predicted from speech, with some parameters, such as path length, being predicted more accurately than others, like velocity. Second, we design a perceptual study to assess the importance of each gesture parameter for producing motion that people perceive as appropriate for the speech. Results show that a degradation in any parameter was viewed negatively, but some changes, such as hand shape, are more impactful than others. A video summarization can be found at https://youtu.be/aw6-_5kmLjY.

CCS Concepts: Computing methodologies → Animation; Machine learning.


ExpressGesture: Expressive gesture generation from speech through database matching

Ferstl, Neff, McDonnell 2021

Computer Animation & Virtual


Co-speech gestures are a vital ingredient in making virtual agents more human-like and engaging. Automatically generated gestures based on speech-input often lack realistic and defined gesture form. We present a database-driven approach guaranteeing defined gesture form. We built a large corpus of over 23,000 motion-captured co-speech gestures and select individual gestures based on expressive gesture characteristics that can be estimated from speech audio. The expressive parameters are gesture velocity and acceleration, gesture size, arm swivel, and finger extension. Individual, parameter-matched gestures are then combined into animated sequences. We evaluate our gesture generation system in two perceptual studies. The first study compares our method to the ground truth gestures as well as mismatched gestures. The second study compares our method to five current generative machine learning models. Our method outperformed mismatched gesture selection in the first study and showed competitive performance in the second.
