2021
DOI: 10.1002/cav.2016

ExpressGesture: Expressive gesture generation from speech through database matching

Abstract: Co-speech gestures are a vital ingredient in making virtual agents more human-like and engaging. Automatically generated gestures based on speech input often lack realistic and defined gesture form. We present a database-driven approach guaranteeing defined gesture form. We built a large corpus of over 23,000 motion-captured co-speech gestures and select individual gestures based on expressive gesture characteristics that can be estimated from speech audio. The expressive parameters are gesture velocity and ac…
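The database-matching step the abstract describes can be made concrete with a short sketch. The following is a minimal, hypothetical illustration, not the paper's implementation: it assumes each corpus entry stores precomputed velocity and acceleration values, and it retrieves the clip whose expressive parameters lie closest (in weighted squared distance) to values estimated from the speech audio. All names here (GestureEntry, match_gesture, the toy clip IDs) are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class GestureEntry:
    """One motion-captured gesture clip with precomputed expressive parameters."""
    clip_id: str
    velocity: float      # e.g., mean hand velocity over the clip
    acceleration: float  # e.g., mean hand acceleration over the clip

def match_gesture(entries, target_velocity, target_acceleration,
                  w_vel=1.0, w_acc=1.0):
    """Return the clip whose (velocity, acceleration) pair is closest,
    under a weighted squared distance, to values estimated from speech."""
    def distance(e):
        return (w_vel * (e.velocity - target_velocity) ** 2
                + w_acc * (e.acceleration - target_acceleration) ** 2)
    return min(entries, key=distance)

# Toy usage: three database clips and parameters estimated from audio.
db = [
    GestureEntry("clip_017", velocity=0.4, acceleration=1.2),
    GestureEntry("clip_042", velocity=0.9, acceleration=2.8),
    GestureEntry("clip_003", velocity=0.2, acceleration=0.5),
]
print(match_gesture(db, target_velocity=0.85, target_acceleration=2.5).clip_id)
# -> clip_042
```

The weights w_vel and w_acc are a common way to trade off the influence of each parameter; whether and how the paper weights them is not stated in this excerpt.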

Cited by 25 publications (8 citation statements) | References 25 publications
“…beat gestures) in the context of acoustic prosody has been studied heavily since the era of rule-based gesture synthesis [CMM99, MXL*13]. Fast-forwarding to data-driven synthesis, some approaches explicitly rely on extracted prosodic features [FNM21], while others [GBK*19, ALNM20] learn implicit embeddings from acoustics, of which prosody is one of the key components. It seems clear that gesture production must be grounded in the rhythm of audio data, and appropriate beat gestures will be challenging to achieve from text transcriptions alone, without timing information [KNN*22].…”
Sections: Multimodal Grounding; Key Challenges Of Gesture Generation (mentioning)
confidence: 99%
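As a concrete illustration of the prosodic grounding this passage describes, the sketch below extracts two standard prosodic streams, frame-wise F0 (pitch) and RMS energy, from a speech waveform. It is a minimal example assuming the librosa library and is not the feature set of any specific cited system.

```python
import librosa

def prosodic_features(wav_path, sr=16000):
    """Extract basic prosodic streams often used to time beat gestures:
    frame-wise F0 (pitch) and RMS energy."""
    y, sr = librosa.load(wav_path, sr=sr)
    # pYIN pitch tracking; unvoiced frames are returned as NaN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    energy = librosa.feature.rms(y=y)[0]
    return f0, energy
```

Peaks in the energy stream and voiced stretches of F0 carry exactly the kind of timing information the quoted passage argues text transcriptions alone cannot provide.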
“…For example, the work of Zhuang et al. [76] uses a transformer-based encoder-decoder for face animation and a motion-graph retrieval module for body animation. Another example is the work of Ferstl et al. [19], who generate parameters such as the acceleration or velocity of motion from the audio before finding a corresponding motion in a database.…”
Section: Data-driven Approaches (mentioning)
confidence: 99%
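The two-stage pipeline attributed to Ferstl et al. in this excerpt (predict expressive motion parameters from audio, then look up a matching clip) can be sketched end to end. The toy version below uses a ridge-regression mapping and nearest-neighbour retrieval; the feature dimensions, the synthetic data, and the choice of a linear model are all illustrative assumptions, not the cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: learn a linear (ridge) map from pooled audio features
# (e.g., F0/energy statistics) to [velocity, acceleration] targets.
X_train = rng.normal(size=(200, 4))           # 4 pooled audio features
Y_train = X_train @ rng.normal(size=(4, 2))   # synthetic [vel, acc] targets
lam = 1e-2                                    # ridge regularizer
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(4),
                    X_train.T @ Y_train)

# Stage 2: nearest-neighbour retrieval over precomputed clip parameters.
db_params = rng.normal(size=(50, 2))          # 50 clips, [vel, acc] each
x_new = rng.normal(size=(1, 4))               # features of a new utterance
y_pred = x_new @ W                            # predicted [vel, acc]
best_clip = int(np.argmin(np.linalg.norm(db_params - y_pred, axis=1)))
print("retrieved clip index:", best_clip)
```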