2023
DOI: 10.1111/cgf.14734

ZeroEGGS: Zero‐shot Example‐based Gesture Generation from Speech

Abstract: We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a Variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of out…
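The latent-space operations the abstract describes (sampling a style embedding, then blending or scaling embeddings) can be illustrated with a short sketch. This is a minimal illustration under assumed names and shapes, not the authors' implementation; the 64-dimensional style space and the stand-in encoder outputs are assumptions.

```python
# Minimal sketch (assumed names/shapes, not the ZeroEGGS code) of the
# latent-space style manipulation described in the abstract.
import torch

def sample_style(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: draw a style embedding z ~ N(mu, diag(sigma^2))."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

# Stand-ins for what a variational style encoder might output for two example clips
mu_a, logvar_a = torch.randn(1, 64), torch.zeros(1, 64)   # e.g. a "happy" clip
mu_b, logvar_b = torch.randn(1, 64), torch.zeros(1, 64)   # e.g. a "tired" clip

z_a = sample_style(mu_a, logvar_a)
z_b = sample_style(mu_b, logvar_b)

z_blend = 0.6 * z_a + 0.4 * z_b   # blend two styles in latent space
z_scaled = 1.5 * z_a              # scale an embedding to exaggerate a style
# Re-sampling with the same (mu, logvar) yields varied but style-consistent
# outputs, reflecting the probabilistic nature of the framework.
```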

Cited by 36 publications (31 citation statements) · References 32 publications
“…In a similar vein, Ghorbani et al [GFC22, GFH*23] used a VAE‐based framework for style‐controllable co‐speech gesture generation conditioned by a zero‐shot motion example, i.e., an instance of a motion style unseen during training. Given an audio input and a motion example, they generated an encoding of the audio and a style embedding from the motion, and the two latent codes were used to guide the generation of stylized gestures.…”
Section: Data‐driven Approaches
confidence: 99%
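The statement above describes two latent codes, a per-frame audio encoding and a clip-level style embedding, jointly guiding gesture generation. A hedged sketch of such a conditioning scheme follows; the module name, dimensions, and the GRU-based decoder are illustrative assumptions rather than the architecture of the cited work.

```python
# Hypothetical decoder conditioned on an audio encoding and a style embedding.
import torch
import torch.nn as nn

class GestureDecoder(nn.Module):
    def __init__(self, audio_dim=128, style_dim=64, pose_dim=75, hidden=512):
        super().__init__()
        self.gru = nn.GRU(audio_dim + style_dim + pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, audio_enc, style_z, prev_pose):
        # audio_enc: (B, T, audio_dim)  per-frame speech features
        # style_z:   (B, style_dim)     clip-level style embedding
        # prev_pose: (B, T, pose_dim)   previously generated poses (teacher-forced here)
        style = style_z.unsqueeze(1).expand(-1, audio_enc.size(1), -1)
        x = torch.cat([audio_enc, style, prev_pose], dim=-1)
        h, _ = self.gru(x)
        return self.out(h)   # stylized pose sequence, one frame per audio frame

decoder = GestureDecoder()
poses = decoder(torch.randn(2, 200, 128), torch.randn(2, 64), torch.randn(2, 200, 75))
```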
“…Style specification is also not data efficient, requiring as many samples as the size of the training set for the model to learn a style [AHKB20, ALNM20]. We conclude this section by discussing several works that proposed approaches for data‐efficient style specification [GFC22, GFH*23, FGPO22, ALM22].…”
Section: Data‐driven Approaches
confidence: 99%
“…Compared to CNN- and RNN-based models, the transformer model [Vaswani et al 2017] is relatively less explored in audio-driven motion synthesis. Saeed et al [2023] present a variational transformer for encoding style information, while adopting recurrent networks to model motion generation from both speech and style. Valle-Pérez et al [2021] and Li et al [2021b] propose generative transformer approaches with normalizing flow for dancing motion synthesis from music.…”
Section: Related Work
confidence: 99%
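The last statement describes a variational transformer for encoding style combined with recurrent networks for motion generation. A sketch of such a style encoder is below; the class name, layer sizes, and mean-pooling choice are assumptions, and the recurrent motion decoder would resemble the conditioning sketch shown earlier.

```python
# Illustrative (not the cited paper's code): a transformer that encodes a style
# example clip into the parameters of a Gaussian style embedding.
import torch
import torch.nn as nn

class VariationalTransformerStyleEncoder(nn.Module):
    def __init__(self, pose_dim=75, d_model=256, style_dim=64, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, style_dim)
        self.to_logvar = nn.Linear(d_model, style_dim)

    def forward(self, example_motion):
        # example_motion: (batch, frames, pose_dim) short style example clip
        h = self.encoder(self.proj(example_motion))
        pooled = h.mean(dim=1)                 # average-pool over time
        return self.to_mu(pooled), self.to_logvar(pooled)

mu, logvar = VariationalTransformerStyleEncoder()(torch.randn(2, 120, 75))
```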