2023
DOI: 10.1111/cgf.14734

ZeroEGGS: Zero‐shot Example‐based Gesture Generation from Speech

Abstract: We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a Variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of out…
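The latent-space operations the abstract describes (sampling a style embedding, then blending or scaling embeddings) can be illustrated with a short sketch. This is a minimal illustration under assumed names and shapes, not the authors' implementation; the 64-dimensional style space and the stand-in encoder outputs are assumptions.

```python
# Minimal sketch (assumed names/shapes, not the ZeroEGGS code) of the
# latent-space style manipulation described in the abstract.
import torch

def sample_style(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: draw a style embedding z ~ N(mu, diag(sigma^2))."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

# Stand-ins for what a variational style encoder might output for two example clips
mu_a, logvar_a = torch.randn(1, 64), torch.zeros(1, 64)   # e.g. a "happy" clip
mu_b, logvar_b = torch.randn(1, 64), torch.zeros(1, 64)   # e.g. a "tired" clip

z_a = sample_style(mu_a, logvar_a)
z_b = sample_style(mu_b, logvar_b)

z_blend = 0.6 * z_a + 0.4 * z_b   # blend two styles in latent space
z_scaled = 1.5 * z_a              # scale an embedding to exaggerate a style
# Re-sampling with the same (mu, logvar) yields varied but style-consistent
# outputs, reflecting the probabilistic nature of the framework.
```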

Cited by 36 publications (31 citation statements) · References 32 publications
“…In a similar vein, Ghorbani et al [GFC22, GFH*23] used a VAE‐based framework for style‐controllable co‐speech gesture generation conditioned by a zero‐shot motion example, i.e., an instance of a motion style unseen during training. Given an audio input and a motion example, they generated an encoding of the audio and a style embedding from the motion, and the two latent codes were used to guide the generation of stylized gestures.…”
Section: Data‐driven Approaches
confidence: 99%
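The statement above describes two latent codes, a per-frame audio encoding and a clip-level style embedding, jointly guiding gesture generation. A hedged sketch of such a conditioning scheme follows; the module name, dimensions, and the GRU-based decoder are illustrative assumptions rather than the architecture of the cited work.

```python
# Hypothetical decoder conditioned on an audio encoding and a style embedding.
import torch
import torch.nn as nn

class GestureDecoder(nn.Module):
    def __init__(self, audio_dim=128, style_dim=64, pose_dim=75, hidden=512):
        super().__init__()
        self.gru = nn.GRU(audio_dim + style_dim + pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, audio_enc, style_z, prev_pose):
        # audio_enc: (B, T, audio_dim)  per-frame speech features
        # style_z:   (B, style_dim)     clip-level style embedding
        # prev_pose: (B, T, pose_dim)   previously generated poses (teacher-forced here)
        style = style_z.unsqueeze(1).expand(-1, audio_enc.size(1), -1)
        x = torch.cat([audio_enc, style, prev_pose], dim=-1)
        h, _ = self.gru(x)
        return self.out(h)   # stylized pose sequence, one frame per audio frame

decoder = GestureDecoder()
poses = decoder(torch.randn(2, 200, 128), torch.randn(2, 64), torch.randn(2, 200, 75))
```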
“…Style specification is also not data efficient, requiring as many samples as the size of the training set for the model to learn a style [AHKB20, ALNM20]. We conclude this section by discussing several works that proposed approaches for data‐efficient style specification [GFC22, GFH*23, FGPO22, ALM22].…”
Section: Data‐driven Approaches
confidence: 99%
“…Compared to CNN- and RNN-based models, the transformer model [Vaswani et al 2017] is relatively less explored in audio-driven motion synthesis. Saeed et al [2023] present a variational transformer for encoding style information, while adopting recurrent networks to model motion generation from both speech and style. Valle-Pérez et al [2021] and Li et al [2021b] propose generative transformer approaches with normalizing flow for dancing motion synthesis from music.…”
Section: Related Work
confidence: 99%
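The last statement describes a variational transformer for encoding style combined with recurrent networks for motion generation. A sketch of such a style encoder is below; the class name, layer sizes, and mean-pooling choice are assumptions, and the recurrent motion decoder would resemble the conditioning sketch shown earlier.

```python
# Illustrative (not the cited paper's code): a transformer that encodes a style
# example clip into the parameters of a Gaussian style embedding.
import torch
import torch.nn as nn

class VariationalTransformerStyleEncoder(nn.Module):
    def __init__(self, pose_dim=75, d_model=256, style_dim=64, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, style_dim)
        self.to_logvar = nn.Linear(d_model, style_dim)

    def forward(self, example_motion):
        # example_motion: (batch, frames, pose_dim) short style example clip
        h = self.encoder(self.proj(example_motion))
        pooled = h.mean(dim=1)                 # average-pool over time
        return self.to_mu(pooled), self.to_logvar(pooled)

mu, logvar = VariationalTransformerStyleEncoder()(torch.randn(2, 120, 75))
```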