2022
DOI: 10.1007/978-3-031-20071-7_36

BEAT: A Large-Scale Semantic and Emotional Multi-modal Dataset for Conversational Gestures Synthesis

Cited by 52 publications (43 citation statements) · References 47 publications
“…7.1.1 Data. We train and test our system on two high-quality speech-gesture datasets: ZeroEGGS [Ghorbani et al 2022] and BEAT [Liu et al 2022e]. The ZeroEGGS dataset contains two hours of full-body motion capture and audio from monologues performed by an English-speaking female actor in 19 different styles.…”
Section: System Setup (mentioning)
confidence: 99%
“…At the time of writing this work, the authors of CaMN [Liu et al 2022e] had not provided the pre-trained generation model. Instead, they offered training code for a toy dataset and a pre-trained motion auto-encoder for the calculation of FGD.…”
Section: B Implementation Details of Baselines (mentioning)
confidence: 99%
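
For context on the FGD metric mentioned above: the Fréchet Gesture Distance is computed like the Fréchet Inception Distance, but over latent features of motion produced by a pre-trained auto-encoder such as the one the CaMN authors released. A minimal sketch, assuming `real_feats` and `gen_feats` are `(N, D)` feature arrays; the function name is illustrative:

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    # Fit a Gaussian (mean, covariance) to each feature set.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Frechet distance between the two Gaussians:
    # ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical-noise imaginary parts
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```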
“…Since the existing 3D hand prediction dataset is noisy due to the automated annotation process, we propose a prototypical memory bank to store realistic hand prototype representations encoded from real 3D hands. These 3D hands come from a studio-based motion-capture dataset named BEAT [22]. The reading and updating strategies of the hands prototypical memory are the same as in TMM.…”
Section: Stage Two: Diverse Sampling (mentioning)
confidence: 99%
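
The excerpt above names a prototypical memory bank but leaves its read and update rules to the paper's TMM module, which the excerpt does not describe. As a generic illustration only (not the cited paper's implementation), a common pattern is an attention-based read over learned prototypes with an exponential-moving-average update; every class name, dimension, and hyperparameter below is hypothetical:

```python
import torch
import torch.nn.functional as F

class PrototypeMemory(torch.nn.Module):
    """Generic prototypical memory: attention-based read, EMA update."""

    def __init__(self, num_prototypes=64, dim=256, momentum=0.99):
        super().__init__()
        self.register_buffer("protos", torch.randn(num_prototypes, dim))
        self.momentum = momentum

    def read(self, query):
        # query: (B, dim). Soft-attend over prototypes by cosine similarity.
        sim = F.normalize(query, dim=-1) @ F.normalize(self.protos, dim=-1).T
        attn = sim.softmax(dim=-1)   # (B, K) attention weights
        return attn @ self.protos    # (B, dim) retrieved prototype mix

    @torch.no_grad()
    def update(self, feats):
        # feats: (B, dim) features encoded from real hands. Each feature
        # pulls its nearest prototype toward itself with an EMA step.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.protos, dim=-1).T
        idx = sim.argmax(dim=-1)     # (B,) nearest prototype per feature
        for i, f in zip(idx.tolist(), feats):
            self.protos[i] = self.momentum * self.protos[i] + (1 - self.momentum) * f
```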
“…In recent years, the compelling performance of deep neural networks has prompted data-driven approaches. Previous studies establish large-scale speech-gesture corpora to learn the mapping from speech audio to human skeletons in an end-to-end manner [4,5,25,27,30,34,39]. To attain more expressive results, Ginosar et al [16] and Yoon et al [41] propose GAN-based methods that guarantee realism via an adversarial mechanism, where the discriminator is trained to distinguish real gestures from synthetic ones while the generator's objective is to fool the discriminator.…”
Section: Introduction (mentioning)
confidence: 99%
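
To make the adversarial mechanism in that excerpt concrete: the discriminator learns to label real gestures 1 and synthetic gestures 0, while the generator is trained to make the discriminator output 1 on its samples. A minimal single-step sketch with placeholder MLPs (real systems such as those of Ginosar et al. and Yoon et al. use sequence models over audio features and pose sequences); all dimensions and architectures here are illustrative:

```python
import torch
import torch.nn as nn

audio_dim, pose_dim = 128, 48  # hypothetical feature sizes
G = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))
D = nn.Sequential(nn.Linear(pose_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(audio, real_pose):
    # Discriminator step: real gestures -> 1, synthetic gestures -> 0.
    fake_pose = G(audio).detach()
    d_loss = (bce(D(real_pose), torch.ones(real_pose.size(0), 1))
              + bce(D(fake_pose), torch.zeros(fake_pose.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into outputting 1.
    g_loss = bce(D(G(audio)), torch.ones(audio.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The `detach()` on the generator output keeps the discriminator update from back-propagating into the generator, which is the standard way to alternate the two objectives.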