2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00361

Learning Individual Styles of Conversational Gesture

Abstract: Figure 1: Speech-to-gesture translation example. In this paper, we study the connection between conversational gesture and speech. Here, we show the result of our model that predicts gesture from audio. From the bottom upward: the input audio, arm and hand pose predicted by our model, and video frames synthesized from pose predictions using [10].

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, …
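The abstract frames the task as audio-in, pose-out translation. As a minimal sketch of that setup (not the authors' actual architecture), the following assumes a 1D-convolutional network that maps log-mel spectrogram frames to per-frame 2D keypoints with an L1 regression loss; all layer sizes, the keypoint count, and the class name are illustrative:

    import torch
    import torch.nn as nn

    class SpeechToGesture(nn.Module):
        """Toy audio-to-pose translator: (B, n_mels, T) spectrogram in,
        (B, T, 2 * n_keypoints) 2D pose sequence out."""
        def __init__(self, n_mels=64, n_keypoints=49, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            )
            self.decoder = nn.Conv1d(hidden, 2 * n_keypoints, kernel_size=1)

        def forward(self, spectrogram):
            h = self.encoder(spectrogram)     # (B, hidden, T)
            pose = self.decoder(h)            # (B, 2K, T)
            return pose.permute(0, 2, 1)      # (B, T, 2K)

    model = SpeechToGesture()
    spec = torch.randn(8, 64, 128)            # batch of spectrogram windows
    target = torch.randn(8, 128, 2 * 49)      # matching ground-truth 2D poses
    loss = nn.functional.l1_loss(model(spec), target)
    loss.backward()

Because every layer here is a temporal convolution, the predicted pose stream stays frame-aligned with the input spectrogram.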

Cited by 241 publications (279 citation statements) | References 45 publications
“…Audio-Driven Gesture Generation. Most prior work on data-driven gesture generation has used the audio signal as the only speech-input modality in the model [14,15,19,28,42]. For example, Sadoughi and Busso [42] trained a probabilistic graphical model to generate a discrete set of gestures based on the speech audio signal, using discourse functions as constraints.…”
Section: 2.1
Confidence: 99%
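As a toy illustration of the discrete, constraint-conditioned generation that the quote attributes to Sadoughi and Busso [42] (not their actual probabilistic graphical model), one can estimate a categorical gesture distribution conditioned on a quantized audio feature and a discourse-function label; all data, labels, and names below are invented:

    import random
    from collections import Counter, defaultdict

    # Invented (audio cluster, discourse function, gesture) observations.
    data = [
        ("high_energy", "affirmation", "beat"),
        ("high_energy", "question", "point"),
        ("low_energy", "question", "shrug"),
        ("low_energy", "affirmation", "rest"),
        ("high_energy", "affirmation", "beat"),
    ]

    # Co-occurrence counts define P(gesture | audio cluster, discourse fn).
    counts = defaultdict(Counter)
    for audio, discourse, gesture in data:
        counts[(audio, discourse)][gesture] += 1

    def sample_gesture(audio_cluster, discourse_function):
        """Sample a discrete gesture conditioned on both inputs."""
        observed = counts[(audio_cluster, discourse_function)]
        gestures, weights = zip(*observed.items())
        return random.choices(gestures, weights=weights)[0]

    print(sample_gesture("high_energy", "affirmation"))  # usually "beat"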
“…Kucherenko et al. [28] extended this work by applying representation learning to the human pose and reducing the need for smoothing. Recently, Ginosar et al. [15] applied a convolutional neural network with adversarial training to generate 2D poses from spectrogram features. However, driving either virtual avatars or humanoid robots requires 3D joint angles.…”
Section: 2.1
Confidence: 99%
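A minimal sketch of the adversarial-training idea mentioned in the quote, assuming a 1D-convolutional discriminator that scores pose sequences as real or generated; shapes, sizes, and the class name are illustrative, and random tensors stand in for real data and generator output:

    import torch
    import torch.nn as nn

    class PoseDiscriminator(nn.Module):
        """Scores a pose sequence (B, T, D) as real (1) or generated (0)."""
        def __init__(self, pose_dim=98, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(pose_dim, hidden, kernel_size=5, padding=2),
                nn.LeakyReLU(0.2),
                nn.Conv1d(hidden, 1, kernel_size=1),
            )

        def forward(self, poses):
            # (B, T, D) -> (B, D, T) for Conv1d, then average to one logit.
            return self.net(poses.permute(0, 2, 1)).mean(dim=(1, 2))

    disc = PoseDiscriminator()
    bce = nn.BCEWithLogitsLoss()
    real = torch.randn(8, 128, 98)   # ground-truth pose sequences
    fake = torch.randn(8, 128, 98)   # stand-in for generator output

    # Discriminator step: push real toward 1 and generated toward 0.
    d_loss = bce(disc(real), torch.ones(8)) + bce(disc(fake), torch.zeros(8))
    # Generator step would instead push disc(fake) toward 1.
    g_adv = bce(disc(fake), torch.ones(8))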
“…Indeed, a recent breakthrough in deep-learning modeling suggests a highly invariant coupling of gesture and speech prosody. A recurrent neural network trained on person-specific gesture-speech sequences (motion and audio data from talk shows) was able to produce novel speech-synchronous gestures based on novel speech from the person the neural network was trained on (Ginosar et al., 2019). These neural networks thus show that there must be some person-specific invariant between speech acoustics and gesture motion, although it remains unknown what the neural network in fact picked up on in the speech so as to produce gesture so well (but see Kucherenko, Hasegawa, Henter, Kaneko, & Kjellström, 2019).…”
Confidence: 99%