2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00361

Learning Individual Styles of Conversational Gesture

Abstract: Figure 1: Speech-to-gesture translation example. In this paper, we study the connection between conversational gesture and speech. Here, we show the result of our model that predicts gesture from audio. From the bottom upward: the input audio, arm and hand pose predicted by our model, and video frames synthesized from pose predictions using [10].

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, …
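The abstract frames the task as audio-in, pose-out translation. As a minimal sketch of that setup (not the authors' actual architecture), the following assumes a 1D-convolutional network that maps log-mel spectrogram frames to per-frame 2D keypoints with an L1 regression loss; all layer sizes, the keypoint count, and the class name are illustrative:

    import torch
    import torch.nn as nn

    class SpeechToGesture(nn.Module):
        """Toy audio-to-pose translator: (B, n_mels, T) spectrogram in,
        (B, T, 2 * n_keypoints) 2D pose sequence out."""
        def __init__(self, n_mels=64, n_keypoints=49, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            )
            self.decoder = nn.Conv1d(hidden, 2 * n_keypoints, kernel_size=1)

        def forward(self, spectrogram):
            h = self.encoder(spectrogram)     # (B, hidden, T)
            pose = self.decoder(h)            # (B, 2K, T)
            return pose.permute(0, 2, 1)      # (B, T, 2K)

    model = SpeechToGesture()
    spec = torch.randn(8, 64, 128)            # batch of spectrogram windows
    target = torch.randn(8, 128, 2 * 49)      # matching ground-truth 2D poses
    loss = nn.functional.l1_loss(model(spec), target)
    loss.backward()

Because every layer here is a temporal convolution, the predicted pose stream stays frame-aligned with the input spectrogram.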

Cited by 241 publications (279 citation statements) | References 45 publications
“…Audio-Driven Gesture Generation. Most prior work on data-driven gesture generation has used the audio signal as the only speech-input modality in the model [14,15,19,28,42]. For example, Sadoughi and Busso [42] trained a probabilistic graphical model to generate a discrete set of gestures based on the speech audio signal, using discourse functions as constraints.…”
Section: 2.1
Confidence: 99%
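As a toy illustration of the discrete, constraint-conditioned generation that the quote attributes to Sadoughi and Busso [42] (not their actual probabilistic graphical model), one can estimate a categorical gesture distribution conditioned on a quantized audio feature and a discourse-function label; all data, labels, and names below are invented:

    import random
    from collections import Counter, defaultdict

    # Invented (audio cluster, discourse function, gesture) observations.
    data = [
        ("high_energy", "affirmation", "beat"),
        ("high_energy", "question", "point"),
        ("low_energy", "question", "shrug"),
        ("low_energy", "affirmation", "rest"),
        ("high_energy", "affirmation", "beat"),
    ]

    # Co-occurrence counts define P(gesture | audio cluster, discourse fn).
    counts = defaultdict(Counter)
    for audio, discourse, gesture in data:
        counts[(audio, discourse)][gesture] += 1

    def sample_gesture(audio_cluster, discourse_function):
        """Sample a discrete gesture conditioned on both inputs."""
        observed = counts[(audio_cluster, discourse_function)]
        gestures, weights = zip(*observed.items())
        return random.choices(gestures, weights=weights)[0]

    print(sample_gesture("high_energy", "affirmation"))  # usually "beat"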
“…Kucherenko et al. [28] extended this work by applying representation learning to the human pose and reducing the need for smoothing. Recently, Ginosar et al. [15] applied a convolutional neural network with adversarial training to generate 2D poses from spectrogram features. However, driving either virtual avatars or humanoid robots requires 3D joint angles.…”
Section: 2.1
Confidence: 99%
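A minimal sketch of the adversarial-training idea mentioned in the quote, assuming a 1D-convolutional discriminator that scores pose sequences as real or generated; shapes, sizes, and the class name are illustrative, and random tensors stand in for real data and generator output:

    import torch
    import torch.nn as nn

    class PoseDiscriminator(nn.Module):
        """Scores a pose sequence (B, T, D) as real (1) or generated (0)."""
        def __init__(self, pose_dim=98, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(pose_dim, hidden, kernel_size=5, padding=2),
                nn.LeakyReLU(0.2),
                nn.Conv1d(hidden, 1, kernel_size=1),
            )

        def forward(self, poses):
            # (B, T, D) -> (B, D, T) for Conv1d, then average to one logit.
            return self.net(poses.permute(0, 2, 1)).mean(dim=(1, 2))

    disc = PoseDiscriminator()
    bce = nn.BCEWithLogitsLoss()
    real = torch.randn(8, 128, 98)   # ground-truth pose sequences
    fake = torch.randn(8, 128, 98)   # stand-in for generator output

    # Discriminator step: push real toward 1 and generated toward 0.
    d_loss = bce(disc(real), torch.ones(8)) + bce(disc(fake), torch.zeros(8))
    # Generator step would instead push disc(fake) toward 1.
    g_adv = bce(disc(fake), torch.ones(8))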
“…Indeed, a recent breakthrough in deep-learning modeling suggests a highly invariant coupling of gesture and speech prosody. A recurrent neural network trained on person-specific gesture-speech sequences (motion and audio data from talk shows) was able to produce novel speech-synchronous gestures based on novel speech from the person the neural network was trained on (Ginosar et al., 2019). These neural networks thus show that there must be some person-specific invariant between speech acoustics and gesture motion, although it remains unknown what the neural network in fact picked up on in the speech so as to produce gesture so well (but see Kucherenko, Hasegawa, Henter, Kaneko, & Kjellström, 2019).…”
Confidence: 99%