2019 International Conference on Robotics and Automation (ICRA)
DOI: 10.1109/icra.2019.8793720

Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots

Abstract: Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Existing robots use rule-based speech-gesture association, but implementing this requires human labor and expert prior knowledge. We present a learning-based co-speech gesture generation model learned from 52 hours of TED talks. The proposed end-to-end neural network model consists of an encoder for speech text understanding and a decoder to generate a sequence of gestures. The model successfully prod…
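The encoder-decoder idea in the abstract can be sketched as a toy autoregressive model: an encoder summarizes the utterance text into a hidden state, and a decoder emits one pose (gesture frame) per step. This is a minimal illustration with made-up dimensions and weight names, not the paper's actual architecture or trained parameters.

```python
import numpy as np

# Toy sketch of a text-to-gesture encoder-decoder. All sizes, weights,
# and function names are illustrative assumptions, not the paper's model.
rng = np.random.default_rng(0)

VOCAB, EMB, HID, POSE_DIM = 50, 16, 32, 10  # toy dimensions

W_emb = rng.standard_normal((VOCAB, EMB)) * 0.1          # word embeddings
W_enc = rng.standard_normal((EMB + HID, HID)) * 0.1      # encoder RNN weights
W_dec = rng.standard_normal((POSE_DIM + HID, HID)) * 0.1 # decoder RNN weights
W_out = rng.standard_normal((HID, POSE_DIM)) * 0.1       # hidden -> pose

def encode(token_ids):
    """Run a simple tanh RNN over word embeddings; return the final state."""
    h = np.zeros(HID)
    for t in token_ids:
        h = np.tanh(np.concatenate([W_emb[t], h]) @ W_enc)
    return h

def decode(h, n_frames):
    """Autoregressively generate a sequence of pose vectors."""
    pose = np.zeros(POSE_DIM)
    frames = []
    for _ in range(n_frames):
        h = np.tanh(np.concatenate([pose, h]) @ W_dec)
        pose = h @ W_out
        frames.append(pose)
    return np.stack(frames)

gestures = decode(encode([3, 17, 42]), n_frames=8)
print(gestures.shape)  # (8, 10): eight frames of a 10-D pose vector
```

In the paper the decoder output would drive a humanoid's joints; here the pose is just a placeholder vector.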

Cited by 171 publications (198 citation statements)
References 19 publications
“…Ishi et al [22] generated gestures from text input through a series of probabilistic functions: Words were mapped to word concepts using WordNet [34], which then were mapped to a gesture function (e.g., iconic or beat), which in turn were mapped to clusters of 3D hand gestures. Yoon et al [48] learned a mapping from the utterance text to gestures using a recurrent neural network. The produced gestures were aligned with audio in a post-processing step.…”
Section: 2.2
Mentioning confidence: 99%
“…Instead, they rely on postprocessing to increase smoothness as in [19]. Yoon et al [48] include a velocity penalty in training that discourages jerky motion. The recurrent connections used in several models [13,19,48] can also act as a pose memory that may help the model to produce smooth output motion.…”
Section: Regarding Motion Continuity
Mentioning confidence: 99%
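The velocity penalty mentioned in the quote above can be sketched as a first-difference loss term: penalize large frame-to-frame pose changes so the generated motion stays smooth. This is a toy illustration of the general idea, with an assumed weight and shapes, not the paper's exact loss.

```python
import numpy as np

def velocity_penalty(poses, weight=1.0):
    """Mean squared frame-to-frame difference of a (T, D) pose sequence.

    `weight` is an illustrative loss coefficient, not a value from the paper.
    """
    diffs = np.diff(poses, axis=0)   # (T-1, D) per-frame velocities
    return weight * np.mean(diffs ** 2)

# Steady motion: each joint moves in equal small steps across 5 frames.
smooth = np.tile(np.linspace(0.0, 1.0, 5)[:, None], (1, 3))
jerky = smooth.copy()
jerky[2] += 5.0                      # inject a sudden jump at frame 2

print(velocity_penalty(smooth), velocity_penalty(jerky))
```

Added to the main regression loss during training, such a term discourages exactly the kind of sudden jump injected in the example.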
“…Lastly, the learned model can be applied to a humanoid robot so that the robot's speech is accompanied by appropriate co-speech gestures, for instance on the NAO robot as in [39].…”
Section: Future Work
Mentioning confidence: 99%