2021
DOI: 10.1080/10447318.2021.1883883
Moving Fast and Slow: Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation

Cited by 32 publications (21 citation statements)
References 40 publications
“…Many data-driven systems have only considered a single speech modality – either audio recordings or text transcriptions thereof – as input to the gesture generation, e.g., [3,25,39,54]. However, the field is now shifting to use both audio and text together [1,9,26,53].…”
Section: Effect of the Speech Input Modality
Confidence: 99%
“…These prosodic features are commonly used in speech emotion analysis as well as for gesture property prediction, e.g., [56]. We normalised pitch and intensity as in [8,25]: the pitch values were adjusted by taking log(x + 1) − 4 and setting negative values to zero, and the intensity values were adjusted by taking log(x) − 3. The audio features were first extracted at 200 fps and then resampled to 5 fps by averaging, to match the resolution of the gesture annotations.…”
Section: Speech Modalities and Their Encoding
Confidence: 99%
“…The transfer of physical human movement to virtual avatars is certainly not a novel concept. Broadly speaking, human movement behavior has long been of interest to scholars examining natural mapping (Birk and Mandryk, 2013; Vanden Abeele et al., 2013), intelligent virtual agents (Gratch et al., 2002; Thiebaux et al., 2008; Marsella et al., 2013; Kucherenko et al., 2021), VR-based gesture tracking (Won et al., 2012; Christou and Michael, 2014), and pose estimation of anatomical keypoints (Andriluka et al., 2010; Pishchulin et al., 2012; Cao et al., 2017). Natural mapping motion capture systems, which generate virtual avatar representations based on physical human behavior, vary from 3D pose estimation [see (Wang et al., 2021) for a review] to facial expression sensors (Lugrin et al., 2016).…”
Section: Person-based Actions
Confidence: 99%
“…Objective measures rely on an algorithmic approach to return a quantitative measure of the quality of the behaviour and are entirely automated, while subjective measures instead rely on ratings by human observers. Most recent papers on co-speech gesture generation report objective measures to assess the quality of the generated behaviour, with measures such as velocity diagrams or average jerk being popular [1,16,40]. These measures are not only easy to automate, but also allow comparisons across models.…”
Section: Introduction
Confidence: 99%
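Average jerk, one of the objective measures that citation statement mentions, is the mean magnitude of the third time-derivative of joint positions: smoother, more natural motion yields lower jerk. A minimal finite-difference sketch, assuming motion is given as a (frames, joints, 3) position array at a known frame rate (the exact formulation varies across the cited papers):

```python
import numpy as np

def average_jerk(positions, fps=30):
    # positions: (frames, joints, 3) array of joint positions over time.
    # Jerk is the third derivative of position; approximate it with
    # third-order finite differences along the time axis.
    dt = 1.0 / fps
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    # Mean Euclidean magnitude over all remaining frames and joints.
    return float(np.mean(np.linalg.norm(jerk, axis=-1)))
```

Because jerk is computed purely from the generated motion, it is fully automated and comparable across models, but it says nothing about whether gestures are semantically appropriate for the speech, which is why subjective ratings remain necessary.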