Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents 2020
DOI: 10.1145/3383652.3423874

Generating coherent spontaneous speech and gesture from text

Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthe…

Cited by 14 publications (11 citation statements)
References 12 publications (19 reference statements)

“…A humanoid can build its kinematic space from the bottom up by detecting gestures in interaction, constructing a kinematic similarity space over time, and inferring from the distance matrices which gestures are likely to be semantically related (given the assumption that kinematic space and semantic space tend to align). Moreover, the humanoid's own gesture-generation process may be tailored such that there is some weak dependency between the kinematics of gestures that are related in content, thus optimizing its gesture behavior to cohere in a similar way to human gesture [36][37][38]. The current findings thus provide an exciting proof of concept that continuous communicative bodily movements that co-vary in kinematic structure also co-vary in meaning.…”
Section: Discussion
confidence: 76%
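The mechanism this passage describes — inferring semantic relatedness from a kinematic distance matrix — can be made concrete. The following Python sketch is not code from the cited work: it computes pairwise dynamic-time-warping (DTW) distances between toy gesture trajectories and a Mantel-style correlation against a hypothetical semantic distance matrix. All function names and data here are illustrative assumptions.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic-time-warping distance between two (T, D) joint trajectories."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def pairwise_kinematic_distances(gestures):
    """Distance matrix over a list of (T_i, D) gesture trajectories."""
    n = len(gestures)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(gestures[i], gestures[j])
    return dist

def mantel_correlation(d1, d2):
    """Pearson correlation of the upper triangles of two distance matrices."""
    iu = np.triu_indices_from(d1, k=1)
    return float(np.corrcoef(d1[iu], d2[iu])[0, 1])

# Toy data: four gestures as random (frames, joint-coordinates) trajectories.
# A real semantic distance matrix might come from embeddings of the words each
# gesture accompanies; here it is random purely for illustration.
rng = np.random.default_rng(0)
gestures = [rng.normal(size=(30 + 5 * k, 9)) for k in range(4)]
semantic = np.abs(rng.normal(size=(4, 4)))
semantic = (semantic + semantic.T) / 2.0
np.fill_diagonal(semantic, 0.0)

kinematic = pairwise_kinematic_distances(gestures)
print("kinematic-semantic alignment r =", mantel_correlation(kinematic, semantic))
```

A full Mantel test would additionally permute one matrix to assess significance; the correlation alone suffices to illustrate the "kinematic space and semantic space tend to align" assumption.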
“…However, the vast majority of these agents use an incoherent setup in which the speech synthesis is trained on a different dataset than the gesture generation, and style and speaker identity may differ between components. In fact, the only system we are aware of in which both components were explicitly trained on the same dataset is the one in [3], and we consequently use their approach as the baseline pipeline for our experiments in Section 5.…”
Section: Towards Integrated Multimodal Synthesis
confidence: 99%
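The coherence criterion in this statement — speech and gesture components trained on the same recordings, so that voice, style, and speaker identity match — can be written down schematically. This is a hypothetical illustration, not the cited system's code; the corpus names are placeholders.

```python
from dataclasses import dataclass

@dataclass
class MultimodalPipeline:
    """Pairs a text-to-speech model with a gesture-generation model."""
    tts_corpus: str       # recordings the speech synthesiser was trained on
    gesture_corpus: str   # recordings the gesture model was trained on

    def is_coherent(self) -> bool:
        # The criterion from the quote above: both components trained on the
        # same data, so speaker identity and style match across modalities.
        return self.tts_corpus == self.gesture_corpus

# Typical mismatched setup vs. a single-corpus setup.
mismatched = MultimodalPipeline(tts_corpus="tts-corpus-A", gesture_corpus="mocap-corpus-B")
matched = MultimodalPipeline(tts_corpus="speech-gesture-corpus", gesture_corpus="speech-gesture-corpus")
print(mismatched.is_coherent(), matched.is_coherent())  # False True
```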
“…We used only the upper-body data and removed the fingers (but not the hand orientation) due to low capture accuracy there. For visualization, we instead used fixed, lightly cupped hands on the avatar, similar to [3].…”
Section: Data, 4.1 Training Corpus
confidence: 99%
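The preprocessing this quote describes (upper body only, finger joints removed, hand orientation kept) amounts to a joint-selection filter over the motion-capture skeleton. Below is a minimal sketch under assumed BVH-style joint names; the cited work's actual skeleton and joint list may differ.

```python
# Hypothetical BVH-style joint-name fragments; not the cited paper's list.
FINGER_MARKERS = ("Thumb", "Index", "Middle", "Ring", "Pinky")
LOWER_BODY_MARKERS = ("UpLeg", "Leg", "Foot", "Toe")

def keep_joint(name: str) -> bool:
    if any(m in name for m in FINGER_MARKERS):
        return False   # finger joints: dropped due to low capture accuracy
    if any(m in name for m in LOWER_BODY_MARKERS):
        return False   # lower body: not used
    return True        # LeftHand/RightHand survive, so hand orientation is kept

skeleton = ["Hips", "Spine", "Neck", "Head",
            "LeftArm", "LeftForeArm", "LeftHand", "LeftHandIndex1",
            "RightArm", "RightForeArm", "RightHand", "RightHandThumb2",
            "LeftUpLeg", "RightLeg", "RightFoot"]
print([j for j in skeleton if keep_joint(j)])
# ['Hips', 'Spine', 'Neck', 'Head', 'LeftArm', 'LeftForeArm', 'LeftHand',
#  'RightArm', 'RightForeArm', 'RightHand']
```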