Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Existing robots use rule-based speech-gesture association, but implementing it requires human labor and expert prior knowledge. We present a learning-based co-speech gesture generation model trained on 52 h of TED talks. The proposed end-to-end neural network model consists of an encoder for speech text understanding and a decoder that generates a sequence of gestures. The model successfully produces various gestures, including iconic, metaphoric, deictic, and beat gestures. In a subjective evaluation, participants reported that the gestures were human-like and matched the speech content. We also demonstrate co-speech gesture generation with a NAO robot working in real time.
For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
Poly[o(m,p)-phenylenevinylene-alt-2-methoxy-5-(2-ethylhexyloxy)-p-phenylenevinylene], o(m,p)-PMEH-PPV, and poly[o(m,p)-phenylenevinylene-alt-2,5-bis(trimethylsilyl)-p-phenylenevinylene], o(m,p)-PBTMS-PPV, of varying effective conjugation lengths were synthesized by the well-known Wittig condensation polymerization between the appropriate diphosphonium salts and the dialdehyde monomers such as terephthaldicarboxaldehyde, isophthalaldehyde, and phthalicdicarboxaldehyde. The conjugation lengths of the polymers were controlled by substituents and kink (ortho and meta) linkages. The resulting polymers were highly soluble in common organic solvents. The synthesized polymers showed UV−visible absorbance and photoluminescence (PL) in the ranges of 330−430 nm and 440−550 nm, respectively. The maximum emission peak of p-PMEH-PPV was blueshifted about 30 nm compared to that of MEH-PPV, which is due to an unsubstituted phenylene unit. In addition, o-PMEH-PPV and m-PMEH-PPV showed PL emission maximum peaks at 500 and 490 nm, respectively, because ortho and meta linkage of the o(m)-PMEH-PPV reduced π-conjugation lengths. The trimethylsilyl substituent has no electron-donating effect; therefore, the PL maximum of p-PBTMS-PPV was far more blueshifted (to 485 nm). Consequently, maximum PL wavelengths for o-PBTMS-PPV and m-PBTMS-PPV appeared around 470 and 440 nm, respectively. A single-layer light-emitting diode device was fabricated which has a simple ITO (indium−tin oxide)/polymer/Al configuration. The threshold bias of trimethylsilyl-substituted o(m,p)-PBTMS-PPV was in the range of 8−9 V. As in the photoluminescence spectra, the dramatic change of emission color was also shown in electroluminescence spectra of p-PMEH-PPV, p-PBTMS-PPV, and o-PBTMS-PPV when the operating voltage was about 8−9 V.
Two new fully conjugated alternating copolymers containing both carbazole and oxadiazole units were prepared through the Wittig condensation polymerization (carbazole units were linked with oxadiazole units through meta and para linkages). The polymers with the para linkage (PPOX-CAR) and the meta linkage (PMOX-CAR) in the main chain were soluble in common organic solvents and thermally stable on heating (the weight loss was less than 5% on heating to about 400 °C under nitrogen atmosphere). The maximum photoluminescence and electroluminescence wavelengths of PPOX-CAR and PMOX-CAR varied from 495 nm in the greenish-blue emission region to 450 nm in the blue emission region, depending on the kink structure. The turn-on voltages of PPOX-CAR and PMOX-CAR were 7.5 and 10.5 V, respectively, when the single-layer light-emitting diodes of Al/PPOX-CAR or PMOX-CAR/ITO glass were fabricated. The maximum brightness of the Al/PPOX-CAR/ITO single-layer device was 500 cd/m² at 20 V.