Synthesizing Talking Faces from Text and Audio: An Autoencoder and Sequence-to-Sequence Convolutional Neural Network

Liu, Na; Zhou, Tao; Ji, Yao; Zhao, Ziyi; Wan, Lihong

doi:10.1016/j.patcog.2020.107231

Cited by 16 publications

(3 citation statements)

References 30 publications

“…Therefore, many researchers have also started to investigate the use of deep convolutional neural networks to extract features for human pose estimation. In recent times, several technical solutions have achieved good performance [13]. A typical example is the OpenPose scheme based on deep convolutional neural networks developed by CMU.…”

Section: Introduction (mentioning)

confidence: 99%

Convolution-Based Design for Real-Time Pose Recognition and Character Animation Generation

Wang

Lee

2022

Wireless Communications and Mobile Computing

Human pose recognition and generation are key elements of animation design. To this end, this paper designs new neural network structures for 2D and 3D pose extraction tasks and corresponding GPU-oriented acceleration schemes. The scheme first takes an image as input, extracts the human pose from it, converts it into an abstract pose data structure, and then uses the converted dataset as a basis to generate the desired character animation based on the input at runtime. The scheme has been tested on pose recognition datasets and on different levels of hardware, showing that 2D pose recognition can reach speeds above 60 fps on common computer hardware, 3D pose recognition is estimated to reach speeds above 24 fps with an average error of only 110 mm, and real-time animation generation can reach speeds above 30 fps.

“…Human emotions can be perceived not only through explicit facial expressions [1], voice information [2], or text cues [3], but also through implicit body language, including eye movements [4], body postures [5], and gait traits [6]. Nonverbal communication plays a major role in recent human-robot interaction (HRI) [7].…”

Section: Introduction (mentioning)

confidence: 99%

Data augmentation by separating identity and emotion representations for emotional gait recognition

Sheng

2023

Robotica

Human-centered intelligent human–robot interaction can transcend the traditional keyboard and mouse and have the capacity to understand human communicative intentions by actively mining implicit human clues (e.g., identity information and emotional information) to meet individuals’ needs. Gait is a unique biometric feature that can provide reliable information to recognize emotions even when viewed from a distance. However, the insufficient amount and diversity of training data annotated with emotions severely hinder the application of gait emotion recognition. In this paper, we propose an adversarial learning framework for emotional gait dataset augmentation, with which a two-stage model can be trained to generate a number of synthetic emotional samples by separating identity and emotion representations from gait trajectories. To our knowledge, this is the first work to realize the mutual transformation between natural gait and emotional gait. Experimental results reveal that the synthetic gait samples generated by the proposed networks are rich in emotional information. As a result, the emotion classifier trained on the augmented dataset is competitive with state-of-the-art gait emotion recognition works.

“…Face animation synthesis has attracted increasing attention in academic and industrial fields, and is considered essential in the real-life applications of human-computer interaction, online teaching, film making, virtual reality, and computer games, among others [1,2,3]. Traditionally, facial synthesis in computer-generated imagery (CGI) has been performed using face capture methods.…”

Section: Introduction (mentioning)

confidence: 99%

Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion

Chen

Liu

Liu

et al. 2022

Preprint

Talking head generation aims to synthesize a lip-synchronized talking head video from an arbitrary face image and corresponding audio clips. Existing methods ignore not only the interaction and relationship of cross-modal information, but also the local driving information of the mouth muscles. In this study, we propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to promote the relationship learning of cross-modal features. In addition, our proposed method uses both audio- and speech-related facial action units (AUs) as driving information. Speech-related AU information can guide mouth movements more accurately. Because speech is highly correlated with speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. We utilize a pre-trained AU classifier to ensure that the generated images contain correct AU information. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy.
