2022
DOI: 10.1145/3528223.3530094

AvatarCLIP

Abstract: 3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solel…
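
The abstract describes a CLIP-guided pipeline: the avatar is held in a differentiable representation, rendered views are scored against the text prompt with CLIP, and the avatar parameters are optimized to raise that score. Below is a minimal, hedged sketch of such an optimization loop; `render_avatar` and `sample_random_camera` are illustrative stand-ins for a differentiable renderer and a camera sampler, not functions from the AvatarCLIP release, and the avatar latent is a placeholder for the paper's actual geometry and texture parameters.

```python
# Minimal sketch of CLIP-guided avatar optimization (assumed setup, not the
# released AvatarCLIP code).
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for this sketch

prompt = "a tall and skinny female soldier"  # example text prompt
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def sample_random_camera():
    # placeholder: a random azimuth angle standing in for a full camera pose
    return torch.rand(1, device=device) * 2 * torch.pi

def render_avatar(params, view):
    # placeholder "renderer": any differentiable map from avatar parameters to a
    # 224x224 RGB image (proper CLIP normalization omitted) would slot in here
    img = params.view(1, 1, 16, 16).repeat(1, 3, 14, 14)  # (1, 3, 224, 224)
    return torch.sigmoid(img + view)

avatar_params = torch.randn(1, 256, device=device, requires_grad=True)  # placeholder latent
optimizer = torch.optim.Adam([avatar_params], lr=5e-3)

for step in range(2000):
    image = render_avatar(avatar_params, sample_random_camera())
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()  # maximize CLIP similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Sampling a fresh camera each step discourages degenerate solutions that only look right from one view, which is the usual reason multi-view rendering appears in CLIP-guided 3D generation.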

Cited by 88 publications (29 citation statements)
References 46 publications
“…In addition to image-to-text conversion, Hong et al. designed a zero-shot text-driven framework for 3D avatar generation and animation, named Avatar Contrastive Language-Image Pre-Training (AvatarCLIP). [110] As shown in Figure 3e, AvatarCLIP can create a customized 3D avatar with the user's expected shape and texture and make the avatar follow motions described in text. Specifically, the generated 3D human geometry is initialized from shapes driven by natural-language descriptions through a Variational Autoencoder (VAE) network.…”
Section: Advanced Image Sensors
Mentioning confidence: 99%
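
The statement above notes that AvatarCLIP initializes the body geometry from a shape VAE driven by the natural-language description. A rough sketch of that selection idea under assumed names: `ShapeVAE` and `render_neutral_pose` below are illustrative stand-ins, not modules from the published code; candidate latents are decoded to SMPL-style shape parameters, and the candidate whose rendering scores highest under CLIP for the prompt is kept as the initial geometry.

```python
# Hedged sketch of picking a coarse body shape with a shape VAE + CLIP scores.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()

class ShapeVAE(nn.Module):
    # placeholder shape VAE; only the decoder is needed for this selection step
    latent_dim = 32
    def __init__(self):
        super().__init__()
        self.dec = nn.Linear(self.latent_dim, 10)  # 10 SMPL-style shape parameters

    def decode(self, z):
        return self.dec(z)

def render_neutral_pose(betas):
    # placeholder: stands in for rendering a neutral-pose body from shape params
    proj = torch.randn(10, 3 * 224 * 224, device=betas.device)
    return torch.sigmoid(betas @ proj).view(-1, 3, 224, 224)

vae = ShapeVAE().to(device).eval()
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a muscular man"]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    z = torch.randn(16, vae.latent_dim, device=device)   # candidate latents
    betas = vae.decode(z)                                 # (16, 10) shape parameters
    img_feat = model.encode_image(render_neutral_pose(betas))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    best = (img_feat @ text_feat.t()).squeeze(-1).argmax()

init_shape = betas[best]  # coarse shape used to initialize the avatar geometry
```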
“…Diffusion generative models have achieved impressive success in a wide variety of computer vision tasks such as image inpainting [31], text-to-image generation [30], and image-to-image translation [4]. Given their strong capability to bridge the large gap between a highly uncertain distribution and a determinate one, several works have utilized diffusion generative models for text-to-motion generation [43,33,42]. Zhang et al. [42] propose a versatile motion-generation framework that incorporates a diffusion model to generate diverse motions from comprehensive texts.…”
Section: Diffusion Generative Models
Mentioning confidence: 99%
“…Given their strong capability to bridge the large gap between a highly uncertain distribution and a determinate one, several works have utilized diffusion generative models for text-to-motion generation [43,33,42]. Zhang et al. [42] propose a versatile motion-generation framework that incorporates a diffusion model to generate diverse motions from comprehensive texts. Similarly, Tevet et al. [33] introduce a lightweight transformer-based diffusion generative model that can achieve text-to-motion generation and motion editing.…”
Section: Diffusion Generative Models
Mentioning confidence: 99%
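
For reference, the reverse diffusion process these statements build on can be written as a short DDPM-style sampling loop. In the sketch below, `denoiser` stands for a trained network that predicts the noise added to a motion tensor (frames × pose features) given a timestep and a text embedding; the linear noise schedule and the 64-dimensional pose features are generic assumptions, not values taken from any of the cited papers.

```python
# DDPM-style sampling sketch for text-conditioned motion generation.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_motion(denoiser, text_emb, n_frames=120, pose_dim=64):
    x = torch.randn(1, n_frames, pose_dim)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.tensor([t]), text_emb)       # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise              # one reverse step
    return x  # denoised motion sequence, (1, n_frames, pose_dim)

# usage with a trivial stand-in; a trained text-conditioned network goes here
denoiser = lambda x, t, cond: torch.zeros_like(x)
motion = sample_motion(denoiser, text_emb=torch.randn(1, 512))
```

Conditioning tricks such as classifier-free guidance used by the cited methods would change how `eps` is predicted, but the overall loop keeps this shape.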
“…In contrast, our work uses a transformer architecture to learn temporal correlations over sequences of shapes. Transformers for generating sequences of human bodies have also recently been explored by Song et al. [SWJ*22], who concentrate on a multi-person skeleton generation use case, and by Hong et al. [HZP*22] for the generation of human body animations from text input. The recent work by Petrovich et al. [PBV21] is closest in spirit to ours: it introduces Actor, a transformer variational autoencoder for action-conditioned generation of human body poses.…”
Section: Related Work
Mentioning confidence: 99%
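
An Actor-style transformer variational autoencoder of the kind mentioned above can be outlined in a few lines. The module below is an illustrative sketch rather than the published ACTOR implementation: a transformer encoder pools a pose sequence into a Gaussian latent, and a transformer decoder reconstructs the sequence from that latent; dimensions, layer counts, and the zero-initialized time queries are placeholder choices.

```python
# Illustrative transformer VAE over pose sequences (not the published ACTOR code).
import torch
import torch.nn as nn

class MotionTransformerVAE(nn.Module):
    def __init__(self, pose_dim=72, d_model=256, latent_dim=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.from_latent = nn.Linear(latent_dim, d_model)
        self.out = nn.Linear(d_model, pose_dim)

    def forward(self, poses):                      # poses: (B, T, pose_dim)
        h = self.encoder(self.embed(poses)).mean(dim=1)           # pooled sequence code
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        memory = self.from_latent(z).unsqueeze(1)  # latent as a single memory token
        queries = torch.zeros_like(self.embed(poses))             # placeholder time queries
        recon = self.out(self.decoder(queries, memory))
        return recon, mu, logvar

vae = MotionTransformerVAE()
recon, mu, logvar = vae(torch.randn(2, 60, 72))    # batch of 2 sequences, 60 frames each
```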