ChallenCap: Monocular 3D Capture of Challenging Human Performances using Multi-Modal References

He, Yannan; Pang, Anqi; Chen, Xin; Liang, Han; Wu, Minye; Ma, Yuexin; Xu, Lan

doi:10.1109/cvpr46437.2021.01124

Cited by 26 publications

(9 citation statements)

References 77 publications

“…Besides, TCMR [114] applies two more GRUs to forecast additional temporal features for the current target pose from the past and future frames. Based on GRUs and a motion discriminator, ChallenCap [198] utilizes multi-modal and hybrid motion references to capture challenging human motions. Lee et al [147] consider the uncertainty-aware embedding and include optical flow information.…”

Section: Recovery From Monocular Videos (mentioning)

confidence: 99%

“…MPoser, an extension of VPoser [22] to temporal sequences, is based on sequential VAE. Inspired by VIBE, He et al [198] generate marker-based motion maps as input to a discriminator to obtain an adversarial motion prior. In HuMoR [158], the probability distribution of possible state transitions is formulated by a conditional variational autoencoder (CVAE).…”

Section: Motion Prior (mentioning)

confidence: 99%

Recovering 3D Human Mesh from Monocular Images: A Survey

Tian, Zhang, Liu et al. 2022

Preprint

Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention. With the same goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts are devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task is still challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey to focus on the task of monocular 3D human mesh recovery. We start with the introduction of body models and then elaborate recovery frameworks and training objectives by providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed in the end, hoping to motivate researchers and facilitate their research in this area. A regularly updated project page can be found at https://github.com/tinatiansjz/hmr-survey.

“…There are many real-time motion capture solutions available and we adopt the recent single camera technique [He et al 2021] for convenience. It is able to detect 21 key points of skeletons.…”

Section: D Vs 3d Rendering (mentioning)

confidence: 99%

“…In this specific case, a user should not only be able to omnidirectionally watch the virtual trainer's moves but also compare their own moves with the trainer. In our implementation, we use a single camera motion capture solution [He et al 2021] that estimates 3D skeleton structures of users as they move. We also precompute the "ground truth" skeleton moves of the trainer, by first rendering a multiview video of whole body movements also using NeuVV and then conducting multi-view skeleton estimation.…”

Section: Ground Truth (mentioning)

confidence: 99%

NeuVV: Neural Volumetric Videos with Immersive Rendering and Editing

Zhang, Wang, Liu et al. 2022

Preprint

Self Cite

“…To capture humans in challenging poses and motion, ChallenCap [7] uses a learning-and-optimization framework, which learns motion characteristics and is trained on "a new challenging human motion dataset with both unsynchronized marker-based and light-weight multiimage references" [7]. First, a noisy skeletal motion map is acquired, this is then processed in their temporal encoder-decoder and generation network HybridNet and their motion discriminator, which is followed by a robust motion optimization phase, which uses 2D keypoint and silhouette information from the video frames.…”

Section: Fitting a Body Model To Image Sequences (mentioning)

confidence: 99%

Imposing Temporal Consistency on Deep Monocular Body Shape and Pose Estimation

Zimmer, Hilsmann, Morgenstern et al. 2022

Preprint

Accurate and temporally consistent modeling of human bodies is essential for a wide range of applications, including character animation, understanding human social behavior and AR/VR interfaces. Capturing human motion accurately from a monocular image sequence is still challenging and the modeling quality is strongly influenced by the temporal consistency of the captured body motion. Our work presents an elegant solution for the integration of temporal constraints in the fitting process. This does not only increase temporal consistency but also robustness during the optimization. In detail, we derive parameters of a sequence of body models, representing shape and motion of a person, including jaw poses, facial expressions, and finger poses. We optimize these parameters over the complete image sequence, fitting one consistent body shape while imposing temporal consistency on the body motion, assuming linear body joint trajectories over a short time. Our approach enables the derivation of realistic 3D body models from image sequences, including facial expression and articulated hands. In extensive experiments, we show that our approach results in accurately estimated body shape and motion, also for challenging movements and poses. Further, we apply it to the special application of sign language analysis, where accurate and temporal consistent motion modelling is essential, and show that the approach is well-suited for this kind of application.
