Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras

Elhayek, Ahmed; Aguiar, Edilson de; Jain, Arjun; Tompson, Jonathan; Pishchulin, Leonid; Andriluka, Mykhaylo; Bregler, Christoph; Schiele, Bernt; Theobalt, Christian

doi:10.1109/cvpr.2015.7299005

Cited by 134 publications

(102 citation statements)

References 41 publications

(72 reference statements)

“…Multi-view 3D human pose: Markerless motion capture has been investigated in computer vision for a decade. Early works on this problem aim to track the 3D skeleton or geometric model of human body through a multi-view sequence [38,43,11]. These tracking-based methods require initialization in the first frame and are prone to local optima and tracking failures.…”

Section: Related Work (mentioning)

confidence: 99%

Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views

Dong, Wen, Huang, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

171

176

This paper addresses the problem of 3D pose estimation for multiple people in a few calibrated camera views. The main challenge of this problem is to find the cross-view correspondences among noisy and incomplete 2D pose predictions. Most previous methods address this challenge by directly reasoning in 3D using a pictorial structure model, which is inefficient due to the huge state space. We propose a fast and robust approach to solve this problem. Our key idea is to use a multi-way matching algorithm to cluster the detected 2D poses in all views. Each resulting cluster encodes 2D poses of the same person across different views and consistent correspondences across the keypoints, from which the 3D pose of each person can be effectively inferred. The proposed convex optimization based multi-way matching algorithm is efficient and robust against missing and false detections, without knowing the number of people in the scene. Moreover, we propose to combine geometric and appearance cues for cross-view matching. The proposed approach achieves significant performance gains from the state-of-the-art (96.3% vs. 90.6% and 96.9% vs. 88% on the Campus and Shelf datasets, respectively), while being efficient for real-time applications.

“…For example, the most widely used system [2] needs multiple calibrated cameras with reflective markers carefully attached to the subjects' body. The actively-studied markerless approaches are also based on multi-view systems [18,26,16,22,23] or depth cameras [46,7]. For this reason, the amount of available 3D motion data is extremely limited.…”

Section: Introduction (mentioning)

confidence: 99%

Monocular Total Capture: Posing Face, Body, and Hands in the Wild

Xiang, Joo, Sheikh 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

334

262

Figure 1: We present the first method to simultaneously capture the 3D total body motion of a target person from a monocular view input. For each example, (left) input image and (right) 3D total body motion capture results overlaid on the input.

Abstract: We present the first method to capture the 3D total motion of a target person from a monocular view input. Given an image or a monocular video, our method reconstructs the motion from body, face, and fingers represented by a 3D deformable mesh model. We use an efficient representation called 3D Part Orientation Fields (POFs), to encode the 3D orientations of all body parts in the common 2D image space. POFs are predicted by a Fully Convolutional Network (FCN), along with the joint confidence maps. To train our network, we collect a new 3D human motion dataset capturing diverse total body motion of 40 subjects in a multiview system. We leverage a 3D deformable human model to reconstruct total body pose from the CNN outputs by exploiting the pose and shape prior in the model. We also present a texture-based tracking method to obtain temporally coherent motion capture output. We perform thorough quantitative evaluations including comparison with the existing body-specific and hand-specific methods, and performance analysis on camera viewpoint and human pose changes. Finally, we demonstrate the results of our total body motion capture on various challenging in-the-wild videos. Our code and newly collected human motion dataset will be publicly shared.

“…Although there exists a commercial solution that uses marker-less multi-camera systems to obtain highly precise skeleton data at 120 frames per second (FPS) and approximately 25-50ms latency [99], computing depth maps is usually slow and often suffers from problems such as failures of correspondence search and noisy depth information. To address these problems, algorithms were also studied to construct human skeleton models directly from the multi-images without calculating the depth image [80,81,82]. For example, Gall et al [81] introduced an approach to fully-automatically estimate the 3D skeleton model from a multi-perspective video sequence, where an articulated template model and silhouettes are obtained from the sequence.…”

Section: Construction From RGB Imagery (mentioning)

confidence: 99%

Space-time representation of people based on 3D skeletal data: A review

Han, Reily, Hoff, et al. 2017

Computer Vision and Image Understanding

273

192

Spatiotemporal human representation based on 3D visual perception data is a rapidly growing research area. Representations can be broadly categorized into two groups, depending on whether they use RGB-D information or 3D skeleton data. Recently, skeleton-based human representations have been intensively studied and kept attracting an increasing attention, due to their robustness to variations of viewpoint, human body scale and motion speed as well as the realtime, online performance. This paper presents a comprehensive survey of existing space-time representations of people based on 3D skeletal data, and provides an informative categorization and analysis of these methods from the perspectives, including information modality, representation encoding, structure and transition, and feature engineering. We also provide a brief overview of skeleton acquisition devices and construction methods, enlist a number of benchmark datasets with skeleton data, and discuss potential future research directions.
