Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images

Cai, Yi; Ge, Liuhao; Cai, Jianfei; Yuan, Junsong

doi:10.1007/978-3-030-01231-1_41

Cited by 257 publications (345 citation statements)

References 41 publications (90 reference statements)

Citation statement types: Supporting · Mentioning (328) · Contrasting


“…Note that the StereoHands benchmark is close to saturation. In contrast to other methods [4,20,37,65,80] that only predicts sparse skeleton keypoints, our model produces a dense hand mesh. Figure A.1 presents some qualitative results from this dataset.…”

Section: A2 Mano Pose Representation (mentioning)

confidence: 97%

Learning Joint Reconstruction of Hands and Manipulated Objects

Hasson, Varol, Tzionas, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.


“…The earlier work [68] attempts to learn the direct mapping from RGB images to 3D skeletons. Recent methods [4,14] have shown the state-of-the-art accuracy by implicitly reconstructing depth images i.e. 2.5D representations, and estimating the 3D skeletal based on them.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, even under this setting, the problem still remains challenging as estimating a 3D mesh given an RGB image is a seriously ill-posed problem. Adopting recent human body pose estimation approaches [35,57], we further stratify learning of $f_{\mathrm{DHPE}}: X \to Y$ by decomposing $f_{\mathrm{HME}}: X \to V$ into a 2D evidence estimator $f_{\mathrm{E2D}}: X \to Z$ and a 3D mesh estimator $f_{\mathrm{E3D}}: Z \to V$. Our 2D evidence $z \in Z$ consists of a 42-dimensional 2D skeletal joint position vector $j_{\mathrm{2D}}$ (21 positions × 2; as in [4,14]) and a 2,048-dimensional 2D feature vector $F(x)$ (Eq. 2).…”

Section: Proposed Dense Hand Pose Estimator (mentioning)

confidence: 99%
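To make the dimensions in the statement above concrete, here is a minimal Python sketch that assembles a 2D evidence vector from 21 joint positions and a 2,048-dimensional feature vector. The function name `build_evidence` and the choice to concatenate the two parts into one flat list are illustrative assumptions; the quoted text only says what the evidence consists of, not how it is stored.

```python
NUM_JOINTS = 21      # 2D skeletal joints, as stated in the quoted passage
FEATURE_DIM = 2048   # length of the 2D feature vector F(x)


def build_evidence(joints_2d, features):
    """Flatten 21 (u, v) joints to 42 values and append the feature vector.

    Hypothetical helper: concatenation is an assumption made for illustration.
    """
    if len(joints_2d) != NUM_JOINTS:
        raise ValueError("expected 21 two-dimensional joint positions")
    if len(features) != FEATURE_DIM:
        raise ValueError("expected a 2,048-dimensional feature vector")
    # Interleave coordinates as u0, v0, u1, v1, ...
    flat_joints = [coord for (u, v) in joints_2d for coord in (u, v)]
    return flat_joints + list(features)


joints = [(float(i), float(i) + 0.5) for i in range(NUM_JOINTS)]
z = build_evidence(joints, [0.0] * FEATURE_DIM)
print(len(z))  # 42 joint coordinates + 2,048 feature values
```

Under this assumed layout the evidence vector has 42 + 2,048 = 2,090 entries.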

“…1) is important as it helps understand e.g. human-object interactions [7,6,3,1] and perform robotic Discriminative methods based on convolutional neural networks (CNNs) have shown very promising performance in estimating 3D hand poses either from RGB images [43,68,4,14,29,46] or depth maps [65,30,50,58,30,64,28,38,64,2]. However, the predictions are based on coarse skeletal representations, and no explicit kinematics and geometric mesh constraints are often considered.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, several methods were developed [68,34,4,14]. While directly lifting 2D estimations to 3D was attempted in [68], 2.5D depth maps are estimated as clues for 3D lifting in state-of-the-art techniques [4,14]. In this paper, we exploit a deformable 3D hand mesh model, which inherently offers a full description of both hand shapes and articulations, 3D priors for recovering depths, and self-data augmentation.…”

Section: Introduction (mentioning)

confidence: 99%


Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering

Baek, Kim 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


Estimating 3D hand meshes from single RGB images is challenging, due to intrinsic 2D-3D mapping ambiguities and limited training data. We adopt a compact parametric 3D hand model that represents deformable and articulated hand meshes. To achieve the model fitting to RGB images, we investigate and contribute in three ways: 1) Neural rendering: inspired by recent work on human body, our hand mesh estimator (HME) is implemented by a neural network and a differentiable renderer, supervised by 2D segmentation masks and 3D skeletons. HME demonstrates good performance for estimating diverse hand shapes and improves pose estimation accuracies. 2) Iterative testing refinement: Our fitting function is differentiable. We iteratively refine the initial estimate using the gradients, in the spirit of iterative model fitting methods like ICP. The idea is supported by the latest research on human body. 3) Self-data augmentation: collecting sized RGB-mesh (or segmentation mask)-skeleton triplets for training is a big hurdle. Once the model is successfully fitted to input RGB images, its meshes i.e. shapes and articulations, are realistic, and we augment view-points on top of estimated dense hand poses. Experiments using three RGB-based benchmarks show that our framework offers beyond state-of-the-art accuracy in 3D pose estimation, as well as recovers dense 3D hand shapes. Each technical component above meaningfully improves the accuracy in the ablation study.
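The iterative testing refinement described in this abstract (refining an initial estimate with gradients of a differentiable fitting function) can be illustrated with a toy Python sketch. This is not the paper's hand mesh estimator or neural renderer: as a stated simplification, the only fitted parameter is a 2D translation, the loss is a mean squared keypoint error, and the name `refine_translation`, the learning rate, and the step count are all hypothetical.

```python
def refine_translation(model_pts, observed_pts, init_t, lr=0.25, steps=50):
    """Refine a 2D translation by gradient descent on mean squared error.

    Toy stand-in for gradient-based refinement of an initial estimate;
    the loss is mean_i ||p_i + t - q_i||^2 over paired 2D points.
    """
    tx, ty = init_t
    n = len(model_pts)
    for _ in range(steps):
        # Analytic gradient of the loss with respect to (tx, ty)
        gx = sum(2.0 * (px + tx - qx)
                 for (px, _), (qx, _) in zip(model_pts, observed_pts)) / n
        gy = sum(2.0 * (py + ty - qy)
                 for (_, py), (_, qy) in zip(model_pts, observed_pts)) / n
        tx -= lr * gx
        ty -= lr * gy
    return tx, ty


model = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
observed = [(x + 0.3, y - 0.2) for x, y in model]  # model shifted by a known offset
print(refine_translation(model, observed, init_t=(0.0, 0.0)))
```

With these assumed settings each step halves the remaining error, so the estimate moves from the initial guess toward the offset used to generate `observed`.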


Deformable Pose Traversal Convolution for 3D Action and Gesture Recognition

Weng, Liu, Jiang, et al. 2018

Computer Vision – ECCV 2018

Self Cite


The representation of 3D pose plays a critical role for 3D action and gesture recognition. Rather than representing a 3D pose directly by its joint locations, in this paper, we propose a Deformable Pose Traversal Convolution Network that applies one-dimensional convolution to traverse the 3D pose for its representation. Instead of fixing the receptive field when performing traversal convolution, it optimizes the convolution kernel for each joint, by considering contextual joints with various weights. This deformable convolution better utilizes the contextual joints for action and gesture recognition and is more robust to noisy joints. Moreover, by feeding the learned pose feature to a LSTM, we perform end-to-end training that jointly optimizes 3D pose representation and temporal sequence recognition. Experiments on three benchmark datasets validate the competitive performance of our proposed method, as well as its efficiency and robustness to handle noisy joints of pose.
