Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild

Panteleris, Paschalis; Oikonomidis, Iason; Argyros, Antonis A.

doi:10.1109/wacv.2018.00054

Cited by 202 publications

(157 citation statements)

References 74 publications

Supporting

Mentioning

148

Contrasting

Unclassified


“…The overall hand pose estimation accuracy is measured in the area under the curve (AUC) and the ratio of correct keypoints (PCK) with varying thresholds for each [68,4,14]. For comparison, we adopt seven hand pose estimation algorithms including five neural networks (CNNs)-based algorithms ( [4,68] for RHD, [14,29] for DO, and [29,68,46] for SHD) and two 3D model fitting-based algorithms [34,19].…”

Section: Methods (mentioning)

confidence: 99%
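The citation statement above reports accuracy as PCK over varying thresholds and as the AUC of that curve. As a rough illustration only, the sketch below computes both with NumPy. The function names, the toy data, and the 20 to 50 threshold range are assumptions made for this example, not values taken from the citing paper.

```python
import numpy as np

def pck_curve(pred, gt, thresholds):
    """Fraction of keypoints whose Euclidean error is at or below each threshold."""
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()
    return np.array([(errors <= t).mean() for t in thresholds])

def auc(thresholds, pck):
    """Area under the PCK curve (trapezoid rule), normalized by the threshold range."""
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return area / (thresholds[-1] - thresholds[0])

# Hypothetical toy data: 2 frames, 3 keypoints each, 3D coordinates.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[0, 0, 0] = 10.0   # keypoint error of 10
pred[0, 1, 0] = 30.0   # keypoint error of 30
pred[1, 2, 0] = 60.0   # keypoint error of 60

thresholds = np.linspace(20.0, 50.0, 4)  # assumed threshold range
pck = pck_curve(pred, gt, thresholds)
score = auc(thresholds, pck)
print(pck, score)
```

Normalizing the area by the threshold range keeps the AUC in [0, 1], so scores computed over different threshold ranges stay on the same scale.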

Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering

Baek

Kim

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

204

173


Estimating 3D hand meshes from single RGB images is challenging, due to intrinsic 2D-3D mapping ambiguities and limited training data. We adopt a compact parametric 3D hand model that represents deformable and articulated hand meshes. To achieve the model fitting to RGB images, we investigate and contribute in three ways: 1) Neural rendering: inspired by recent work on human body, our hand mesh estimator (HME) is implemented by a neural network and a differentiable renderer, supervised by 2D segmentation masks and 3D skeletons. HME demonstrates good performance for estimating diverse hand shapes and improves pose estimation accuracies. 2) Iterative testing refinement: Our fitting function is differentiable. We iteratively refine the initial estimate using the gradients, in the spirit of iterative model fitting methods like ICP. The idea is supported by the latest research on human body. 3) Self-data augmentation: collecting sized RGB-mesh (or segmentation mask)-skeleton triplets for training is a big hurdle. Once the model is successfully fitted to input RGB images, its meshes i.e. shapes and articulations, are realistic, and we augment view-points on top of estimated dense hand poses. Experiments using three RGB-based benchmarks show that our framework offers beyond state-of-the-art accuracy in 3D pose estimation, as well as recovers dense 3D hand shapes. Each technical component above meaningfully improves the accuracy in the ablation study.
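The abstract's second contribution, refining an initial estimate at test time by following the gradients of a differentiable fitting function, can be illustrated with a deliberately small stand-in. The sketch below is not the paper's hand mesh estimator. It assumes a hypothetical model with only a scale and a 2D shift, fitted to observed 2D keypoints by plain gradient descent with hand-derived gradients.

```python
import numpy as np

def refine(template, observed, scale, shift, lr=0.1, steps=500):
    """Gradient descent on the mean squared 2D keypoint error over (scale, shift)."""
    for _ in range(steps):
        residual = scale * template + shift - observed   # shape (K, 2)
        # Analytic gradients of mean_k ||scale * x_k + shift - y_k||^2.
        grad_scale = 2.0 * np.mean(np.sum(residual * template, axis=1))
        grad_shift = 2.0 * residual.mean(axis=0)
        scale = scale - lr * grad_scale
        shift = shift - lr * grad_shift
    return scale, shift

# Hypothetical template keypoints and a synthetic observation.
template = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
observed = 2.0 * template + np.array([0.5, -0.3])

# Start from a rough initial estimate and refine it.
scale, shift = refine(template, observed, scale=1.0, shift=np.zeros(2))
print(scale, shift)
```

In a real system the gradients would come from automatic differentiation through the network and renderer; the closed-form gradients here only keep the example self-contained.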


“…The availability of commodity RGB-D sensors [25,48,59] led to significant progress in estimating 3D hand pose given depth or RGB-D input [17,24,39,40]. Recently, the community has shifted its focus to RGB-based methods [20,37,45,60,80]. To overcome the lack of 3D annotated data, many methods employed synthetic training images [9,33,37,38,80].…”

Section: Related Work (mentioning)

confidence: 99%

“…Unlike methods that predict only skeletons, our focus is to output a dense hand mesh to be able to infer interactions with objects. Very recently, Panteleris et al [45] and Malik et al [33] produce full hand meshes. However, [45] achieves this as a post-processing step by fitting to 2D predictions.…”

Section: Related Work (mentioning)

confidence: 99%

“…the energy consumption and environmental constrains such as distance to the target and exposure to sunlight. Recent work obtains promising results for 2D and 3D hand pose estimation from monocular RGB images using convolutional neural networks [9,20,37,45,60,61,80]. Most of this work, however, targets sparse keypoint estimation which is not sufficient for reasoning about hand-object contact.…”

Section: Introduction (mentioning)

confidence: 99%

“…Most of this work, however, targets sparse keypoint estimation which is not sufficient for reasoning about hand-object contact. Full 3D hand meshes are sometimes estimated from images by fitting a hand mesh to detected joints [45] or by tracking given a good initialization [8]. Recently, the 3D shape or surface of a hand using an end-to-end learnable model has been addressed with depth input [33].…”

Section: Introduction (mentioning)

confidence: 99%


Learning Joint Reconstruction of Hands and Manipulated Objects

Hasson

Varol

Tzionas

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

423

595


Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.


Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images

Cai

et al. 2018

Lecture Notes in Computer Science

261

328


Compared with depth-based 3D hand pose estimation, it is more challenging to infer 3D hand pose from monocular RGB images, due to substantial depth ambiguity and the difficulty of obtaining fully-annotated training data. Different from existing learning-based monocular RGB-input approaches that require accurate 3D annotations for training, we propose to leverage the depth images that can be easily obtained from commodity RGB-D cameras during training, while during testing we take only RGB inputs for 3D joint predictions. In this way, we alleviate the burden of the costly 3D annotations in real-world datasets. In particular, we propose a weakly-supervised method, adapting from a fully-annotated synthetic dataset to a weakly-labeled real-world dataset with the aid of a depth regularizer, which generates depth maps from the predicted 3D pose and serves as weak supervision for 3D pose regression. Extensive experiments on benchmark datasets validate the effectiveness of the proposed depth regularizer in both weakly-supervised and fully-supervised settings.
