2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01081
Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation

Cited by 79 publications (46 citation statements). References 33 publications.
“…Finally, we refine the hand and object poses through graph convolutional blocks equipped with the proposed mutual attention layer. We show that our method does not require iterative optimization as in [48,13], and the dense vertex-level mutual attention can model the hand-object interaction more effectively than sparse keypoints based methods [11,8]. In summary, our contributions are as follows.…”
Section: Introduction
confidence: 95%
“…In [41] a self-attention mechanism is used to capture feature dependencies for either the hand or the object and the interaction between them is modeled by the exchange of global features. Most close to our work is [11] where a cross-attention is used to model the correlation between the hand and the object. However, all above methods only model a sparse interaction between a pre-defined set of keypoints or features from the hand and the object, regardless of the fact that hand-object interaction actually occurs on physical regions of the surfaces.…”
Section: Introduction
confidence: 99%
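The cross-attention between hand and object features described in these citation statements can be illustrated with a minimal sketch. This is not the implementation from any of the cited papers; the feature dimensions, the 21 hand keypoints, and the 8 object points are illustrative assumptions, and the sketch uses plain scaled dot-product attention in both directions (a simple form of "mutual" attention):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Nq, Nk) affinity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (Nq, d) context-mixed features

rng = np.random.default_rng(0)
hand = rng.standard_normal((21, 32))  # e.g. 21 hand keypoint features
obj = rng.standard_normal((8, 32))    # e.g. 8 object corner features

# Hand features updated with object context, and vice versa.
hand_refined = cross_attention(hand, obj, obj)
obj_refined = cross_attention(obj, hand, hand)
print(hand_refined.shape, obj_refined.shape)  # (21, 32) (8, 32)
```

The contrast drawn in the quotes is about where attention is applied: over a sparse set of keypoints as above, versus densely over mesh vertices, where each row of the affinity matrix would cover every vertex of the other surface.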
“…More recent works leverage the increasing capacity of computer vision to collect human hand poses when interacting with the object. HO3D [22,55] computes the ground truth 3D hand pose for images from 2D hand keypoint annotations. The method resolves ambiguities by considering physics constraints in hand-object interactions and hand-hand interactions.…”
Section: Dexterous Grasp Datasets
confidence: 99%