Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks

Cheng, Yu; Wang, Bo; Yang, Bo; Tan, Robby T.

doi:10.1109/cvpr46437.2021.00756

Cited by 41 publications

(23 citation statements)

References 38 publications

“…3DCrowdNet [12] is a top-down method that proposed to concatenate image features and 2D pose heatmaps to exploit the 2D pose-guided features for better accuracy. We also involved 3D skeleton estimation approaches [7,48,59]: Moon et al [48]'s work that estimates the absolute root position and root-relative 3D skeletons focusing on camera distance. Cheng et al [7]'s work that integrates top-down method and bottom-up methods for estimating better 3D skeletons.…”

Section: Methods (mentioning)

confidence: 99%

“…Recent multi-person 3D pose regression works [7,54,59,72] tackled a variety of issues such as developing attention-based mechanism dedicated to the 3D pose estimation problem which considers 3D-to-2D projection process [72], combining the top-down and bottom-up networks [7], developing the tracking-based for multi-person [54] and so on. Sárándi et al [59] recently proposed a metric-scale 3D pose estimation method that is robust to truncations.…”

Section: Related Work (mentioning)

confidence: 99%

Multi-Person 3D Pose and Shape Estimation via Inverse Kinematics and Refinement

Cha, Saqlain, GeonU et al. 2022

Preprint

Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging. Obviously, it is more difficult than estimating 3D poses only in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes more challenging due to the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics from occlusion-robust 3D skeleton estimation and 2) Transformer-based relation-aware refinement techniques. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. Then, we apply inverse kinematics to convert the estimated skeletons to deformable 3D mesh parameters. Finally, we apply Transformer-based mesh refinement that refines the obtained mesh parameters considering intra- and inter-person relations of 3D meshes. Via extensive experiments, we demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS and AGORA datasets.

“…This paper is based on our conference paper [63]. Unlike our conference version, however, we add test time optimization to handle the gap between training and testing data in Section 3.5, which is critical for our method to process unseen videos.…”

Section: Task (mentioning)

confidence: 99%

Dual Networks Based 3D Multi-Person Pose Estimation From Monocular Video

Cheng, Wang, Tan 2023

IEEE Trans. Pattern Anal. Mach. Intell.

Self Cite

Monocular 3D human pose estimation has made progress in recent years. Most methods focus on single persons and estimate poses in person-centric coordinates, i.e., coordinates based on the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and so are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gaps between training and testing data, we perform optimization at test time, refining the estimated 3D human poses using high-order temporal constraints, a re-projection loss, and bone length regularization. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Finally, we apply a semi-supervised method to overcome the scarcity of 3D ground-truth data.
Our evaluations demonstrate the effectiveness of the proposed method and its individual components. Our code and pretrained models are available publicly: https://github.com/3dpose/3D-Multi-Person-Pose.
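
The abstract above names a re-projection loss and bone length regularization as test-time refinement terms but does not give their form. The following is a minimal sketch of one plausible formulation, not the authors' implementation: the pinhole camera model, the `BONES` list, the function names, and the variance-style bone penalty are all assumptions made for illustration, and the high-order temporal constraint is omitted.

```python
import math

# Hypothetical 3-joint skeleton: pairs of joint indices forming bones (assumption).
BONES = [(0, 1), (1, 2)]

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point (X, Y, Z) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_loss(pose_3d, pose_2d, fx, fy, cx, cy):
    """Mean squared pixel distance between projected 3D joints and 2D joints."""
    total = 0.0
    for joint_3d, (u, v) in zip(pose_3d, pose_2d):
        pu, pv = project(joint_3d, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(pose_3d)

def bone_lengths(pose_3d):
    """Euclidean length of every bone in one frame."""
    return [math.dist(pose_3d[a], pose_3d[b]) for a, b in BONES]

def bone_length_loss(poses_over_time):
    """Sum over bones of the variance of that bone's length across frames."""
    per_frame = [bone_lengths(pose) for pose in poses_over_time]
    n_frames = len(per_frame)
    loss = 0.0
    for k in range(len(BONES)):
        mean_len = sum(frame[k] for frame in per_frame) / n_frames
        loss += sum((frame[k] - mean_len) ** 2 for frame in per_frame) / n_frames
    return loss
```

In a refinement loop, a weighted sum of such terms would be minimized with respect to the estimated 3D joints of a test video; the weights and optimizer are left open here because the abstract does not specify them.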

“…To handle inherent depth ambiguity, Wang et al [39] proposed a novel hierarchical multi-person ordinal relation. Another type of work utilizes temporal information to recover 3D poses from a given video [7,8]. By applying a top-down scheme, the above method either directly regresses the absolute 3D depth from a cropped image, or it computes it based on a prior of the body size, ignoring global image contexts.…”

Section: Related Work (mentioning)

confidence: 99%

Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation

Zhang, Wang, Shi et al. 2022

Preprint

Inter-person occlusion and depth ambiguity make estimating the 3D poses of multiple persons in camera-centric coordinates from a monocular image a challenging problem. Typical top-down frameworks suffer from high computational redundancy with an additional detection stage. By contrast, bottom-up methods enjoy low computational costs as they are less affected by the number of humans. However, most existing bottom-up methods treat camera-centric 3D human pose estimation as two unrelated subtasks: 2.5D pose estimation and camera-centric depth estimation. In this paper, we propose a unified model that leverages the mutual benefits of both these subtasks. Within the framework, a robust structured 2.5D pose estimation is designed to recognize inter-person occlusion based on depth relationships. Additionally, we develop an end-to-end geometry-aware depth reasoning method that exploits the mutual benefits of both 2.5D pose and camera-centric root depths. This method first uses 2.5D pose and geometry information to infer camera-centric root depths in a forward pass, and then exploits the root depths to further improve representation learning of 2.5D pose estimation in a backward pass. Further, we design an adaptive fusion scheme that leverages both visual perception and body geometry to alleviate inherent depth ambiguity issues. Extensive experiments demonstrate the superiority of our proposed model over a wide range of bottom-up methods. Our accuracy is even competitive with top-down counterparts. Notably, our model runs much faster than existing bottom-up and top-down methods.

CCS Concepts: Computing methodologies → Activity recognition and understanding.
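
To illustrate how a 2.5D pose and a camera-centric root depth combine into camera-centric 3D joints, here is a minimal sketch under stated assumptions. It assumes the common convention that a 2.5D joint is a pixel location plus a depth relative to the root joint, and a pinhole camera with known intrinsics; the function names and parameters are hypothetical and are not taken from the cited paper.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Invert the pinhole projection: pixel (u, v) at depth z -> camera-space (X, Y, Z)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def pose_25d_to_camera(joints_25d, root_depth, fx, fy, cx, cy):
    """Lift 2.5D joints (u, v, depth relative to root) to camera-centric 3D.

    root_depth is the absolute depth of the root joint along the camera axis.
    """
    return [
        backproject(u, v, root_depth + rel_depth, fx, fy, cx, cy)
        for u, v, rel_depth in joints_25d
    ]
```

For example, with `fx = fy = 100`, `cx = cy = 50` and a root depth of 2.0, a joint at pixel (100, 50) with relative depth 0.5 lifts to (1.25, 0.0, 2.5). The sketch makes the dependence explicit: an error in the root depth shifts and rescales every joint, which is why the two subtasks are coupled.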
