“…A common approach with the former representation is to "lift" 2D keypoints (either ground truth or predicted by a 2D pose detector) to 3D. This has recently been done with neural networks [28,57,31], and previously with a dictionary of 3D skeletons [38,2,59,54] or other priors [47,50,2] to constrain the problem. The point cloud representation also allows one to train a CNN to regress 3D joints directly from an image (instead of from 2D keypoints), using supervision from motion capture datasets such as Human3.6M [35,41,34].…”