Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation

Wan, Ziniu; Li, Zhengjia; Tian, Maoqing; Yi, Shuai; Li, Hongsheng

doi:10.1109/iccv48922.2021.01279

Cited by 66 publications

(27 citation statements)

References 24 publications

“…Lately, there has been a trend to adopt the multi-head self-attention (MHA) module [199] for long-term sequence dependency modeling [109], [135]. Wan et al [148] modify the original MHA to perform spatial and temporal encoding simultaneously.…”

Section: Recovery From Monocular Videos (mentioning)

confidence: 99%

“…Tripathi et al [200] use a sliding window to penalize 3D joints of the same frames before and after the window strides. Wan et al [148] use a series of learnable linear regressors to decode joint rotations in a hierarchical order. Some objective terms are predefined empirically or learned from large motion capture datasets [86], [92].…”

Section: Recovery From Monocular Videos (mentioning)

confidence: 99%

Recovering 3D Human Mesh from Monocular Images: A Survey

Tian, Zhang, Liu, et al. 2022

Preprint

Estimating human pose and shape from monocular images is a long-standing problem in computer vision. Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention. With the same goal of obtaining well-aligned and physically plausible mesh results, two paradigms have been developed to overcome challenges in the 2D-to-3D lifting process: i) an optimization-based paradigm, where different data terms and regularization terms are exploited as optimization objectives; and ii) a regression-based paradigm, where deep learning techniques are embraced to solve the problem in an end-to-end fashion. Meanwhile, continuous efforts are devoted to improving the quality of 3D mesh labels for a wide range of datasets. Though remarkable progress has been achieved in the past decade, the task is still challenging due to flexible body motions, diverse appearances, complex environments, and insufficient in-the-wild annotations. To the best of our knowledge, this is the first survey to focus on the task of monocular 3D human mesh recovery. We start with the introduction of body models and then elaborate recovery frameworks and training objectives by providing in-depth analyses of their strengths and weaknesses. We also summarize datasets, evaluation metrics, and benchmark results. Open issues and future directions are discussed in the end, hoping to motivate researchers and facilitate their research in this area. A regularly updated project page can be found at https://github.com/tinatiansjz/hmr-survey.

“…[25,26] These network models contain huge amounts of parameters and so may be limited to be used for videos in some applications. Aiming at video processing, some recent works [27,28] exploit transformer modules to capture the temporal information from both the past frames and the future frames, proving the validity and efficiency for estimating human pose [27] and even shape [28]. Multi branches of network model have displayed superiorities to cope with domain-shift when transferring among different datasets.…”

Section: Transformer In Computer Vision (mentioning)

confidence: 99%

Parallel‐branch network for 3D human pose and shape estimation in video

Wang

2022

Computer Animation & Virtual

Human pose and shape estimation has developed rapidly, and approaches based on the skinned multi-person linear (SMPL) model have recently performed very well. However, the prior template of the human body in the SMPL model is fixed, so the reconstructed body shape may deviate when a person makes sharp movements, such as in sports or dancing. To address this problem, we propose a parallel-branch network comprising a spatial-temporal (ST) branch and an SMPL branch. The ST branch essentially performs 2D-to-3D lifting for more accurate joint prediction, using a spatial transformer and a temporal transformer. The 3D joints from the ST branch supervise the 3D joints from the SMPL branch and thereby correct the deviation of the SMPL model. Experiments on popular benchmarks such as 3DPW and MPI-INF-3DHP show that our method outperforms other methods with video input. Our code is available at https://automation.seu.edu.cn/ wcx/list.htm

“…Generally speaking, multi-frame pose estimation approaches [9,26,29,37,46,65,66] show advantages over single-frame ones. Specifically, some works apply temporal models (e.g., GRUs [9,29,37], TCNs [46,65], and Transformers [59,69]) for feature extraction, ensuring the pose estimators have continuous inputs on time sequences. Other methods employ regularizers or loss functions for smoothness [26,41,43,54,57,67] to constrain the temporal consistency across successive frames.…”

Section: The Jitter Problem From Pose Estimators (mentioning)

confidence: 99%