3D human pose estimation in video with temporal convolutions and semi-supervised training

Pavllo, Dario; Feichtenhofer, Christoph; Grangier, David; Auli, Michael

doi:10.48550/arxiv.1811.11742

Cited by 7 publications

(24 citation statements)

References 0 publications


“…Table 2 further shows the results after rigid alignment with the ground truth. Under this protocol, we also show results that are on par with the existing state-of-the-art [10,29]. It is interesting to note the significant improvements of non RNN-based frameworks ([10,29] and ours) over the RNN-based framework [14].…”

Section: Methods (mentioning)

confidence: 57%

“…We refer to this as protocol 1. Several works [4,10,12,14,18,21,24,27,28,29,32,42] also report the error after aligning further with respect to the ground truth pose via Procrustes Analysis. We refer to this as protocol 2.…”

Section: Methods (mentioning)

confidence: 99%
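
As a rough illustration of the two protocols described in the statement above, the sketch below computes the mean per-joint position error directly (protocol 1) and again after a similarity Procrustes alignment to the ground truth (protocol 2). The function names `mpjpe`, `procrustes_align`, and `p_mpjpe`, and the `(J, 3)` array layout, are assumptions made for this example; they are not taken from any of the cited papers.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error between two (J, 3) poses."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def procrustes_align(pred, gt):
    """Align pred to gt with the best-fitting scale, rotation, and translation.

    Assumes both poses are (J, 3) arrays and that pred is not degenerate
    (its joints do not all coincide).
    """
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt

    # SVD of the 3x3 cross-covariance yields the optimal rotation (Kabsch).
    u, s, vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    correction = np.array([1.0, 1.0, d])

    rot = u @ np.diag(correction) @ vt           # acts on row vectors: p @ rot
    scale = float((s * correction).sum() / (p ** 2).sum())
    return scale * p @ rot + mu_gt


def p_mpjpe(pred, gt):
    """Protocol-2 style error: MPJPE after Procrustes alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global scale, rotation, and translation, the protocol 2 error is never larger than what the best similarity transform can achieve, which is why it is usually reported alongside, rather than instead of, the unaligned error.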

“…We follow [29] in using 2d detections from the Cascaded Pyramid Network (CPN) [7] as our network input. Instead of down-sampling, which is commonly done in many works that use a single-frame input, we keep the original frame rate (50fps for Human3.6M and 25fps for MPI-INF-3DHP) because having access to the complete sequence provides more detailed information.…”

Section: Methods (mentioning)

confidence: 99%

“…However, RNNs are sensitive to erroneous inputs and tend to drift over long sequences. To overcome the shortcomings of RNNs, a CNN-based framework is proposed by Pavllo et al [29] to aggregate temporal information using dilated convolutions. Despite being successful at regressing a single frame from a sequence of input, it cannot concurrently output the 3d pose estimations for all frames in the sequence.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, concurrently estimating all frames in a long sequence is an arduous task for data-driven approaches due to the increasing dimensionality of the output space with longer sequences, and the network for "many-to-many mapping" requires significantly more data to train. As a result, the state-of-the-art temporal framework for 3d pose estimation [29] only outputs the estimate of a single frame centered on an input sequence of a few hundred frames.…”

mentioning

confidence: 99%
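
To make the "single frame centered on an input sequence of a few hundred frames" remark concrete, the following sketch computes the receptive field of a stack of stride-1 dilated 1D convolutions. The helper names and the layer configuration (kernel width 3, dilation growing by a factor of 3 per layer) are illustrative assumptions, not settings quoted from the statements above.

```python
def receptive_field(kernel_sizes, dilations):
    """Number of input frames seen by one output of stacked stride-1 convolutions."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("kernel_sizes and dilations must have the same length")
    frames = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # Each layer widens the window by (kernel - 1) * dilation frames.
        frames += (kernel - 1) * dilation
    return frames


def exponential_dilations(kernel_size, num_layers):
    """Dilations 1, k, k^2, ... so consecutive layers tile the input without gaps."""
    return [kernel_size ** i for i in range(num_layers)]


# Hypothetical configuration: five layers of width-3 kernels.
dilations = exponential_dilations(3, 5)          # [1, 3, 9, 27, 81]
print(receptive_field([3] * 5, dilations))       # 243
```

With exponentially growing dilations the window grows geometrically with depth, so a handful of layers is enough to cover a few hundred frames while still producing a single output for the center frame.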


Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation

Lin, Lee. 2019. Preprint.


Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3d pose estimation of a single frame from a sequential input. In this paper, we propose a deep learning-based framework that utilizes matrix factorization for sequential 3d human poses estimation. Our approach processes all input frames concurrently to avoid the sensitivity and drift problems, and yet outputs the 3d pose estimates for every frame in the input sequence. More specifically, the 3d poses in all frames are represented as a motion matrix factorized into a trajectory bases matrix and a trajectory coefficient matrix. The trajectory bases matrix is precomputed from matrix factorization approaches such as Singular Value Decomposition (SVD) or Discrete Cosine Transform (DCT), and the problem of sequential 3d pose estimation is reduced to training a deep network to regress the trajectory coefficient matrix. We demonstrate the effectiveness of our framework on long sequences by achieving state-of-the-art performances on multiple benchmark datasets. Our source code is available at: https://github.com/jiahaoLjh/trajectory-pose-3d.
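
The factorization described in this abstract can be sketched with a fixed DCT basis. The example below is a minimal illustration, assuming a motion matrix of shape `(F, 3J)` (frames by flattened joint coordinates) and an orthonormal DCT-II basis of shape `(F, K)`; these shapes and the function names are assumptions of this sketch and are not taken from the paper or its repository. Here the coefficients are obtained by projecting a known motion, whereas the abstract describes a network that regresses them from the input.

```python
import numpy as np


def dct_bases(num_frames, num_bases):
    """Return an (F, K) matrix whose columns are the first K orthonormal DCT-II bases."""
    if not 1 <= num_bases <= num_frames:
        raise ValueError("num_bases must be between 1 and num_frames")
    t = np.arange(num_frames)[:, None]
    k = np.arange(num_bases)[None, :]
    bases = np.sqrt(2.0 / num_frames) * np.cos(np.pi * (2 * t + 1) * k / (2 * num_frames))
    bases[:, 0] = np.sqrt(1.0 / num_frames)      # the constant basis needs its own scale
    return bases


def to_coefficients(motion, bases):
    """Project an (F, 3J) motion matrix onto the bases, giving (K, 3J) coefficients."""
    return bases.T @ motion


def reconstruct(coefficients, bases):
    """Recover an (F, 3J) motion matrix from (K, 3J) trajectory coefficients."""
    return bases @ coefficients


# Hypothetical sizes: 50 frames, 17 joints, 8 bases.
rng = np.random.default_rng(0)
motion = np.cumsum(rng.normal(scale=0.01, size=(50, 17 * 3)), axis=0)
bases = dct_bases(50, 8)
approx = reconstruct(to_coefficients(motion, bases), bases)
print(approx.shape)                              # (50, 51)
```

Keeping only the first K low-frequency bases shrinks the regression target from `F * 3J` values to `K * 3J` values, at the cost of discarding high-frequency motion.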


A review of 3D human pose estimation algorithms for markerless motion capture

Desmarais, Mottet, Slangen, et al. 2021. Computer Vision and Image Understanding.

110


Freezing of gait assessment with inertial measurement units and deep learning: effect of tasks, medication states, and stops

Yang, Filtjens, Ginis, et al. 2023. Preprint.


Background: Freezing of gait (FOG) is an episodic and highly disabling symptom of Parkinson's Disease (PD). Traditionally, FOG assessment relies on time-consuming visual inspection of camera footage. Therefore, previous studies have proposed portable and automated solutions to annotate FOG. However, automated FOG assessment is challenging due to gait variability caused by medication effects and varying FOG-provoking tasks. Moreover, whether automated approaches can differentiate FOG from typical everyday movements, such as volitional stops, remains to be determined. To address these questions, we evaluated an automated FOG assessment model with deep learning (DL) based on inertial measurement units (IMUs). We assessed its performance trained on all standardized FOG-provoking tasks and medication states, as well as on specific tasks and medication states. Furthermore, we examined the effect of adding stopping periods on FOG detection performance.

Methods: Twelve PD patients with self-reported FOG (mean age 69.33 +/- 6.28 years) completed a FOG-provoking protocol, including timed-up-and-go and 360-degree turning-in-place tasks in On/Off dopaminergic medication states with/without volitional stopping. IMUs were attached to the pelvis and both sides of the tibia and talus. A multi-stage temporal convolutional network was developed to detect FOG episodes. FOG severity was quantified by the percentage of time frozen (%TF) and the number of freezing episodes (#FOG). The agreement between the model-generated outcomes and the gold standard experts' video annotation was assessed by the intra-class correlation coefficient (ICC).

Results: For FOG assessment in trials without stopping, the agreement of our model was strong (ICC(%TF) = 0.92 [0.68, 0.98]; ICC(#FOG) = 0.95 [0.72, 0.99]). Models trained on a specific FOG-provoking task could not generalize to unseen tasks, while models trained on a specific medication state could generalize to unseen states. For assessment in trials with stopping, the model trained on stopping trials made fewer false positives than the model trained without stopping (ICC(%TF) = 0.95 [0.73, 0.99]; ICC(#FOG) = 0.79 [0.46, 0.94]).

Conclusion: A DL model trained on IMU signals allows valid FOG assessment in trials with/without stops containing different medication states and FOG-provoking tasks. These results are encouraging and enable future work investigating automated FOG assessment during everyday life.
