2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018
DOI: 10.1109/cvpr.2018.00546

LSTM Pose Machines

Abstract: We observed that recent state-of-the-art results on single image human pose estimation were achieved by multistage Convolution Neural Networks (CNN). Notwithstanding the superior performance on static images, the application of these models on videos is not only computationally intensive, it also suffers from performance degeneration and flicking. Such suboptimal results are mainly attributed to the inability of imposing sequential geometric consistency, handling severe image quality degradation (e.g. motion b…

Cited by 129 publications (127 citation statements)
References 35 publications (88 reference statements)
“…In [25], Song et al propose a Thin-Slicing network that uses dense optical flow to warp and align heatmaps of neighboring frames and then performs spatial-temporal inference via message passing through the graph constructed by joint candidates and their relationships among aligned heatmaps. [11] and [20] sequentially estimate human poses in videos following the Encoder-RNN-Decoder framework. Given a frame, this kind of framework first uses an encoder network to learn high-level image representations, then RNN units to explicitly propagate temporal information between neighboring frames and produce hidden states, and finally a decoder network to take hidden states as input and output pose estimation results of current frame.…”
Section: Related Work (mentioning)
confidence: 99%
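
The Encoder-RNN-Decoder flow described in the statement above can be sketched roughly as follows. This is a toy illustration only, not the method of [11] or [20]: the class name `EncoderRNNDecoder`, the helper functions, the layer sizes, the plain tanh recurrence (standing in for whatever RNN unit those papers use), and the random untrained weights are all assumptions made for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class EncoderRNNDecoder:
    """Toy Encoder-RNN-Decoder; all sizes and weights are illustrative."""

    def __init__(self, frame_dim, feat_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.W_enc = rand_matrix(feat_dim, frame_dim, rng)
        self.W_in = rand_matrix(hidden_dim, feat_dim, rng)
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        self.W_dec = rand_matrix(out_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def encode(self, frame):
        # Encoder: frame -> high-level representation.
        return [math.tanh(v) for v in matvec(self.W_enc, frame)]

    def step(self, feat, hidden):
        # RNN unit: mix current features with the previous hidden state.
        a = matvec(self.W_in, feat)
        b = matvec(self.W_rec, hidden)
        return [math.tanh(x + y) for x, y in zip(a, b)]

    def decode(self, hidden):
        # Decoder: hidden state -> per-location scores (a flattened "heatmap").
        return matvec(self.W_dec, hidden)

    def run(self, frames):
        # Process frames in order, carrying the hidden state forward.
        hidden = [0.0] * self.hidden_dim
        outputs = []
        for frame in frames:
            hidden = self.step(self.encode(frame), hidden)
            outputs.append(self.decode(hidden))
        return outputs


model = EncoderRNNDecoder(frame_dim=4, feat_dim=3, hidden_dim=5, out_dim=6)
# The first and third frames are identical on purpose.
frames = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.4, 0.3], [0.1, 0.2, 0.3, 0.4]]
heatmaps = model.run(frames)
```

Because the hidden state carries information from earlier frames, the first and third outputs differ even though those two input frames are identical. That dependence on history is the temporal propagation the statement refers to.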
“…Its distilled pose kernels can be applied to fast localize body joints with simple convolution, further improving the efficiency. In addition, it can directly leverage temporal cues of one frame to assist body joint localization in the following frame, without requiring auxiliary optical flow models [25] or decoders appended to RNN units [20]. It can also fast distill pose kernels in a one-shot manner, avoiding complex iterating utilized by previous online kernel learning models [4,27].…”
Section: Formulation (mentioning)
confidence: 99%