2021
DOI: 10.48550/arxiv.2103.10455
Preprint

3D Human Pose Estimation with Spatial and Temporal Transformers

Abstract: Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. However, in the field of human pose estimation, convolutional architectures still remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos without convolutional architectures involved. Inspired by recent developments in vis…

Citations: cited by 10 publications (13 citation statements)
References: 34 publications
“…Thus, transformers are especially good at modelling long-range dependencies between elements of a sequence. Since then, there have been several attempts to adapt transformers towards vision tasks including object detection [2,56], image classification [8,41,52,46,14], segmentation [44], multiple object tracking [37,29], human pose estimation [50,55], point cloud processing [12,54], video processing [10,31,38], image super-resolution [30,49,3], image synthesis [9], etc. An extensive review is out of the scope of this paper.…”
Section: Transformers and Vision Transformers | Citation type: mentioning
confidence: 99%
“…Human Pose Estimation. Recent several works [28,16,19,40,44,29] introduce Transformer for human pose es-…”
Section: Vision Transformer | Citation type: mentioning
confidence: 99%
“…Vision Transformer Recently, several studies demonstrated that the transformer architectures [46] plays a significant role in a wide range of computer vision tasks, such as image classification [4,11,45], object detection [2,64], and semantic segmentation [48,55,63]. Recently, some studies also explored applying the transformer on human pose estimation tasks [28,29,31,56,62]. More specifically, for 2D pose estimation, TransPose [56] aims to explain the spatial dependencies of the predicted keypoints with transformers, PRTR [28] and TF-Pose [31] attempt to directly regress the joint coordinates by transformer decoders.…”
Section: Transformer | Citation type: mentioning
confidence: 99%
“…More specifically, for 2D pose estimation, TransPose [56] aims to explain the spatial dependencies of the predicted keypoints with transformers, PRTR [28] and TF-Pose [31] attempt to directly regress the joint coordinates by transformer decoders. While for the 3D pose estimation, METRO [29] firstly applies transformer to reconstruct 3D human pose and mesh from a single image, and PoseFormer [62] builds a spatial-temporal transformers with the input of 2D joint sequences for 3D pose estimation in videos. However, previous works have hardly exploited the transformer architectures on the multi-view 3D pose estimation setting, which is however an important task in the pose estimation area.…”
Section: Transformer | Citation type: mentioning
confidence: 99%