Multi-Scale Structure-Aware Network for Human Pose Estimation

Ke, Lipeng; Chang, Ming-Ching; Qi, Honggang; Lyu, Siwei

doi:10.1007/978-3-030-01216-8_44

Cited by 194 publications

(150 citation statements)

References 21 publications

Supporting

Mentioning: 142

Contrasting


“…An upsample process can be used to gradually recover the high-resolution representations from the low-resolution representations. The upsample subnetwork could be a symmetric version of the downsample process (e.g., VGGNet), with skipping connection over some mirrored layers to transform the pooling indices, e.g., SegNet [3] and DeconvNet [85], or copying the feature maps, e.g., U-Net [95] and Hourglass [6], [7], [21], [24], [51], [83], [109], [131], [132], encoder-decoder [90], and so on. An extension of U-Net, full-resolution residual network [92], introduces an extra full-resolution stream that carries information at the full image resolution, to replace the skip connections, and each unit in the downsample and upsample subnetworks receives information from and sends information to the full-resolution stream.…”

Section: Related Work (mentioning)

confidence: 99%

Deep High-Resolution Representation Learning for Visual Recognition

Wang

Sun

Cheng

et al. 2021

IEEE Trans. Pattern Anal. Mach. Intell.

2,582

1,479


High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently-developed classification networks, including AlexNet [59], VGGNet [101], GoogleNet [108], ResNet [39], etc., follow the design rule of LeNet-5 [61]. This is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and lead to a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt the high-resolution recovery process to raise the representation resolution from the low-resolution representation outputted by a classification or classification-like network as depicted in Figure 1 (b), e.g., Hourglass [83], SegNet [3], DeconvNet [85], U-Net [95], SimpleBaseline [124], and encoder-decoder [90]. In addition, dilated convolutions are used to remove some down-sample layers and thus yield medium-resolution representations [15], [144].

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network…
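The two characteristics named in the abstract above (parallel multi-resolution streams and repeated information exchange) can be illustrated with a toy sketch. This is not the HRNet implementation: the function names `upsample2x`, `downsample2x`, and `exchange` are hypothetical, nearest-neighbour repetition and 2x2 average pooling stand in for the learned upsampling and strided convolutions, and single-channel arrays stand in for feature maps.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling: repeat every row and column twice.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def downsample2x(x):
    # 2x2 average pooling, an assumed stand-in for a strided convolution.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def exchange(high, low):
    # Each stream keeps its own resolution and adds the resampled other stream.
    new_high = high + upsample2x(low)
    new_low = low + downsample2x(high)
    return new_high, new_low

# Two parallel streams at full and half resolution.
high = np.ones((4, 4))
low = 2.0 * np.ones((2, 2))

# Repeat the exchange, as in characteristic (ii).
for _ in range(2):
    high, low = exchange(high, low)
```

Both streams keep their original shapes after every exchange, which is the point of the design: the high-resolution stream is never discarded and later recovered, it is only enriched with information from the coarser stream.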


“…Human pose estimation is a problem of localizing human body part locations in an input image. Most of the current works [34,10,45,46,28,42] use a deep convolutional neural network and generate the output as a 2D heatmap, which is encoded as a gaussian map centered at each body part location. Hourglass network [34] exploits the iterative refinements on the predictions from the repeated encoder-decoder architecture design to capture complex spatial relationships.…”

Section: Related Work (mentioning)

confidence: 99%
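The heatmap encoding mentioned in the statement above (a Gaussian map centred at each body part location) can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the function names `gaussian_heatmap` and `decode_peak`, the 64x64 map size, and the `sigma` value are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=1.0):
    # Build a 2D map whose value decays with squared distance from the keypoint.
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def decode_peak(heatmap):
    # Recover the keypoint as the (x, y) position of the maximum response.
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)

# One heatmap per body part; here a single hypothetical keypoint at x=20, y=35.
heatmap = gaussian_heatmap(64, 64, center_xy=(20, 35), sigma=2.0)
keypoint = decode_peak(heatmap)
```

A network trained on such targets predicts one map per body part, and the argmax of each predicted map gives the estimated location. Larger `sigma` values give a smoother, more forgiving target at the cost of localization sharpness.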

“…Even with deep architectures, disambiguating look-alike body parts remain as a main problem [39] in pose estimation community. Recent methods [46,11,28], built on top of the hourglass network, use multi-scale and body part structure information to improve the performance by adding more architectural components.…”

Section: Related Work (mentioning)

confidence: 99%

Anchor Loss: Modulating Loss Scale Based on Prediction Difficulty

Ryou

Jeong

Perona

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


We propose a novel loss function that dynamically rescales the cross entropy based on prediction difficulty regarding a sample. Deep neural network architectures in image classification tasks struggle to disambiguate visually similar objects. Likewise, in human pose estimation symmetric body parts often confuse the network with assigning indiscriminative scores to them. This is due to the output prediction, in which only the highest confidence label is selected without taking into consideration a measure of uncertainty. In this work, we define the prediction difficulty as a relative property coming from the confidence score gap between positive and negative labels. More precisely, the proposed loss function penalizes the network to avoid the score of a false prediction being significant. To demonstrate the efficacy of our loss function, we evaluate it on two different domains: image classification and human pose estimation. We find improvements in both applications by achieving higher accuracy compared to the baseline methods.


“…The intermediate supervision at each hourglass module benefits from previous module outputs, refining and improving final network predictions. Given its high performance, its conceptual simplicity, and that allows for an easy multitask integration among stacked modules, this architecture is serving as a baseline model in several works [30], [31], [32], [33], [34].…”

Section: A Multi-task Architecture (mentioning)

confidence: 99%

Multi-task human analysis in still images: 2D/3D pose, depth map, and multi-part segmentation

Sánchez

Oliu

Madadi

et al. 2019

2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)


While many individual tasks in the domain of human analysis have recently received an accuracy boost from deep learning approaches, multi-task learning has mostly been ignored due to a lack of data. New synthetic datasets are being released, filling this gap with synthetic generated data. In this work, we analyze four related human analysis tasks in still images in a multi-task scenario by leveraging such datasets. Specifically, we study the correlation of 2D/3D pose estimation, body part segmentation and full-body depth estimation. These tasks are learned via the well-known Stacked Hourglass module such that each of the task-specific streams shares information with the others. The main goal is to analyze how training together these four related tasks can benefit each individual task for a better generalization. Results on the newly released SURREAL dataset show that all four tasks benefit from the multi-task approach, but with different combinations of tasks: while combining all four tasks improves 2D pose estimation the most, 2D pose improves neither 3D pose nor full-body depth estimation. On the other hand 2D parts segmentation can benefit from 2D pose but not from 3D pose. In all cases, as expected, the maximum improvement is achieved on those human body parts that show more variability in terms of spatial distribution, appearance and shape, e.g. wrists and ankles.

