In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. In addition, we show the superiority of our network in pose tracking on the PoseTrack dataset. The code and models are publicly available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
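To make the repeated multi-scale fusion concrete, here is a minimal PyTorch sketch of one fusion step across parallel resolution streams: each output stream sums contributions from every input stream, with strided 3x3 convolutions for downsampling and 1x1 convolutions plus bilinear upsampling for the reverse. The class name, channel widths, and layer choices are illustrative assumptions, not the released HRNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUnit(nn.Module):
    """Hypothetical sketch of one multi-scale fusion across parallel streams."""

    def __init__(self, channels):  # channels[i]: width of stream i, highest resolution first
        super().__init__()
        n = len(channels)
        self.transforms = nn.ModuleList()
        for i in range(n):          # output stream i
            row = nn.ModuleList()
            for j in range(n):      # input stream j
                if j == i:
                    row.append(nn.Identity())
                elif j < i:
                    # higher-resolution input -> repeated strided 3x3 convs
                    ops, c = [], channels[j]
                    for _ in range(i - j):
                        ops.append(nn.Conv2d(c, channels[i], 3, stride=2, padding=1))
                        c = channels[i]
                    row.append(nn.Sequential(*ops))
                else:
                    # lower-resolution input -> 1x1 conv, upsampled in forward()
                    row.append(nn.Conv2d(channels[j], channels[i], 1))
            self.transforms.append(row)

    def forward(self, xs):
        outs = []
        for i in range(len(xs)):
            h, w = xs[i].shape[-2:]
            acc = 0
            for j, x in enumerate(xs):
                y = self.transforms[i][j](x)
                if y.shape[-2:] != (h, w):
                    y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
                acc = acc + y   # every stream receives information from every other
            outs.append(F.relu(acc))
        return outs
```

For example, `FusionUnit([32, 64])([torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24)])` returns two fused maps at the original two resolutions; repeating such units is what keeps the high-resolution representation semantically rich.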
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named the High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All code is available at https://github.com/HRNet.

1 INTRODUCTION

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, and human pose estimation. Their strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations. Most recently developed classification networks, including AlexNet [59], VGGNet [101], GoogLeNet [108], ResNet [39], etc., follow the design rule of LeNet-5 [61]. This is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and produce a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. Previous state-of-the-art methods adopt a high-resolution recovery process to raise the representation resolution from the low-resolution representation output by a classification or classification-like network, as depicted in Figure 1 (b), e.g., Hourglass [83], SegNet [3], DeconvNet [85], U-Net [95], SimpleBaseline [124], and encoder-decoder [90]. In addition, dilated convolutions are used to remove some down-sampling layers and thus yield medium-resolution representations [15], [144].

We present a novel architecture, the High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several stages, and the nth stage contains n streams corresponding to n resolutions.
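The "gradually add high-to-low resolution streams" construction can be sketched in a few lines: at each new stage, the current lowest-resolution stream is downsampled with a strided 3x3 convolution to open one more parallel stream. This is a minimal shape-level illustration under assumed widths (channels doubling per stream), not the released HRNet code.

```python
import torch
import torch.nn as nn

def add_stream(streams, widths):
    """Open a new, lower-resolution parallel stream from the current lowest one.

    Hypothetical sketch: a strided 3x3 conv halves resolution; the
    channel-doubling rule is an assumption for illustration.
    """
    down = nn.Conv2d(widths[-1], widths[-1] * 2, kernel_size=3, stride=2, padding=1)
    streams.append(down(streams[-1]))
    widths.append(widths[-1] * 2)
    return streams, widths

# Stage 1: a single high-resolution stream.
streams, widths = [torch.randn(1, 32, 64, 48)], [32]
# Stages 2-4: add one lower-resolution stream per stage.
for _ in range(3):
    streams, widths = add_stream(streams, widths)
print([tuple(s.shape) for s in streams])  # four parallel streams, resolution halved each time
```

Between stages, fusion units like the one sketched earlier exchange information across all existing streams, which is the second key characteristic named in the abstract.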
We present a novel method for constructing a Variational Autoencoder (VAE). Instead of using a pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of the VAE, which ensures that the output preserves the spatial correlation characteristics of the input, giving it a more natural visual appearance and better perceptual quality. Drawing on recent deep-learning work such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that capture the semantic information of face expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
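A feature perceptual loss of this kind can be sketched as follows: a frozen, pretrained VGG network produces hidden activations for both the input and the VAE reconstruction, and the loss is the mean squared error between those activations at a few chosen layers. The VGG-19 backbone and the layer indices below are illustrative assumptions, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class FeaturePerceptualLoss(torch.nn.Module):
    """Feature consistency via hidden activations of a frozen pretrained VGG."""

    def __init__(self, layer_ids=(3, 8, 15)):  # roughly relu1_2, relu2_2, relu3_3
        super().__init__()
        vgg = models.vgg19(weights="DEFAULT").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)       # the loss network stays frozen
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, reconstruction, target):
        loss = 0.0
        for fr, ft in zip(self._features(reconstruction), self._features(target)):
            loss = loss + F.mse_loss(fr, ft)  # match features, not pixels
        return loss
```

In training, this term would replace the pixel-wise reconstruction loss and be added to the usual KL-divergence term of the VAE objective.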
Space-time adaptive processing (STAP) is an effective tool for detecting moving targets in airborne radar systems. Due to fast-changing clutter scenarios and/or non-side-looking configurations, the stationarity of the training data is destroyed, so statistics-based methods suffer performance degradation. Direct data domain (D3) methods avoid non-stationary training data and can effectively suppress the clutter within the test cell. However, this benefit comes at the cost of reduced system degrees of freedom (DOF), which results in performance loss. In this paper, by exploiting the intrinsic sparsity of the spectral distribution, a new direct data domain approach using sparse representation (D3SR) is proposed, which seeks to estimate the high-resolution space-time spectrum using only the test cell. Simulations of both side-looking and non-side-looking cases illustrate the effectiveness of D3SR spectrum estimation using the focal underdetermined system solver (FOCUSS) and ℓ1-norm minimization. The clutter covariance matrix (CCM) and the corresponding adaptive filter can then be effectively obtained. Since D3SR maintains the full system DOF, it can achieve better output signal-to-clutter ratio (SCR) and minimum detectable velocity (MDV) than existing D3 methods, e.g., direct data domain least squares (D3LS). D3SR is therefore more effective against range-dependent clutter and interference in non-stationary clutter scenarios.
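The pipeline the abstract describes (sparse angle-Doppler spectrum from a single test cell, then CCM, then adaptive filter) can be sketched in NumPy. Here ISTA (iterative soft thresholding) stands in for the FOCUSS/ℓ1 solvers the paper uses, and the grid size, regularization weight, element/pulse counts, and target steering are all assumed values for illustration.

```python
import numpy as np

def st_steering(fs, fd, N, M):
    """Space-time steering vector for spatial frequency fs and Doppler fd."""
    a = np.exp(2j * np.pi * fs * np.arange(N))   # spatial phase ramp
    b = np.exp(2j * np.pi * fd * np.arange(M))   # temporal (Doppler) phase ramp
    return np.kron(b, a)

def d3sr_spectrum(x, N, M, grid=32, lam=0.1, iters=200):
    """Sparse angle-Doppler spectrum from the single test snapshot x via ISTA
    (an l1-minimization stand-in for FOCUSS)."""
    fs_grid = np.linspace(-0.5, 0.5, grid, endpoint=False)
    fd_grid = np.linspace(-0.5, 0.5, grid, endpoint=False)
    Phi = np.stack([st_steering(fs, fd, N, M)
                    for fd in fd_grid for fs in fs_grid], axis=1)  # NM x grid^2
    L = np.linalg.norm(Phi, 2) ** 2              # Lipschitz constant of the gradient
    alpha = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(iters):
        g = alpha - (Phi.conj().T @ (Phi @ alpha - x)) / L
        # complex soft threshold: shrink magnitude, keep phase
        alpha = np.exp(1j * np.angle(g)) * np.maximum(np.abs(g) - lam / L, 0.0)
    return Phi, alpha

N, M = 8, 10                                     # elements, pulses (assumed)
rng = np.random.default_rng(0)
x = rng.standard_normal(N * M) + 1j * rng.standard_normal(N * M)  # stand-in test cell
Phi, alpha = d3sr_spectrum(x, N, M)
R = (Phi * np.abs(alpha) ** 2) @ Phi.conj().T + 1e-3 * np.eye(N * M)  # CCM + loading
s = st_steering(0.2, 0.3, N, M)                  # assumed target steering vector
w = np.linalg.solve(R, s)                        # MVDR-style adaptive weight
w /= (s.conj() @ w)
```

The key point the sketch captures is that the full NM-dimensional system DOF is retained: the covariance matrix is rebuilt from the recovered sparse spectrum rather than estimated from neighboring (possibly non-stationary) range cells.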