Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first- and second-order Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we experimentally measure a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over the state of the art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPs reduction by removing 30% of the parameters, with a loss of 0.02% in top-1 accuracy on ImageNet. Code is available at https://github.com/NVlabs/Taylor_pruning.
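To make the first-order criterion concrete, here is a minimal sketch in PyTorch of scoring filters by the squared sum of gradient-weight products, as described above. The function name `taylor_importance`, the toy model, and the 30% pruning fraction are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def taylor_importance(conv: nn.Conv2d) -> torch.Tensor:
    """Score each output filter with a first-order Taylor criterion (sketch).

    The change in loss from removing filter m is approximated by the squared
    sum of (gradient * weight) over that filter's parameters; a backward pass
    must already have populated .grad.
    """
    w, g = conv.weight, conv.weight.grad       # shape: [out_ch, in_ch, kH, kW]
    return (g * w).sum(dim=(1, 2, 3)).pow(2)   # one score per output filter

# Hypothetical usage: accumulate scores over a few mini-batches, then remove
# the filters with the smallest accumulated importance.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
F.cross_entropy(model(x), y).backward()
scores = taylor_importance(model[0])
prune_idx = scores.argsort()[: int(0.3 * scores.numel())]  # e.g. lowest 30%
```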
Estimating the 3D pose of a hand is an essential part of human-computer interaction. Estimating 3D pose using depth or multi-view sensors has become easier with recent advances in computer vision; however, regressing pose from a single RGB image is much less straightforward. The main difficulty arises from the fact that 3D pose requires some form of depth estimate, which is ambiguous given only an RGB image. In this paper we propose a new method for 3D hand pose estimation from a monocular image through a novel 2.5D pose representation. Our new representation estimates pose up to a scaling factor, which can additionally be estimated if a prior on the hand size is given. We implicitly learn depth maps and heatmap distributions with a novel CNN architecture. Our system achieves state-of-the-art 2D and 3D hand pose estimation on several challenging datasets in the presence of severe occlusions.
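For intuition, the sketch below shows one way a 2.5D prediction (2D keypoints plus root-relative depths) can be lifted to 3D with a pinhole camera model. The function name, the NumPy formulation, and the assumption that the root depth is supplied directly (rather than recovered from a hand-size prior as in the paper) are illustrative, not the paper's implementation.

```python
import numpy as np

def backproject_25d(uv, z_rel, z_root, K):
    """Lift a 2.5D hand pose to 3D camera coordinates (illustrative sketch).

    uv:     (J, 2) pixel coordinates of the J joints
    z_rel:  (J,)   root-relative depths (the 2.5D depth component)
    z_root: scalar depth of the root joint; known only up to scale unless a
            hand-size prior fixes it
    K:      (3, 3) camera intrinsic matrix
    """
    z = z_root + z_rel                                           # absolute depth per joint
    uv_h = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)   # homogeneous pixels
    rays = uv_h @ np.linalg.inv(K).T                             # back-projected viewing rays
    return rays * z[:, None]                                     # (J, 3) joint positions
```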
In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation from depth images? And what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate the top 10 state-of-the-art methods on three tasks: single-frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, viewpoint, and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the viewpoint range of [70, 120] degrees, but it is far from being solved for extreme viewpoints; (2) 3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3) discriminative methods still generalize poorly to unseen hand shapes; (4) while joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.
Figure 1: Robustness to variations. Sample part segmentations obtained by SCOPS on different types of image collections: (left) unaligned faces from CelebA [29], (middle) birds from CUB [44], and (right) horses from PASCAL VOC [11], showing that SCOPS is robust to appearance, viewpoint, and pose variations.
Parts provide a good intermediate representation of objects that is robust with respect to camera, pose, and appearance variations. Existing work on part segmentation is dominated by supervised approaches that rely on large amounts of manual annotations and cannot generalize to unseen object categories. We propose a self-supervised deep learning approach for part segmentation, in which we devise several loss functions that aid in predicting part segments that are geometrically concentrated, robust to object variations, and semantically consistent across different object instances. Extensive experiments on different types of image collections demonstrate that our approach can produce part segments that adhere to object boundaries and are more semantically consistent across object instances than existing self-supervised techniques.
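As one concrete example of the loss functions mentioned above, here is a hedged sketch of a geometric concentration term that penalizes the spatial spread of each predicted part around its centroid; the exact formulation and normalization used in SCOPS may differ, and the function name is illustrative.

```python
import torch

def concentration_loss(part_maps):
    """Geometric concentration loss for soft part segmentations (a sketch).

    part_maps: (B, K, H, W) per-pixel part probabilities (e.g. softmax over K parts).
    Penalizes the spatial variance of each part's probability mass around its
    centroid, pushing every part toward one compact region.
    """
    B, K, H, W = part_maps.shape
    ys = torch.linspace(0, 1, H, device=part_maps.device)
    xs = torch.linspace(0, 1, W, device=part_maps.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")    # (H, W) coordinate grids

    mass = part_maps.sum(dim=(2, 3)).clamp(min=1e-6)          # (B, K) total mass per part
    cy = (part_maps * grid_y).sum(dim=(2, 3)) / mass          # part centroids (y)
    cx = (part_maps * grid_x).sum(dim=(2, 3)) / mass          # part centroids (x)
    var = ((grid_y - cy[..., None, None]) ** 2 +
           (grid_x - cx[..., None, None]) ** 2)               # squared distance to centroid
    return (part_maps * var).sum(dim=(2, 3)).div(mass).mean()
```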
We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly available. First, we propose the framework of sequential multitasking and explore it here through an architecture for landmark localization, where training with class labels acts as an auxiliary signal to guide landmark localization on unlabeled data. A key aspect of our approach is that errors can be backpropagated through a complete landmark localization model. Second, we propose and explore an unsupervised learning technique for landmark localization based on having a model predict equivariant landmarks with respect to transformations applied to the image. We show that these techniques improve landmark prediction considerably and can learn effective detectors even when only a small fraction of the dataset has landmark labels. We present results on two toy datasets and four real datasets, with hands and faces, and report a new state of the art on two datasets in the wild; e.g., with only 5% of labeled images we outperform the previous state of the art trained on the AFLW dataset.
[Supplementary architecture details appended here: Seq-MT model hyperparameters and layer configurations for the Shapes and Blocks datasets (60 × 60 × 1 inputs; stacks of 7 × 7 × 16 or 9 × 9 × 8 convolutions, 1 × 1 convolutions, and a soft-argmax over 2 or 5 channels, followed by small fully connected classification networks), and Table S12 with the Seq-MT architectures for the Hands and Multi-PIE datasets (64 × 64 × 1 inputs), truncated here.]
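The second technique above, equivariant landmark prediction, can be sketched as a simple consistency objective: landmarks predicted on a transformed image should match the transformed landmarks of the original image. The helper names below (`transform_image`, `transform_coords`) and the MSE formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch

def equivariance_loss(model, images, transform_image, transform_coords):
    """Unsupervised equivariance objective for landmark localization (a sketch).

    model:            maps images (B, C, H, W) -> landmark coordinates (B, L, 2)
    transform_image:  applies a random geometric transform T to the images
    transform_coords: applies the same T to landmark coordinates
    The landmarks predicted on T(image) should match T(landmarks of image),
    giving a training signal on images without landmark annotations.
    """
    landmarks = model(images)                        # (B, L, 2) on original images
    landmarks_t = model(transform_image(images))     # predictions on warped images
    return torch.nn.functional.mse_loss(landmarks_t, transform_coords(landmarks))
```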
We propose a novel multi-sensor system for accurate and power-efficient dynamic car-driver hand-gesture recognition, using a short-range radar, a color camera, and a depth camera, which together make the system robust against variable lighting conditions. We present a procedure to jointly calibrate the radar and depth sensors. We employ convolutional deep neural networks to fuse data from multiple sensors and to classify the gestures. Our algorithm accurately recognizes 10 different gestures acquired indoors and outdoors in a car during the day and at night. It consumes significantly less power than purely vision-based systems.
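As a rough illustration of fusing multiple sensors with convolutional networks (not the authors' architecture, which operates on dynamic gesture sequences), the sketch below runs one small convolutional stream per sensor and classifies the concatenated features into ten gesture classes.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Illustrative multi-sensor fusion classifier (not the authors' model).

    One small convolutional stream per sensor (radar, color, depth); the stream
    features are concatenated and classified into the gesture classes.
    """
    def __init__(self, num_classes: int = 10):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.radar, self.color, self.depth = stream(1), stream(3), stream(1)
        self.head = nn.Linear(16 * 3, num_classes)

    def forward(self, radar, color, depth):
        feats = torch.cat([self.radar(radar), self.color(color), self.depth(depth)], dim=1)
        return self.head(feats)  # per-gesture logits
```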