DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation

Liu, Jiang; Gao, Chenqiang; Meng, Deyu; Hauptmann, Alexander G.

doi:10.1109/cvpr.2018.00545

Cited by 352 publications

(253 citation statements)

References 42 publications

Citation types: Supporting, Mentioning (251), Contrasting

“…Two examples are shown in Figure 1. Similar to [3,19,20], we observe that dense-crowd regions are usually underestimated, while sparse-crowd regions are overestimated. Such phenomenon is due to two main factors.…”

Section: Introduction (supporting)

confidence: 60%

Learning Spatial Awareness to Improve Crowd Counting

Cheng, Dai, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

127

The aim of crowd counting is to estimate the number of people in images by leveraging the annotation of center positions for pedestrians' heads. Promising progress has been made with the prevalence of deep Convolutional Neural Networks. Existing methods widely employ the Euclidean distance (i.e., L2 loss) to optimize the model, which, however, has two main drawbacks: (1) the loss has difficulty in learning spatial awareness (i.e., the position of heads) since it struggles to retain the high-frequency variation in the density map, and (2) the loss is highly sensitive to various kinds of noise in crowd counting, such as zero-mean noise, head size changes, and occlusions. Although the Maximum Excess over SubArrays (MESA) loss has been previously proposed by [16] to address the above issues by finding the rectangular subregion whose predicted density map has the maximum difference from the ground truth, it cannot be solved by gradient descent and thus can hardly be integrated into the deep learning framework. In this paper, we present a novel architecture called SPatial Awareness Network (SPANet) to incorporate spatial context for crowd counting. The Maximum Excess over Pixels (MEP) loss is proposed to achieve this by finding the pixel-level subregion with high discrepancy to the ground truth. To this end, we devise a weakly supervised learning scheme to generate such a region with a multi-branch architecture. The proposed framework can be integrated into existing deep crowd counting methods and is end-to-end trainable. Extensive experiments on four challenging benchmarks show that our method can significantly improve the performance of baselines. More remarkably, our approach outperforms the state-of-the-art methods on all benchmark datasets.
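To make the MESA idea in the abstract above concrete, here is a minimal brute-force sketch in plain Python. It assumes the usual reading of MESA as the largest absolute difference in summed density over any axis-aligned rectangle; the function names (`integral_image`, `mesa_distance`, `l2_distance`) are hypothetical, and the exhaustive search is for illustration only, not the solver used in [16] or the MEP loss of SPANet.

```python
def integral_image(grid):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(grid), len(grid[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = (grid[y][x] + sat[y][x + 1]
                                 + sat[y + 1][x] - sat[y][x])
    return sat


def mesa_distance(pred, gt):
    """Largest |sum(pred) - sum(gt)| over all axis-aligned boxes (brute force)."""
    h, w = len(pred), len(pred[0])
    diff = [[pred[y][x] - gt[y][x] for x in range(w)] for y in range(h)]
    sat = integral_image(diff)
    best = 0.0
    for y0 in range(h):
        for y1 in range(y0 + 1, h + 1):
            for x0 in range(w):
                for x1 in range(x0 + 1, w + 1):
                    box = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
                    best = max(best, abs(box))
    return best


def l2_distance(pred, gt):
    """Pixel-wise sum of squared differences, for comparison."""
    return sum((p - g) ** 2
               for row_p, row_g in zip(pred, gt)
               for p, g in zip(row_p, row_g))


# Toy density maps: one head, predicted one pixel to the right of its true spot.
gt = [[1.0, 0.0], [0.0, 0.0]]
shifted = [[0.0, 1.0], [0.0, 0.0]]
print(mesa_distance(shifted, gt), l2_distance(shifted, gt))
```

The search over boxes is a discrete maximization, which is one way to see why the abstract describes MESA as hard to optimize directly with gradient descent.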

“…The influence of Cmax to SS-DCNet on the ShanghaiTech Part_A dataset [53]. The numbers in the brackets denote quantiles of the training set, for example, 22 (95%) means the 95% quantile is 22. 'VGG16 Encoder' is the classification baseline without S-DC. 'One-Linear' and 'Two-Linear' are defined in Section 6.1.1.…”

Section: How Many Times To Divide? (mentioning)

confidence: 99%

From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer

Xiong, Liu, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

148

107

Visual counting, a task that aims to estimate the number of objects from an image/video, is an open-set problem by nature, i.e., the count can vary in [0, +∞) in theory. However, collected data and labeled instances are limited in reality, which means that only a small closed set is observed. Existing methods typically model this task in a regression manner, and they are prone to fail on unseen scenes with counts outside the scope of the closed set. In fact, counting has an interesting and exclusive property: it is spatially decomposable. A dense region can always be divided until sub-region counts are within the previously observed closed set. We therefore introduce the idea of spatial divide-and-conquer (S-DC) that transforms open-set counting into a closed-set problem. This idea is implemented by a novel Supervised Spatial Divide-and-Conquer Network (SS-DCNet). Thus, SS-DCNet can learn only from a closed set yet generalize well to open-set scenarios via S-DC. SS-DCNet is also efficient. To avoid repeatedly computing sub-region convolutional features, S-DC is executed on the feature map instead of on the input image. We provide theoretical analyses as well as a controlled experiment on toy data, demonstrating why closed-set modeling makes sense. Extensive experiments show that SS-DCNet achieves state-of-the-art performance on three crowd counting datasets (ShanghaiTech, UCF_CC_50 and UCF-QNRF), a vehicle counting dataset (TRANCOS) and a plant counting dataset (MTC), with a 7.7% relative improvement on UCF-QNRF, 33.1% on TRANCOS, and 26.4% on MTC. SS-DCNet also reports state-of-the-art cross-domain performance on crowd counting datasets. Particularly in the task from UCF-QNRF to ShanghaiTech Part_A, SS-DCNet even beats most existing models trained directly on the target domain. Code and models have been made available at: https://tinyurl.com/SS-DCNet.
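The spatial decomposability described in the abstract above can be illustrated with a toy recursion in plain Python. This is only a sketch under stated assumptions: `closed_set_counter` is a hypothetical stand-in for a local counter that cannot output more than `c_max`, the split rule (divide whenever the estimate reaches `c_max`) is an assumption for illustration, and the recursion runs on a grid of per-cell counts rather than on a feature map as SS-DCNet does.

```python
def region_sum(grid, y0, y1, x0, x1):
    """True count inside the half-open box [y0, y1) x [x0, x1)."""
    return sum(grid[y][x] for y in range(y0, y1) for x in range(x0, x1))


def closed_set_counter(grid, y0, y1, x0, x1, c_max):
    """Hypothetical local counter: its output saturates at c_max."""
    return min(region_sum(grid, y0, y1, x0, x1), c_max)


def sdc_count(grid, y0, y1, x0, x1, c_max):
    """Divide a region into up to four parts until estimates fall below c_max."""
    estimate = closed_set_counter(grid, y0, y1, x0, x1, c_max)
    h, w = y1 - y0, x1 - x0
    # Accept the estimate if it is inside the closed set or no split is possible.
    if estimate < c_max or (h <= 1 and w <= 1):
        return estimate
    ym, xm = y0 + h // 2, x0 + w // 2
    total = 0
    for ya, yb in ((y0, ym), (ym, y1)):
        for xa, xb in ((x0, xm), (xm, x1)):
            if ya < yb and xa < xb:  # skip empty sub-regions
                total += sdc_count(grid, ya, yb, xa, xb, c_max)
    return total


# Toy 4x4 map of per-cell counts; the whole image holds 23 objects.
cells = [
    [3, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 0, 4, 4],
    [0, 1, 4, 4],
]
print(closed_set_counter(cells, 0, 4, 0, 4, c_max=5))
print(sdc_count(cells, 0, 4, 0, 4, c_max=5))
```

In this toy setup the saturating counter alone underestimates the dense image, while the divided estimate recovers the full count as long as no single cell exceeds `c_max`.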

“…Sam et al [50] introduce a switching structure, which uses a classifier to assign input image patches to best column structures. Recently, Liu et al [32] propose a multi-column network to simultaneously estimate crowd density by detection and regression models. Ranjan et al [44] employ a two-column structure to iterative train their model with different resolution images.…”

Section: CNN-based Methods (mentioning)

confidence: 99%

Improving the Learning of Multi-column Convolutional Neural Network for Crowd Counting

Cheng, Dai, et al. 2019

Proceedings of the 27th ACM International Conference on Multimedia

Self Cite

Tremendous variation in the scale of people/head size is a critical problem for crowd counting. To improve the scale invariance of feature representation, recent works extensively employ Convolutional Neural Networks with multi-column structures to handle different scales and resolutions. However, due to the substantial redundant parameters in columns, existing multi-column networks invariably exhibit almost the same scale features in different columns, which severely affects counting accuracy and leads to overfitting. In this paper, we attack this problem by proposing a novel Multi-column Mutual Learning (McML) strategy. It has two main innovations: 1) A statistical network is incorporated into the multi-column framework to estimate the mutual information between columns, which can approximately indicate the scale correlation between features from different columns. By minimizing the mutual information, each column is guided to learn features with different image scales. 2) We devise a mutual learning scheme that can alternately optimize each column while keeping the other columns fixed on each mini-batch of training data. With such an asynchronous parameter update process, each column is inclined to learn feature representations different from those of the others, which can efficiently reduce the parameter redundancy and improve generalization ability. More remarkably, McML can be applied to all existing multi-column networks and is end-to-end trainable. Extensive experiments on four challenging benchmarks show that McML can significantly improve the original multi-column networks and outperform the other state-of-the-art approaches.
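The alternating update in point 2) of the abstract above can be sketched on a toy problem. The example below is an assumption-laden illustration in plain Python: each "column" is reduced to a single scalar weight, the loss is a plain squared error, and the mutual-information term from point 1) is omitted entirely. The names `predict`, `alternating_step`, and `mse` are hypothetical.

```python
def predict(weights, features):
    """Sum of per-column outputs; each column here is one scalar weight."""
    return sum(w * f for w, f in zip(weights, features))


def mse(weights, batch):
    """Mean squared error of the combined prediction over a mini-batch."""
    return sum((predict(weights, f) - t) ** 2 for f, t in batch) / len(batch)


def alternating_step(weights, batch, lr=0.3):
    """One mini-batch pass: update each column in turn, the others held fixed."""
    weights = list(weights)
    for k in range(len(weights)):
        grad = 0.0
        for features, target in batch:
            err = predict(weights, features) - target
            grad += 2.0 * err * features[k]
        # Only column k moves; later columns see this updated value.
        weights[k] -= lr * grad / len(batch)
    return weights


# Toy mini-batch generated by the weights (2, 3).
batch = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0), ((1.0, 1.0), 5.0)]
weights = [0.0, 0.0]
initial_error = mse(weights, batch)
for _ in range(200):
    weights = alternating_step(weights, batch)
final_error = mse(weights, batch)
print(weights, final_error)
```

The point of the sketch is the loop order: each column's gradient is computed with the other columns frozen at their current values, rather than all columns being updated from one shared gradient.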
