“…Benefiting from the strong representation learning ability of convolutional neural networks (CNN) [10], [11], [12], [13], [14], CNN-based methods [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31] are employed to predict a density map of a still image because the density map contains more spatial information of people distribution and its integral equals the number of people in one image. For example, multi-branch architectures [16], [18], [17], [19] are designed to extract the multi-scale features and detect varying sizes of heads because different-sized convolutional filters have varying receptive fields, which are more useful for learning non-uniform crowd distribution.…”