Multi-Resolution Fusion and Multi-scale Input Priors Based Crowd Counting

Chen

et al. 2021

Preprint

Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid visual models; however, its performance is limited by the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network that addresses the critical challenges in crowd counting by using both visual and audio inputs for better modality association and more effective feature extraction. The proposed network introduces auxiliary, explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. The three modalities (audio, visual, run-time) pass through a transformer-inspired cross-modality co-attention mechanism that outputs the final crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in between. Extensive experimental evaluations show that the proposed scheme outperforms state-of-the-art networks under all evaluation settings, with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.

Section: Related Work (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%

Audio-Visual Transformer Based Crowd Counting

Chen

et al. 2021

Preprint

“…With the explosion in the amount of data we generate today, deep learning, as a data-driven method, has become the hotspot and achieved significant performance in many directions in computer vision [1], [2], [3], [4], [5], [6]. However, there are still many scenarios when we cannot access enough training data, especially in the medical field.…”

Section: Introduction (mentioning)

confidence: 99%

Few-Shot Learning by Integrating Spatial and Frequency Representation

Chen

Wang

2021

Preprint

Human beings can recognize new objects from only a few labeled examples, but few-shot learning remains a challenging problem for machine learning systems. Most previous few-shot learning algorithms use only the spatial information of images. In this paper, we propose to integrate frequency information into the learning model to boost the discrimination ability of the system. We employ the Discrete Cosine Transform (DCT) to generate the frequency representation and then integrate the features from both the spatial and frequency domains for classification. The proposed strategy and its effectiveness are validated with different backbones, datasets, and algorithms. Extensive experiments demonstrate that frequency information is complementary to spatial representations in few-shot classification. Integrating features from both domains significantly boosts classification accuracy across different few-shot learning tasks.
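As a rough illustration of the idea in this abstract, the sketch below computes an orthonormal 2-D DCT-II of a small image patch and concatenates the raw pixel values with the DCT coefficients. The abstract does not say how the features are extracted or fused, so the function names (`dct_1d`, `dct_2d`, `fuse_features`) and the plain concatenation are assumptions made for illustration only, not the authors' pipeline.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        # Orthonormal scaling keeps the transform energy-preserving.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(patch):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in patch]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column][frequency]; transpose back to [row][column].
    return [list(row) for row in zip(*cols)]


def fuse_features(patch):
    """Hypothetical fusion: concatenate spatial and frequency values."""
    spatial = [v for row in patch for v in row]
    frequency = [v for row in dct_2d(patch) for v in row]
    return spatial + frequency


# A constant patch has all of its energy in the DC coefficient.
flat_patch = [[1.0, 1.0], [1.0, 1.0]]
flat_coeffs = dct_2d(flat_patch)
features = fuse_features(flat_patch)
```

In a real model the two representations would more likely be passed through separate learned branches before fusion; the concatenation here only shows that the frequency view is a second, complementary encoding of the same patch.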

“…Counting-by-regression [5,6,7] schemes learn the mapping of the input image or patch to its crowd count, whereas the density-map estimation methods [8,9,10,11,12,13,14] yield the crowd-density value per input image pixel that are summed to get the image final crowd count. In general, counting-by-regression schemes do not perform reasonably well without any special and additive mechanism.…”

Section: Introduction (mentioning)

confidence: 99%
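The density-map approach described in the citation statement above can be sketched in a few lines: the model outputs one density value per pixel, and the image-level count is the sum of those values. The toy map and the helper name `count_from_density_map` are made up for illustration; a counting-by-regression model would instead output the single scalar directly.

```python
def count_from_density_map(density_map):
    """Sum per-pixel density values to get the image-level crowd count."""
    return sum(sum(row) for row in density_map)


# Toy 4x4 density map (illustrative values): two people, each spread
# as 0.25 over a 2x2 block of pixels.
toy_density_map = [
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]

estimated_count = count_from_density_map(toy_density_map)
```

Because the count is an integral over the map, the same map can also be summed over sub-regions to obtain per-patch estimates.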

Towards More Effective PRM-based Crowd Counting via A Multi-resolution Fusion and Attention Network

Sajid

Wang

2021

Preprint

The paper focuses on improving recent crowd counting approaches based on the plug-and-play patch rescaling module (PRM). To make full use of the PRM's potential and obtain more reliable and accurate results for challenging images with crowd variation, large perspective, extreme occlusions, and cluttered background regions, we propose a new PRM-based multi-resolution, multi-task crowd counting network that exploits the PRM more effectively. The proposed model consists of three deep-layered branches, each generating feature maps of a different resolution. These branches perform feature-level fusion across each other to build the collective knowledge used for the final crowd estimate. Additionally, early-stage feature maps undergo visual attention to strengthen the later-stage channels' understanding of the foreground regions. Extensive numerical and visual evaluations on four benchmark datasets show that integrating these deep branches with the PRM and the early-attended blocks is more effective than the original PRM-based schemes. The proposed approach yields a significant improvement of 12.6% in terms of the RMSE evaluation criterion. It also outperforms state-of-the-art methods in cross-dataset evaluations.
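For readers unfamiliar with the evaluation criterion, the sketch below shows how RMSE over per-image counts and a relative improvement between two RMSE values are commonly computed. The abstract does not spell out its exact formulas, so this follows the usual crowd counting convention as an assumption, and every number in it is invented for illustration rather than taken from the paper.

```python
import math


def rmse(predicted_counts, true_counts):
    """Root mean squared error over per-image crowd counts."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Illustrative values only, not results from any paper.
predicted = [105.0, 48.0, 210.0]
ground_truth = [100.0, 50.0, 200.0]
error = rmse(predicted, ground_truth)

# A hypothetical baseline RMSE of 10.0 reduced to 8.0.
gain = relative_improvement(10.0, 8.0)
```

RMSE penalizes large per-image errors more heavily than mean absolute error does, which is why it is often read as a robustness indicator in this literature.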