TransCrowd: weakly-supervised crowd counting with transformers

Liang, Dingkang; Chen, Xiwu; Xu, Wei; Zhou, Yu; Bai, Xiang

doi:10.1007/s11432-021-3445-y

Cited by 130 publications

(51 citation statements)

References 60 publications


“…Transformers have an inherent advantage in weakly-supervised crowd counting, since they can enhance global information about features and capture contextual knowledge. TransCrowd [12] was the first transformer-based crowd counting framework, which reformulates the counting problem from a sequential perspective to a counting perspective. CCTrans [31] is applicable to both fully-supervised and weakly-supervised data, and uses Twins [32] as a feature extraction framework.…”

Section: Weakly-supervised Crowd Counting (mentioning)

confidence: 99%

“…A CNN is limited to extracting a global receptive field without using a density map due to the characteristics of local feature extraction. In 2021, a transformer was introduced to the weakly-supervised crowd counting task [12]. The global attention of the corresponding network can effectively overcome the limited receptive field of CNN-based methods.…”

Section: Introduction (mentioning)

confidence: 99%


DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Zhang, Yuan, et al. 2023

Comp. Visual Media

Crowd counting provides an important foundation for public security and urban management. Due to the existence of small targets and large density variations in crowd images, crowd counting is a challenging task. Mainstream methods usually apply convolution neural networks (CNNs) to regress a density map, which requires annotations of individual persons and counts. Weakly-supervised methods can avoid detailed labeling and only require counts as annotations of images, but existing methods fail to achieve satisfactory performance because a global perspective field and multi-level information are usually ignored. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution and transformer methods to realize end-to-end crowd counting. Its main components include a recursive swin transformer and a multi-level dilated convolution regression head. The recursive swin transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head includes multi-level dilated convolution and a linear regression head for the feature extraction module. This module can capture both low- and high-level features simultaneously to enhance the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.

“…In addition, in weakly-supervised crowd counting, there are some other transformer based methods. TransCrowd [17] uses a learnable counting token or global average pooling on high-layer semantic tokens to represent the crowd numbers. It constructs a weakly supervised model from sequence-to-count perspective.…”

Section: Transformer Based Crowd Counting (mentioning)

confidence: 99%

RGB-T Multi-Modal Crowd Counting Based on Transformer

Liu, Wu, Tan, et al. 2023

Preprint

Crowd counting aims to estimate the number of persons in a scene. Most state-of-the-art crowd counting methods based on color images can't work well in poor illumination conditions due to invisible objects. With the widespread use of infrared cameras, crowd counting based on color and thermal images is studied. Existing methods only achieve multi-modal fusion without count objective constraint. To better excavate multi-modal information, we use count-guided multi-modal fusion and modal-guided count enhancement to achieve the impressive performance. The proposed count-guided multi-modal fusion module utilizes a multi-scale token transformer to interact two-modal information under the guidance of count information and perceive different scales from the token perspective. The proposed modal-guided count enhancement module employs multi-scale deformable transformer decoder structure to enhance one modality feature and count information by the other modality. Experiment in public RGBT-CC dataset shows that our method refreshes the state-of-the-art results. https://github.com/liuzywen/RGBTCC

Introduction: Crowd counting can predict the distribution of crowd and estimate the number of persons in unconstraint scenes. It is widely studied by the academia and industrial communities since the number of persons is an important indicator of incident monitoring [31], traffic control [19], and infectious disease prevention [32]. The existing crowd counting methods have achieved tremendous improvement due to the introduce of convolutional neural networks [7,8] and transformer [28,40]. However, when light is insufficient, the performance of crowd counting is unsatisfying, as shown in the first line of Fig. 1. The thermal image can percept the temperature of objects to recognize the persons. Therefore, RGB-Thermal (RGB-T) crowd counting by introducing the thermal modality has attracted a lot of attentions.


“…Inspired by the recent prominence of the transformer and its success in many CV problems such as image classification [26,8,27,10,28-30], object detection [31-33], segmentation [27,34,35], crowd counting [36,37] and image restoration [38-40], Liang et al. [12] proposed a new state-of-the-art image restoration model based on the Swin transformer [10]. The SwinIR model consists yet again of three modules: a shallow feature extractor, a transformer-based deep feature extractor and a high-quality image reconstruction module.…”

Section: SwinIR Image Restoration (mentioning)

confidence: 99%

STB-VMM: Swin Transformer Based Video Motion Magnification

Lado-Roigé, Pérez 2023

Preprint

The goal of video motion magnification techniques is to magnify small motions in a video to reveal previously invisible or unseen movement. Its uses extend from bio-medical applications and deep fake detection to structural modal analysis and predictive maintenance. However, discerning small motion from noise is a complex task, especially when attempting to magnify very subtle often sub-pixel movement. As a result, motion magnification techniques generally suffer from noisy and blurry outputs. This work presents a new state-of-the-art model based on the Swin Transformer, which offers better tolerance to noisy inputs as well as higher-quality outputs that exhibit less noise, blurriness and artifacts than prior-art. Improvements in output image quality will enable more precise measurements for any application reliant on magnified video sequences, and may enable further development of video motion magnification techniques in new technical fields.
