Audio-Visual Transformer Based Crowd Counting

Sajid, Usman; Chen, Xiangyu; Sajid, Hasan; Kim, Taejoon; Wang, Guanghui

doi:10.1109/iccvw54120.2021.00254

Cited by 17 publications

(6 citation statements)

References 43 publications

“…In AVT [15], a transformer-inspired attention mechanism is deployed to perform inter-branch fusion. Fig.…”

Section: A High-resolution Network (mentioning)

confidence: 99%

“…An insufficient receptive field is generated. AVT [15] embeds the audio modality into the image modality only in the last three-branch exchange unit.…”

Section: A High-resolution Network (mentioning)

confidence: 99%

“…This design generates an insufficient receptive field. AVT [15] exploits auditory information to aid visual models in crowd counting tasks. Audio embedding is only integrated into image features in the last three-branch exchange unit.…”

Section: Introduction (mentioning)

confidence: 99%

HRTransNet: HRFormer-Driven Two-Modality Salient Object Detection

Tang, Liu, Tan, et al. 2023

IEEE Trans. Circuits Syst. Video Technol.

High-Resolution Transformer (HRFormer) can maintain high-resolution representation and share global receptive fields. It is friendly towards salient object detection (SOD) in which the input and output have the same resolution. However, two critical problems need to be solved for two-modality SOD. One problem is two-modality fusion. The other problem is the HRFormer output's fusion. To address the first problem, a supplementary modality is injected into the primary modality by using global optimization and an attention mechanism to select and purify the modality at the input level. To solve the second problem, a dual-direction short connection fusion module is used to optimize the output features of HRFormer, thereby enhancing the detailed representation of objects at the output level. The proposed model, named HRTransNet, first introduces an auxiliary stream for feature extraction of the supplementary modality. Then, features are injected into the primary modality at the beginning of each multi-resolution branch. Next, HRFormer is applied to achieve forward propagation. Finally, all the output features with different resolutions are aggregated by intra-feature and inter-feature interactive transformers. Application of the proposed model results in impressive improvement for driving two-modality SOD tasks, e.g., RGB-D, RGB-T, and light field SOD. https://github.com/liuzywen/HRTransNet

“…Hu et al [58] propose an estimation model that jointly learns visual and audio modalities, and release a large-scale audiovisual crowd counting dataset DISCO. Sajid et al [59] propose an audiovisual multi-task network based on the transformer structure to achieve better pattern association and efficient feature extraction. Hu et al [60] propose an Audio-Visual Multi-Scale Network (AVMSN) to model unconstrained visual and auditory sources for crowd counting.…”

Section: B Multi-modal Crowd Counting (mentioning)

confidence: 99%

MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting

Chen, Ji, Yuan, et al. 2022

Preprint

RGB-Thermal (RGB-T) crowd counting is a challenging task, which uses thermal images as complementary information to RGB images to deal with the decreased performance of unimodal RGB-based methods in scenes with low illumination or similar backgrounds. Most existing methods propose well-designed structures for cross-modal fusion in RGB-T crowd counting. However, these methods have difficulty in encoding cross-modal contextual semantic information in RGB-T image pairs. Considering the aforementioned problem, we propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet), which aims to fully capture long-range contextual information from the RGB and thermal modalities based on the attention mechanism. Specifically, in the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion at the global level. In addition, a Multi-modal Multi-scale Aggregation (MMA) regression head is introduced to make full use of the multi-scale and contextual information across modalities to generate high-quality crowd density maps. Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting and achieves state-of-the-art performance.

“…1 where persons that are far from the camera appear much smaller than those close to it. Existing methods use multi-column structure [1,21,46,48], dilated convolution [2,6,15], high-resolution representation [24], and attention mechanism [18] to enlarge the receptive fields. Under the transformer framework, we propose a multi-scale token transformer to perceive persons with different scales.…”

mentioning

confidence: 99%

RGB-T Multi-Modal Crowd Counting Based on Transformer

Liu, Wu, Tan, et al. 2023

Preprint

Crowd counting aims to estimate the number of persons in a scene. Most state-of-the-art crowd counting methods based on color images can't work well in poor illumination conditions due to invisible objects. With the widespread use of infrared cameras, crowd counting based on color and thermal images is studied. Existing methods only achieve multi-modal fusion without a count objective constraint. To better excavate multi-modal information, we use count-guided multi-modal fusion and modal-guided count enhancement to achieve impressive performance. The proposed count-guided multi-modal fusion module utilizes a multi-scale token transformer to interact two-modal information under the guidance of count information and perceive different scales from the token perspective. The proposed modal-guided count enhancement module employs a multi-scale deformable transformer decoder structure to enhance one modality feature and count information by the other modality. Experiments on the public RGBT-CC dataset show that our method refreshes the state-of-the-art results. https://github.com/liuzywen/RGBTCC

Introduction

Crowd counting can predict the distribution of a crowd and estimate the number of persons in unconstrained scenes. It is widely studied by the academic and industrial communities since the number of persons is an important indicator for incident monitoring [31], traffic control [19], and infectious disease prevention [32]. The existing crowd counting methods have achieved tremendous improvement due to the introduction of convolutional neural networks [7,8] and transformers [28,40]. However, when light is insufficient, the performance of crowd counting is unsatisfying, as shown in the first line of Fig. 1. The thermal image can perceive the temperature of objects to recognize persons. Therefore, RGB-Thermal (RGB-T) crowd counting, which introduces the thermal modality, has attracted a lot of attention.
