2021
DOI: 10.48550/arxiv.2109.01926
Preprint

Audio-Visual Transformer Based Crowd Counting

Abstract: Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models; however, its performance is limited by the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces …

Cited by 2 publications (2 citation statements)
References 55 publications (85 reference statements)
“…Transformer architecture for vision tasks has recently been presented. Visual transformer (ViT) (Han et al, 2020 ; Sajid et al, 2021 ; Truong et al, 2021 ) establish the possibility of pure transformer architectures for computer vision tasks as a pioneering study. Transformer blocks are utilized as standalone architectures or presented into CNNs for semantic segmentation, image classification, image generation, image enhancement, and object detection to manipulate long-range dependencies.…”
Section: Related Study; citation type: mentioning
confidence: 99%
“…Crowd counting aims to estimate the total number of people in a given static image. This is a very challenging problem in practice since there exists a significant difference in the crowd number in and across different images, varying images resolution, large perspective, and severe occlusions [4], as shown in Fig. 1.…”
Section: Introduction; citation type: mentioning
confidence: 99%