2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
DOI: 10.1109/iccvw54120.2021.00254

Audio-Visual Transformer Based Crowd Counting

Cited by 17 publications (6 citation statements)
References 43 publications

“…In AVT [15], a transformer-inspired attention mechanism is deployed to perform inter-branch fusion. Fig.…”
Section: A High-resolution Network
mentioning
confidence: 99%
“…An insufficient receptive field is generated. AVT [15] embeds the audio modality into the image modality only in the last three-branch exchange unit.…”
Section: A High-resolution Network
mentioning
confidence: 99%
“…Hu et al [58] propose an estimation model that jointly learns visual and audio modalities, and release a large-scale audiovisual crowd counting dataset DISCO. Sajid et al [59] propose an audiovisual multi-task network based on the transformer structure to achieve better pattern association and efficient feature extraction. Hu et al [60] propose an Audio-Visual Multi-Scale Network (AVMSN) to model unconstrained visual and auditory sources for crowd counting.…”
Section: B Multi-modal Crowd Counting
mentioning
confidence: 99%
“…1 where persons that are far from the camera appear much smaller than those close to it. Existing methods use multi-column structure [1,21,46,48], dilated convolution [2,6,15], high-resolution representation [24], and attention mechanism [18] to enlarge the receptive fields. Under the transformer framework, we propose a multi-scale token transformer to perceive persons with different scales.…”
mentioning
confidence: 99%