2022
DOI: 10.48550/arxiv.2201.04676
Preprint

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Abstract: It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited re…
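As a rough illustration of the trade-off described in the abstract, the sketch below pairs local aggregation (a depthwise 3D convolution over a small T × H × W neighborhood) with global aggregation (multi-head self-attention over the flattened spatiotemporal tokens). This is a minimal, assumed design for illustration only, not the paper's actual UniFormer block; the class name, kernel size, and hyperparameters are placeholders.

```python
# Illustrative sketch (assumed design, not the paper's exact architecture):
# local 3D-conv token mixing to suppress local redundancy, followed by
# self-attention to capture global dependency across all frames and positions.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """Hypothetical spatiotemporal block combining local and global aggregation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Local aggregation: depthwise 3D convolution over a 3x3x3 neighborhood.
        self.local_mix = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        # Global aggregation: multi-head self-attention over all spatiotemporal tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video feature map.
        b, c, t, h, w = x.shape
        x = x + self.local_mix(x)                 # local context, small 3D neighborhood
        tokens = x.flatten(2).transpose(1, 2)     # (B, T*H*W, C) token sequence
        y = self.norm1(tokens)
        tokens = tokens + self.attn(y, y, y)[0]   # global dependency across frames
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)


if __name__ == "__main__":
    video = torch.randn(2, 64, 8, 14, 14)         # (batch, channels, frames, H, W)
    print(LocalGlobalBlock(64)(video).shape)      # torch.Size([2, 64, 8, 14, 14])
```

The quadratic cost of the attention step over T·H·W tokens is exactly why purely global models are expensive on video, which is the tension the abstract highlights.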


Cited by 25 publications (58 citation statements).
References 20 publications (41 reference statements).
“…For image classification, we evaluate iFormer on the ImageNet dataset [28]. We train the iFormer model with the standard procedure in [6,22,29]. Specifically, we use AdamW optimizer with an initial learning rate 1 × 10⁻³ via cosine decay [69], a momentum of 0.9, and a weight decay of 0.05.…”
Section: Results on Image Classification
confidence: 99%
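For concreteness, the following is a minimal sketch of the optimizer configuration quoted above (AdamW, initial learning rate 1 × 10⁻³ with cosine decay, beta1 of 0.9 standing in for the stated momentum, weight decay 0.05). The model, epoch count, and training loop are placeholder assumptions; the full recipe referenced in [6,22,29] includes details not shown here.

```python
# Hedged sketch of the quoted training hyperparameters; `model` and
# `num_epochs` are placeholders, not values from the cited papers.
import torch

model = torch.nn.Linear(768, 1000)   # stand-in for the actual backbone
num_epochs = 300                     # assumed schedule length, not stated in the excerpt

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                 # initial learning rate 1e-3
    betas=(0.9, 0.999),      # beta1 = 0.9 plays the role of the quoted "momentum"
    weight_decay=0.05,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch over ImageNet would go here ...
    optimizer.step()     # placeholder step so the loop is self-contained
    scheduler.step()     # cosine-decay the learning rate once per epoch
```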
“…Compared with various Transformer backbones, our iFormers still maintain the performance superiority over their results. For example, our iFormer-B surpasses UniFormer-B [22], Swin-S [5] by 0.9 points of AP^b and 3.5 points of AP^b respectively.…”
Section: Results on Object Detection and Instance Segmentation
confidence: 99%