2021
DOI: 10.48550/arxiv.2111.09641
Preprint

Evaluating Transformers for Lightweight Action Recognition

Abstract: In video action recognition, transformers consistently reach state-of-the-art accuracy. However, many models are too heavyweight for the average researcher with limited hardware resources. In this work, we explore the limitations of video transformers for lightweight action recognition. We benchmark 13 video transformers and baselines across 3 large-scale datasets and 10 hardware devices. Our study is the first to evaluate the efficiency of action recognition models in depth across multiple devices and train a…

Cited by 2 publications (2 citation statements) | References 30 publications
“…In their evaluation of previous work, Koot et al [88] discovered that CNNs perform better than Transformers in terms of the latency-accuracy trade-off on lightweight datasets. CNNs are also described as capturing inductive biases, also known as prior knowledge, such as translation equivariance and localization, while their pooling operations give partial scale invariance [89].…”
Section: Comparative Study Between CNN, Vision Transformer and Hybrid ... (mentioning)
confidence: 99%
“…Previously, Koot et al [120] discovered that CNNs perform better than Transformers in terms of the latency-accuracy trade-off on lightweight datasets. However, CNNs have a few weaknesses, including slowness brought on by the max-pooling operation; additionally, in contrast to Transformers, they do not consider the multiple perspectives that can be gained by learning [121], which leads them to disregard global knowledge.…”
Section: The Roles of Transformers in Predicting the Use of Drug Comb... (mentioning)
confidence: 99%