2022
DOI: 10.1016/j.patcog.2021.108487

Action Transformer: A self-attention model for short-time pose-based human action recognition

Cited by 138 publications (32 citation statements)
References 21 publications
“…Related work usually differentiates human activities based on the target task or application. Simple atomic actions, poses, and motions are usually easy to identify accurately with various machine learning models [1,9,11,13], whereas composite activities such as ADLs require either more intricate models or more complex feature engineering [16]. A common approach is to decompose complex ADLs into simpler activities, which are generally easier to recognize [24].…”
Section: Related Work
confidence: 99%
“…In the activity recognition area, Trear [41] proposes a transformer-based RGB-D egocentric activity recognition framework, adapting self-attention to model the temporal structure of different modalities. In addition, action-transformer [42], motion-transformer [43], hierarchical-transformer [44], spatial temporal transformer network [45] and STST [46] are designed for skeleton-based activity recognition, modeling temporal and spatial dependencies in the skeleton sequences. MM-ViT [47] factorizes self-attention across the space, time, and modality dimensions, operating in the compressed video domain and exploiting various modalities.…”
Section: Related Work
confidence: 99%
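The skeleton-based transformers cited in this statement share one core idea: treat a short sequence of pose keypoints as tokens and let self-attention model their temporal dependencies. Below is a minimal sketch of that pattern in PyTorch; it is not the implementation from any of the cited papers, and the class name, dimensions, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch of a pose-sequence transformer classifier.
# All names and sizes here are assumptions, not the cited papers' code.
import torch
import torch.nn as nn

class PoseActionTransformer(nn.Module):
    """Transformer encoder over per-frame pose vectors (hypothetical example)."""

    def __init__(self, num_joints=17, num_frames=30, num_classes=20,
                 d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # Project each frame's flattened (x, y) keypoints to the model width.
        self.embed = nn.Linear(num_joints * 2, d_model)
        # Learnable class token and positional embeddings (ViT-style).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, poses):
        # poses: (batch, num_frames, num_joints * 2)
        x = self.embed(poses)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed[:, : x.size(1)]
        x = self.encoder(x)            # temporal self-attention across frames
        return self.head(x[:, 0])      # classify from the class-token output

# Usage: a batch of 8 clips, 30 frames, 17 joints with (x, y) coordinates.
model = PoseActionTransformer()
logits = model(torch.randn(8, 30, 17 * 2))
print(logits.shape)  # torch.Size([8, 20])
```

This sketch collapses spatial relations among joints into the per-frame embedding; the spatial-temporal variants cited above instead attend over joints and frames as separate dimensions.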
“…Bai et al. [20] introduced an approach based on multi-range feature interchange to capture short-range motion features and long-range dependencies. Finally, advanced transformer-based approaches have been proposed for action recognition [21], [5], [22], [23], [24], [25], often exploiting skeleton points [26], [27], [28], [29], [30]. Unlike our work, these approaches consider a standard supervised setting, where all data is labelled and there is no domain shift between training and evaluation datasets.…”
Section: Introduction
confidence: 99%