2021
DOI: 10.48550/arxiv.2111.01936
Preprint

Revisiting spatio-temporal layouts for compositional action recognition

Abstract: Recognizing human actions is fundamentally a spatio-temporal reasoning problem, and should be, at least to some extent, invariant to the appearance of the human and the objects involved. Motivated by this hypothesis, in this work, we take an object-centric approach to action recognition. Multiple works have studied this setting before, yet it remains unclear (i) how well a carefully crafted, spatio-temporal layout-based method can recognize human actions, and (ii) how, and when, to fuse the information from la…

Cited by 2 publications (2 citation statements)
References 47 publications
“…Obj. Top-1 Top-5 STIN [26] 37.2 62.4 I3D+STIN [26] 48.2 72.6 CAF [30] 52.3 78.9 STRG [42] 52.3 78.3 IRN (Ours) 52.9 80.8 I3D-STIN [26] 51.5 77.1 STRG-STIN [26] 56.2 81.3 CACNF [30] 56.9 82.5 224 × 224. We refer to this augmentation approach as SCR, which we use in all our experiments on EPIC-KITCHENS-100.…”
Section: Datasets Evaluation and Implementation Details
mentioning
confidence: 99%
“…Such information can be readily and reliably extracted by modern deep learning algorithms and have been reported to enhance the accuracy of action recognition [6]. For example, [7] exploited the positional relations between instances and object categories, and achieved accurate scene-level object-centered action recognition. Recently, [8] proposed a new bi-modal network for action detection, which has an RGB stream and a pose stream, and demonstrated that the heterogeneous features provide essential information for accurate action detection.…”
Section: Introduction
mentioning
confidence: 99%