2021
DOI: 10.48550/arxiv.2112.05892
Preprint

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Abstract: Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip. The task requires the compositional understanding of scene entities and relational reasoning between them. We approach GAR by modeling the video as a series of tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In …

Citations: cited by 1 publication (1 citation statement)
References: 58 publications (102 reference statements)
“…Note that we should not ignore the appearance information in VideoQA task, as the questions are unconstrained and may contain characters, objects and locations that need to be grounded to videos. This is different from the action segmentation [12] or skeleton-based activity recognition [13,64], where motion is the only critical information.…”
Section: Rethinking Motion Representations in VideoQA
Citation type: mentioning
Confidence: 93%