A Comprehensive Review of Group Activity Recognition in Videos

Wu, Lifang; Wang, Qi; Jian, Meng; Qiao, Yu; Zhao, Boxuan Simen

doi:10.1007/s11633-020-1258-8

Cited by 29 publications

(17 citation statements)

References 88 publications

“…Early work on GAR relies on handcrafted features [13,15,17,18,31,53,66]; yet notable progress has been made in recent years by deep-learning (DL) based approaches [22,41]. We review DL-based methods and refer readers to the comprehensive review of GAR presented in [97]. Early DL-based methods use Convolutional Neural Networks (CNNs) to extract features and then apply recurrent neural networks for temporal modeling [46,58,80,95].…”

Section: Group Activity Recognition (mentioning)

confidence: 99%

“…Early DL-based methods use Convolutional Neural Networks (CNNs) to extract features and then apply recurrent neural networks for temporal modeling [46,58,80,95]. Since learning inter-person interactions is essential for GAR [97], much of the research explores how to capture the actor relations [4,36,40,72,96]. Several works tackle this problem from a graph-based perspective [40,63,100,101] such as applying Graph Convolutional Networks (GCNs) [49,96].…”

Section: Group Activity Recognition (mentioning)

confidence: 99%

“…The Group Activity Recognition (GAR) task detects the activity performed by a group of actors interacting with one another in a short video clip [16,97]. GAR has widespread applications in sports analytics, robot-human interaction, …”

Section: Introduction (mentioning)

confidence: 99%

“…First, GAR requires a compositional understanding of the scene [1]. Because of the crowded scene, it is challenging to learn meaningful representations for GAR over the entire scene [97].…”

Section: Introduction (mentioning)

confidence: 99%

“…Since group activity often consists of one or more subgroups of actors and scene objects, the final action label depends on a compositional understanding of these entities [97,103]. Second, GAR benefits from relational reasoning over scene elements to understand the relative importance of entities and their interactions [36,101].…”

Section: Introduction (mentioning)

confidence: 99%

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

Zhou, Kadav, Shamsian et al. 2021

Preprint

Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip. The task requires a compositional understanding of scene entities and relational reasoning between them. We approach GAR by modeling the video as a series of tokens that represent the multi-scale semantic concepts in the video. We propose COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale and learns group activity compositionally. In addition, we use only the keypoint modality, which reduces scene biases and improves the generalization ability of the model. We improve the multi-scale representations in COMPOSER by clustering the intermediate-scale representations while maintaining consistent cluster assignments between scales. Finally, we use techniques such as auxiliary prediction and novel data augmentations (e.g., Actor Dropout) to aid model training. We demonstrate the model's strength and interpretability on the challenging Volleyball dataset. COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality. COMPOSER outperforms the latest GAR methods that rely on RGB signals, and compares favorably with methods that exploit multiple modalities. Our code will be available.
