Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition

Han, Mingfei; Zhang, David Junhao; Wang, Yali; Yan, Rui; Yao, Lina; Chang, Xiaojun; Qiao, Yu

doi:10.1109/cvpr52688.2022.00300

Cited by 36 publications

(15 citation statements)

References 30 publications

“…Clustered attention is used to capture contextual spatial-temporal information, and transformer encoder-based techniques with different backbone networks extract features for learning actor interactions from multimodal inputs [12]. Additionally, MAC-Loss [38], a combination of spatial and temporal transformers in two complementary orders, has been proposed to enhance the learning effectiveness of actor interactions and preserve actor consistency at the frame and video levels. Tamura et al. [39] introduce a framework without using heuristic features for recognizing social group activities and identifying group members.…”

Section: Group Activity Recognition (GAR), mentioning

confidence: 99%

REACT: Recognize Every Action Everywhere All At Once

Chappa, Nguyen, Dobbs et al. 2024

Preprint

Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (Recognize Every Action Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes, enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.

“…Transformer-based encoders, often coupled with diverse backbone networks, excel in extracting features for discerning actor interactions in multimodal data [46]. Recent innovations, such as MAC-Loss, introduce dual spatial and temporal transformers for enhanced actor interaction learning [47]. The field continues to evolve with heuristic-free approaches like those by Tamura et al., simplifying the process of social group activity recognition and member identification [48].…”

Section: Related Work, mentioning

confidence: 99%

HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos

Chappa, Nguyen, et al. 2024

Sensors

Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene-understanding capabilities, we introduced a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention–Flow (HAtt-Flow) mechanism, rooted in flow network theory to enhance GASG performance. Flow–attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional “values” and “keys” are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed flow–attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.

“…Machine learning-based, especially deep learning, methods are capable of learning features at various levels of abstraction from the training data to obtain better performance than those using hand-crafted features. Among the recent deep learning methods, multi-head self-attention networks (MHSA)-based methods [8][9][10] achieved the best performance with a global receptive field, although not being computationally efficient. Graphs have shown great success in characterizing the structure of a group and the interactions existing in a group in recent years.…”

Section: Introduction, mentioning

confidence: 99%

Global Individual Interaction Network Based on Consistency for Group Activity Recognition

Huang, Zhang, et al. 2023

Electronics

Modeling the interactions among individuals in a group is essential for group activity recognition (GAR). Various graph neural networks (GNNs) are regarded as popular modeling methods for GAR, as they can characterize the interaction among individuals at a low computational cost. The performance of the current GNN-based modeling methods is affected by two factors. Firstly, their local receptive field in the mapping layer limits their ability to characterize the global interactions among individuals in spatial–temporal dimensions. Secondly, GNN-based GAR methods do not have an efficient mechanism to use global activity consistency and individual action consistency. In this paper, we argue that the global interactions among individuals, as well as the constraints of global activity and individual action consistencies, are critical to group activity recognition. We propose new convolutional operations to capture the interactions among individuals from a global perspective. We use contrastive learning to maximize the global activity consistency and individual action consistency for more efficient recognition. Comprehensive experiments show that our method achieved better GAR performance than the state-of-the-art methods on two popular GAR benchmark datasets.
