Multi-Camera Multiple 3D Object Tracking on the Move for Autonomous Vehicles

Nguyen, Pha; Quach, Kha Gia; Duong, Chi Nhan; Le, Ngan; Nguyen, Xuan-Bac; Luu, Khoa

doi:10.1109/cvprw56347.2022.00289

Cited by 6 publications

(3 citation statements)

References 18 publications


“…In scene graph generation (SGG), the traditional two-stage paradigm involves object detection and pairwise predicate estimation [49][50][51][52][53][54][55][56][57][58][59][60]. Recent advancements include knowledge graph embeddings, graph-based architectures, energy-based models, and linguistic supervision [56][61][62][63][64][65][66][67][68][69]. To address challenges like long-tailed distribution and visually irrelevant predicates, the field has seen a pivot towards panoptic segmentation-based SGG, inspired by the simultaneous generation of scene graphs and semantic segmentation masks [34].…”

Section: Related Work (mentioning)

confidence: 99%

HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos

Chappa, Nguyen, et al. 2024, Sensors

Self Cite

Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene-understanding capabilities, we introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention–Flow (HAtt-Flow) mechanism, rooted in flow network theory to enhance GASG performance. Flow–attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional “values” and “keys” are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed flow–attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.

“…Multiple cameras often have shooting coverage areas in the spatial distribution, which ensures that there are no blind spots in monitoring and can continuously track target objects [23]. To automatically determine the target in the next camera's field of vision, it is necessary to match the target in the overlapping area of the camera.…”

Section: B. Construction of a Multi-Camera Positioning System (mentioning)

confidence: 99%

Multi Camera Localization Handover Based on YOLO Object Detection Algorithm in Complex Environments

Wu, Lai 2024, IEEE Access

With the development of computer vision, image processing, and other technologies, the management of smart cities has been enhanced, and intelligent visual detection and tracking technology has progressed. A single-camera monitoring system presents challenges, including limited observation range, unstable tracking, and difficulties in recognizing complex scene obstructions. To overcome these obstacles, a multi-camera monitoring system must be implemented. To enhance the accuracy of multiple cameras' positioning and recognition, while also increasing their efficiency in recognizing targets, this study employs a novel approach that combines spatial mapping based on position data and feature matching based on target objects. Firstly, in the overlapping area of multiple camera targets, a uniform spatial constraint method is used to map and match the target object. The color features of the target object are used for matching. Secondly, the You Only Look Once (YOLO) object detection algorithm is introduced to recognize targets within the overlapping area of the camera using homologous transformation. In this way, a multi-camera positioning technology based on the YOLO object detection algorithm is designed. The test results show that the YOLOv5 algorithm has a maximum mAP accuracy of 97.2% on the test set. At an inference speed of 10 ms, the YOLOv5 algorithm has a maximum mAP accuracy of 51.6%. The average values of the classification loss function, target loss function, and GIoU loss function of the YOLOv5 algorithm are 0.001, 0.01, and 0.015, respectively. The error probability of YOLO within 10 cm in the DukeMTMC-reID dataset remains above 96.5%. The error probability of YOLO within 9.5 cm in the OTB dataset remains above 95%. When the target object is blocked, the highest accuracy of the YOLO positioning system is 0.74. The above results indicate that the multi-camera localization technology based on the YOLO object detection algorithm can improve the accuracy of localization and recognition. It can also solve the problems of object occlusion recognition and continuous object tracking.

“…Unlike conventional action recognition methods that focus on identifying individual actions, GAR aims to classify the actions of a group of people in a given video clip as a whole. It requires a deeper understanding of the interactions between multiple actors, including accurate localization of actors and modeling their spatiotemporal relationships [1][2][3][4][5][6][7][8]. As a result, GAR poses fundamental challenges that must be addressed to develop practical solutions for this problem.…”

Section: Introduction (mentioning)

confidence: 99%

REACT: Recognize Every Action Everywhere All At Once

Chappa, Nguyen, Dobbs, et al. 2024, Preprint

Self Cite

Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (Recognize Every Action Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model explicitly designed to model complex contextual relationships within videos, including multi-modality and spatio-temporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder Block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes, enhancing the link between semantics and visual reality. At the core, our Actor Fusion Block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.
