2016
DOI: 10.1109/tpami.2016.2537323
Semantic Event Fusion of Different Visual Modality Concepts for Activity Recognition

Abstract: Combining multimodal concept streams from heterogeneous sensors is a problem superficially explored for activity recognition. Most studies explore simple sensors in nearly perfect conditions, where temporal synchronization is guaranteed. Sophisticated fusion schemes adopt problem-specific graphical representations of events that are generally deeply linked with their training data and focused on a single sensor. This paper proposes a hybrid framework between knowledge-driven and probabilistic-driven methods fo…

Cited by 35 publications (19 citation statements)
References 37 publications
“…Various approaches have been proposed to use knowledge representation for video understanding such as semantic-visual knowledge bases like FrameNet and Imagenet for modeling rich event-centric concepts and their relationships for video event detection [43], a knowledge and probabilistic driven framework for activity recognition [44], semantic representations for event detection [45], [46]. Souza et al deploy objects, actions and their bonds into graphs and use simulated annealing for event inference using temporal connections [47], [48].…”
Section: Knowledge Representation for Video Understanding
confidence: 99%
“…); Calibration and synchronization of multi-modal data; Multi-modal datasets and evaluation metrics; Leisure, security, health and energy applications based on multi-modal data; Multi-modal Affective Computing and Social Signal processing systems; Multi-modal algorithms designed for GPU, smart phones and game consoles. A total of 17 papers were published within this Special Issue at IEEE TPAMI [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47]. 2017 TPAMI Faces: Currently we are organizing a Special Issue at IEEE Transactions on Pattern Analysis and Machine Intelligence journal in the topic of face analysis.…”
Section: Special Issues
confidence: 99%
“…Considering the 100 houses deployment plan and its financial viability, the low cost consumer RGBD camera, Asus Xtion, was selected. Furthermore, as previously mentioned, RGBD devices now represent the state of the art for indoor activity monitoring [9,20,21,22,11]. For the SPHERE project, the camera needs to be coupled with a machine with suitable processing capacity, minimal intrusion on the user, and minimal cost.…”
Section: Hardware Platform
confidence: 99%
“…Video based systems are efficient for implementing alert systems to detect dangerous events like falls, as in [20]. Furthermore, video data analysis allows one to identify specific actions, long term activities, and behavioural patterns [9], with some exploiting contextual information [11]. While video based platforms offer the opportunity to extract unique, continuous, and rich information from the home environment, they also present a number of disadvantages, such as privacy issues [9], user acceptance and system cost and scalability.…”
Section: Introduction
confidence: 99%