Proceedings of the British Machine Vision Conference 2012
DOI: 10.5244/c.26.123
Learning discriminative space-time actions from weakly labelled videos

Abstract: Current state-of-the-art action classification methods extract feature representations from the entire video clip in which the action unfolds; however, this representation may include irrelevant scene context and movements that are shared among multiple action classes. For example, a waving action may be performed whilst walking, but if the walking movement and scene context also appear in other action classes, they should not be included in a waving-movement classifier. In this work, we propose an actio…
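
The abstract describes learning discriminative space-time action parts from weakly labelled videos, which is naturally cast as multiple instance learning (MIL): each video is a bag of sub-volume descriptors, and only the bag carries an action label. The sketch below is a minimal, hypothetical illustration of that idea using a max-scoring linear instance model trained by sub-gradient descent on toy data; the dimensions, learning rate, and max-pooling rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bag_score(w, b, instances):
    """Score a bag (video) as the maximum over its instance (sub-volume) scores."""
    return np.max(instances @ w + b)

# Toy data: 20 videos, each a bag of 15 sub-volume descriptors of dimension 64.
bags = [rng.normal(size=(15, 64)) for _ in range(20)]
labels = rng.choice([-1.0, 1.0], size=20)   # weak, video-level labels only

w, b, lr = np.zeros(64), 0.0, 0.1
for _ in range(50):                          # sub-gradient steps on a hinge loss
    for X, y in zip(bags, labels):
        i = int(np.argmax(X @ w + b))        # most discriminative instance
        if y * (X[i] @ w + b) < 1.0:         # bag-level margin violated
            w += lr * y * X[i]
            b += lr * y
        w *= 1.0 - lr * 1e-3                 # L2 regularisation via shrinkage

print(bag_score(w, b, bags[0]))              # score for the first video's bag
```

Because the bag score is a max over instances, irrelevant sub-volumes (shared scene context, incidental movements) can receive low scores without penalising the video's label.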

Cited by 34 publications (32 citation statements)
References 32 publications (62 reference statements)
Citing publications: 2014–2020

“…A video is represented as a bag of kinematic modes. Closely related to our work is the work of Sapienza et al. [16], where discriminative action subvolumes are learned in a weakly supervised setting. The learned models are used to classify and localize actions.…”
Section: Related Work (mentioning)
confidence: 99%
“…Instead of using a subvolume representation, we use trajectories, extract trajectory groups [14] from the video, and aim to learn the discriminative trajectory groups that represent the video. Most importantly, our representation maintains the structural spatio-temporal information in each bag, whereas [16] treated each instance in a bag independently.…”
Section: Related Work (mentioning)
confidence: 99%
“…Moreover, discriminative space-time video patches have also been exploited for action recognition [37]. However, action recognition approaches are … Person re-identification challenges in public space scenes [42].…”
Section: Introduction (mentioning)
confidence: 99%
“…Laptev et al. attempt to mitigate this by splitting the spatio-temporal volume into sub-blocks, creating a descriptor for each sub-block, and concatenating them to create the sequence descriptor (Laptev et al. 2008). Sapienza et al. follow a similar vein, encoding individual sub-sequences; however, rather than concatenating them to create a single descriptor, they employ Multiple Instance Learning (MIL) (Sapienza et al. 2012). This accounts for some parts of the sequence being irrelevant, for example before and after the action.…”
Section: Related Work (mentioning)
confidence: 99%
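
To make the contrast in this excerpt concrete, here is a small, hypothetical sketch of the two representations it compares: concatenating sub-block descriptors into one fixed-length sequence descriptor versus keeping them as separate instances in a bag for MIL. The temporal split into four blocks, the 64-d features, and the mean pooling are illustrative assumptions, not either paper's exact pipeline.

```python
import numpy as np

def subblock_descriptors(frame_feats, n_blocks):
    """Split per-frame features into temporal sub-blocks and average each one."""
    return [blk.mean(axis=0) for blk in np.array_split(frame_feats, n_blocks)]

frames = np.random.rand(120, 64)   # toy video: 120 frames, 64-d features each

# Laptev et al. (2008) style: concatenate sub-block descriptors into a single
# sequence descriptor, so every sub-block always contributes to the encoding.
concatenated = np.concatenate(subblock_descriptors(frames, 4))   # shape (256,)

# MIL style (Sapienza et al. 2012): keep the sub-block descriptors as separate
# instances in a bag; a MIL classifier can then down-weight irrelevant
# sub-sequences, e.g. the frames before and after the action.
bag = subblock_descriptors(frames, 4)   # list of four 64-d instance descriptors
```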