Dense saliency-based spatiotemporal feature points for action recognition

Rapantzikos, Konstantinos; Avrithis, Yannis; Kollias, Stefanos

doi:10.1109/cvprw.2009.5206525

Cited by 29 publications

(34 citation statements)

References 6 publications

(9 reference statements)

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…Our method achieves 94.4% and outperforms the approaches based on STIPs [2], [8], [9], [33], [34]. This validates the superiority of feature trajectories over STIPs in discriminating human actions.…”

Section: A Recognition and Detection On Kth And Cmu Datasetsupporting

confidence: 69%

Detecting Human Action as the Spatio-Temporal Tube of Maximum Mutual Information

Wang

Ding

2014

IEEE Trans. Circuits Syst. Video Technol.

View full text Add to dashboard Cite

Human action detection in complex scenes is a challenging problem due to its high-dimensional search space and dynamic backgrounds. To achieve efficient and accurate action detection, we represent a video sequence as a collection of feature trajectories and model human action as the spatiotemporal tube (ST-tube) of maximum mutual information. First, a random forest is built to evaluate the mutual information of feature trajectories toward the action class, and then a oneorder Markov model is introduced to recursively infer the action regions at consecutive frames. By exploring the time-continuity property of feature trajectories, the action region is efficiently inferred at large temporal intervals. Finally, we obtain an STtube by concatenating the consecutive action regions bounding the human bodies. Compared with the popular spatio-temporal cuboid action model, the proposed ST-tube model is not only more efficient, but also more accurate in action localization. Experimental results on the KTH, CMU and UCF sports datasets validate the superiority of our approach over the state-of-the-art methods in both localization accuracy and time efficiency.Index Terms-Action detection, feature trajectory, mutual information, spatio-temporal cuboid (ST-cuboid), spatio-temporal tube (ST-tube).

show abstract

Section: A Recognition and Detection On Kth And Cmu Datasetsupporting

confidence: 69%

Detecting Human Action as the Spatio-Temporal Tube of Maximum Mutual Information

Wang

Ding

2014

IEEE Trans. Circuits Syst. Video Technol.

View full text Add to dashboard Cite

show abstract

“…Among these methods, the human action recognition based on the bag of words (BoW) model has achieved satisfactory results in many tasks and drawn much attention. The this framework, and several STIP detectors have been introduced [6, 7, 8,9,10,11].…”

Section: Related Workmentioning

confidence: 99%

“…The STIP detector in [6] usually suffers from sparse detection results. Later, several methods for detecting STIPs have been reported [7, 8,9,10,11]. Dollr et al…”

Section: Related Workmentioning

confidence: 99%

“…It is a spatiotemporal extension of the Hessian saliency measure, previously applied for object detection. Oikonomopoulos et al [9] proposed a spatio-temporal extension of the salient region detector, and a dense version for this STIP was proposed by Rapantzikos et al [10]. Yu and Kim [11] proposed a STIPs detector which they call V-FAST (Video FAST) by extending the FAST corners [19] into a spatiotemporal domain.…”

Section: Related Workmentioning

confidence: 99%

“…Many STIP detectors have been proposed using many models [6,7,8,9,10,11]. STIP captures characteristic shape and motion in video and provides relatively independent representation of events with respect to their spatiotemporal shifts and scales invariant as well as background clutter and multiple motions in the scene.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Human action recognition with Optimized Video Densely Sampling

Wang

Liu

Xiao

et al. 2013

2013 IEEE International Conference on Multimedia and Expo (ICME)

View full text Add to dashboard Cite

Dense sample video patches have been used for video representation in action recognition and achieve better performance than sparse spatiotemporal local features. However, two problems of this method must be considered. First one, many video patches are from background other than human body. Second one, the descriptor is not reliable, since it is neither shift nor scale invariant. To solve these two problems, we proposed an Optimized Video Dense Sampling (OVDS) method combing with dense sampling and spatiotemporal interest points detector. OVDS densely sampled video patches with optimizing the position and scale parameters to guarantee the features are shift and scale invariant. To omit the action unrelated features, we extracted video patches only from human body regions instead of the whole videos. Experimental results on KTH, Weizmann, UCF, Hoollywood2 datasets showed that the features detected by OVDS are informative and reliable for action recognition, and achieve better performance over the existing spatiotemporal local features.Index Termsvideo representation, action recognition, spatiotemporal local features, dense sample, shift and scale invariant

show abstract