Densely sampled video patches have been used for video representation in action recognition and achieve better performance than sparse spatiotemporal local features. However, two problems of this method must be considered. First, many video patches come from the background rather than the human body. Second, the descriptor is unreliable, since it is neither shift nor scale invariant. To solve these two problems, we propose an Optimized Video Dense Sampling (OVDS) method that combines dense sampling with a spatiotemporal interest point detector. OVDS densely samples video patches while optimizing the position and scale parameters to guarantee that the features are shift and scale invariant. To omit action-unrelated features, we extract video patches only from human body regions instead of the whole video. Experimental results on the KTH, Weizmann, UCF, and Hollywood2 datasets show that the features detected by OVDS are informative and reliable for action recognition, and achieve better performance than existing spatiotemporal local features.

Index Terms: video representation, action recognition, spatiotemporal local features, dense sampling, shift and scale invariance
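The core idea of restricting dense sampling to human body regions could be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, parameters, and the per-frame body bounding boxes (assumed to come from an external person detector) are all hypothetical.

```python
import numpy as np

def dense_sample_patches(video, body_boxes, base_size=16,
                         scales=(1.0, 1.5, 2.0), stride=8):
    """Sketch of region-restricted multi-scale dense sampling.

    video: (T, H, W) array of grayscale frames.
    body_boxes: per-frame (x0, y0, x1, y1) human-body regions,
        assumed to be produced by an external person detector.
    Returns a list of (t, x, y, size) patch coordinates lying
    entirely inside the body region, sampled on a regular grid
    at several scales (background patches are omitted).
    """
    patches = []
    T, H, W = video.shape
    for t in range(T):
        x0, y0, x1, y1 = body_boxes[t]
        for s in scales:
            size = int(round(base_size * s))
            # Slide a size x size window over the body region only.
            for y in range(y0, y1 - size + 1, stride):
                for x in range(x0, x1 - size + 1, stride):
                    patches.append((t, x, y, size))
    return patches
```

Sampling on a fixed grid at several scales is what makes the resulting patch set insensitive to small shifts and scale changes of the actor, while the bounding-box restriction discards background patches.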