2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.278
Procedural Generation of Videos to Train Deep Action Recognition Networks

Cited by 103 publications (76 citation statements)
References 53 publications
“…To generalize to in-the-wild images, Mehta et al. [50] proposed a 2D-to-3D knowledge transfer, i.e., using pre-trained 2D pose networks to initialize the 3D pose regression networks, while in [51] common representations are shared between the 2D and 3D tasks. To compensate for the lack of large-scale in-the-wild datasets, recent work has also proposed to generate training images for particular 3D pose datasets such as the CMU MoCap dataset [6] by stitching image regions [8], animating human 3D models [7], [52], using a game engine [53], or by rendering textured 3D body scans [54], [55]. These synthetic datasets have proved useful for training CNN architectures, though they often require a domain adaptation stage.…”
Section: 3D Human Pose From a Single Image
confidence: 99%
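The 2D-to-3D knowledge transfer described in the statement above amounts to initializing the layers a 3D pose network shares with a pre-trained 2D pose network, while the task-specific heads keep their own initialization. A minimal sketch, assuming hypothetical weight dictionaries keyed by layer name (`transfer_backbone_weights`, the `backbone.` prefix, and the toy shapes are illustrative, not taken from the cited papers):

```python
import numpy as np

def transfer_backbone_weights(pose2d_weights, pose3d_weights, prefix="backbone."):
    """Copy shared-backbone weights from a pre-trained 2D pose network into a
    3D pose network's state; head layers keep their existing initialization.
    (Hypothetical helper for illustration; not the cited papers' code.)"""
    transferred = dict(pose3d_weights)
    for name, w in pose2d_weights.items():
        if name.startswith(prefix) and name in transferred:
            transferred[name] = w.copy()  # shared layer: take 2D-trained weights
    return transferred

# Toy example: one shared backbone layer plus task-specific heads.
rng = np.random.default_rng(0)
w2d = {"backbone.conv1": rng.normal(size=(3, 3)), "head2d.fc": rng.normal(size=(3,))}
w3d = {"backbone.conv1": np.zeros((3, 3)), "head3d.fc": rng.normal(size=(4,))}

init3d = transfer_backbone_weights(w2d, w3d)
```

In a real framework this corresponds to loading a partial checkpoint with non-strict matching, so only overlapping layer names are transferred.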
“…Another increasingly popular way to overcome the lack of large-scale datasets is the use of synthetic data, such as VEIS [28], SYNTHIA [29], Virtual KITTI [30], and GTA-V [31]. Synthetic data is usually used to augment real training data [29], [32]. The SYNTHIA dataset is generated by rendering a virtual city created with the Unity development platform for semantic segmentation of driving scenes.…”
Section: Related Work
confidence: 99%
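Augmenting scarce real data with a large synthetic set, as described above, is often implemented by oversampling the small real set so each epoch sees a controlled mix. A minimal sketch under stated assumptions (the dataset lists, `mix_synthetic_and_real`, and the 50/50 target ratio are all hypothetical, not from the cited works):

```python
import random

def mix_synthetic_and_real(synthetic, real, real_fraction=0.5, seed=0):
    """Build a shuffled epoch schedule in which the small real set is
    oversampled (with replacement) so roughly `real_fraction` of the
    samples are real. Illustrative helper, not a library API."""
    rng = random.Random(seed)
    # Number of real samples needed to reach the target fraction.
    n_real_target = int(len(synthetic) * real_fraction / (1.0 - real_fraction))
    oversampled_real = [real[rng.randrange(len(real))] for _ in range(n_real_target)]
    schedule = list(synthetic) + oversampled_real
    rng.shuffle(schedule)
    return schedule

synthetic = [("synt", i) for i in range(1000)]  # e.g. rendered frames
real = [("real", i) for i in range(50)]         # scarce annotated frames
schedule = mix_synthetic_and_real(synthetic, real)
```

A domain adaptation stage (as the first statement notes) would typically follow or replace this naive mixing when the synthetic-to-real gap is large.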
“…Khodabandeh et al. [13] provide a method to automatically generate an action recognition dataset by partitioning a video into action, subject, and context. Souza et al. [14] proposed a database of simulated human actions: they used motion capture data containing action annotations, combined with 3D human models in a simulated environment, and showed that this improves action recognition rates when combined with a small amount of annotated real-world data.…”
Section: Related Work
confidence: 99%