This work presents a novel approach to the problem of real-time human action recognition in intelligent video surveillance. For more efficient and precise labeling of an action, this work proposes a multilevel action descriptor, which delivers complete information of human actions. The action descriptor consists of three levels: posture, locomotion, and gesture level; each of which corresponds to a different group of subactions describing a single human action, for example, smoking while walking. The proposed action recognition method is able to localize and recognize simultaneously the actions of multiple individuals using appearance-based temporal features with multiple convolutional neural networks (CNN). Although appearance cues have been successfully exploited for visual recognition problems, appearance, motion history, and their combined cues with multi-CNNs have not yet been explored. Additionally, the first systematic estimation of several hyperparameters for shape and motion history cues is investigated. The proposed approach achieves a mean average precision (mAP) of 73.2% in the frame-based evaluation over the newly collected large-scale ICVL video dataset. The action recognition model can run at around 25 frames per second, which is suitable for real-time surveillance applications.