Temporal dynamics of postures over time is crucial for sequence-based action recognition. Human actions can be represented by the corresponding motions of articulated skeleton. Most of the existing approaches for skeleton based action recognition model the spatial-temporal evolution of actions based on hand-crafted features. As a kind of hierarchically adaptive filter banks, Convolutional Neural Network (CNN) performs well in representation learning. In this paper, we propose an end-to-end hierarchical architecture for skeleton based action recognition with CNN. Firstly, we represent a skeleton sequence as a matrix by concatenating the joint coordinates in each instant and arranging those vector representations in a chronological order. Then the matrix is quantified into an image and normalized to handle the variable-length problem. The final image is fed into a CNN model for feature extraction and recognition. For the specific structure of such images, the simple max-pooling plays an important role on spatial feature selection as well as temporal frequency adjustment, which can obtain more discriminative joint information for different actions and meanwhile address the variable-frequency problem. Experimental results demonstrate that our method achieves the state-of-art performance with high computational efficiency, especially surpassing the existing result by more than 15 percentage on the challenging ChaLearn gesture recognition dataset.
One key challenging issue of facial expression recognition is to capture the dynamic variation of facial physical structure from videos. In this paper, we propose a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. Our PHRNN models facial morphological variations and dynamical evolution of expressions, which is effective to extract "temporal features" based on facial landmarks (geometry information) from consecutive frames. Meanwhile, in order to complement the still appearance information, a multi-signal convolutional neural network (MSCNN) is proposed to extract "spatial features" from still frames. We use both recognition and verification signals as supervision to calculate different loss functions, which are helpful to increase the variations of different expressions and reduce the differences among identical expressions. This deep evolutional spatial-temporal network (composed of PHRNN and MSCNN) extracts the partial-whole, geometry-appearance, and dynamic-still information, effectively boosting the performance of facial expression recognition. Experimental results show that this method largely outperforms the state-of-the-art ones. On three widely used facial expression databases (CK+, Oulu-CASIA, and MMI), our method reduces the error rates of the previous best ones by 45.5%, 25.8%, and 24.4%, respectively.
Human actions can be represented by the trajectories of skeleton joints. Traditional methods generally model the spatial structure and temporal dynamics of human skeleton with hand-crafted features and recognize human actions by well-designed classifiers. In this paper, considering that recurrent neural network (RNN) can model the long-term contextual information of temporal sequences well, we propose an end-to-end hierarchical RNN for skeleton based action recognition. Instead of taking the whole skeleton as the input, we divide the human skeleton into five parts according to human physical structure, and then separately feed them to five subnets. As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. The final representations of the skeleton sequences are fed into a single-layer perceptron, and the temporally accumulated output of the perceptron is the final decision. We compare with five other deep RNN architectures derived from our model to verify the effectiveness of the proposed network, and also compare with several other methods on three publicly available datasets. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.