Convolutional autoencoders (CAEs) are unsupervised feature extractors for high-resolution images. In preprocessing, whitening transformations have been widely adopted to remove redundancy by decorrelating adjacent pixels. Pooling is a biologically inspired operation that reduces the resolution of feature maps and provides spatial invariance in convolutional neural networks; in most previous work, however, the pooling method is chosen empirically. Our main purpose is therefore to study the relationship between whitening and pooling operations in convolutional autoencoders for image classification. We propose an adaptive pooling approach based on information entropy to test the effect of whitening on pooling under different conditions. Experimental results on benchmark datasets indicate that the performance of a pooling strategy is associated with the distribution of feature activations, which whitening can alter. This provides guidance for selecting pooling methods in convolutional autoencoders and other convolutional neural networks.
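The adaptive pooling idea can be sketched as an entropy-guided choice between max and average pooling over each window; the decision rule, bin count, and threshold below are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

def window_entropy(window, bins=8):
    """Shannon entropy of the activation distribution inside one pooling window."""
    hist, _ = np.histogram(window, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def adaptive_pool(fmap, size=2, threshold=1.0):
    """Pool each size x size window with max or average pooling,
    chosen by the entropy of its activations (illustrative rule)."""
    h, w = fmap.shape
    out = np.empty((h // size, w // size))
    for i in range(0, h - h % size, size):
        for j in range(0, w - w % size, size):
            win = fmap[i:i + size, j:j + size]
            # High entropy -> activations spread out -> average pooling;
            # low entropy -> a few dominant activations -> max pooling.
            if window_entropy(win) > threshold:
                out[i // size, j // size] = win.mean()
            else:
                out[i // size, j // size] = win.max()
    return out
```

The point of the sketch is that the pooling operator becomes a function of the activation distribution, which is exactly what whitening reshapes.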
Convolutional neural networks (CNNs) can extract local features of text but cannot capture structural information or semantic relationships between words, and a single CNN's classification performance is limited, whereas a GRU can effectively extract semantic information and global structural relationships in text. To address this problem, this paper proposes a news text classification method based on a GRU_CNN model that combines the advantages of CNN and GRU. The model first trains word vectors with the Word2vec model to form the embedding layer, then extracts semantic information from text sentences with a GRU. The CNN then extracts the crucial semantic features, and a Softmax layer completes the classification. The experimental results reveal that the hybrid GRU_CNN model outperforms single CNN, LSTM, and GRU models in terms of classification effect and accuracy.
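The recurrent unit at the heart of the model is a standard GRU; a minimal NumPy sketch of one GRU time step follows (the textbook gate formulation with illustrative weight shapes, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU update: the gates decide how much of the previous hidden
    state to keep, which lets the unit carry global semantic context."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)                # update gate
    r = sigmoid(x @ Wr + h @ Ur)                # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)    # candidate state
    return (1.0 - z) * h + z * h_tilde          # interpolate old and new
```

In the hybrid model, the sequence of hidden states produced this way would be handed to the convolutional layers for local feature extraction.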
This paper considers human hand trajectory tracking and gesture trajectory recognition from synchronized color and depth video. For hand tracking, a joint observation model combining skin saliency, motion, and depth cues is integrated into a particle filter so that particles move toward local peaks of the likelihood. The proposed tracker, the salient skin, motion, and depth based particle filter (SSMD-PF), considerably improves tracking accuracy when the signer performs gestures toward the camera in front of moving, cluttered backgrounds. For gesture recognition, a shape-order context descriptor built on shape context is introduced, which describes a gesture in the spatiotemporal domain: it captures shape relationships while embedding the temporal order of the gesture sequence into the descriptor, yielding a matching score that is robust to gesture variations. Experiments on challenging hand-signed digits datasets and an American Sign Language dataset corroborate the performance of the proposed techniques.
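The underlying shape context descriptor is a log-polar histogram of where the other contour points lie relative to each reference point; a minimal 2-D sketch follows (bin counts and radial range are illustrative assumptions, and the temporal-ordering extension of the shape-order context is omitted):

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar histogram of the relative positions of all other points,
    computed at each reference point (basic shape context descriptor)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diffs = pts[None, :, :] - pts[:, None, :]          # pairwise offsets
    r = np.linalg.norm(diffs, axis=2)                  # pairwise distances
    theta = np.arctan2(diffs[..., 1], diffs[..., 0]) % (2 * np.pi)
    mean_r = r[r > 0].mean()                           # scale normalisation
    # Log-spaced radial bin edges relative to the mean pairwise distance.
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_r
    hists = np.zeros((n, n_r, n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rb = np.searchsorted(r_edges, r[i, j]) - 1
            tb = int(theta[i, j] / (2 * np.pi) * n_theta) % n_theta
            if 0 <= rb < n_r:
                hists[i, rb, tb] += 1
    return hists
```

Matching two gestures would then compare these histograms point-to-point, with the shape-order variant additionally constraining the temporal order.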
Real-time detection and recognition of human actions is a key problem in intelligent video surveillance systems. Because behavior recognition in surveillance video is affected by scene complexity, the classification performance of existing behavior recognition models is often unsatisfactory. To increase the processing efficiency of the network and address the low classification accuracy of human action recognition, we designed a deep learning model based on three-dimensional (3D) convolutions with multiscale feature fusion to reduce the impact of appearance changes, background clutter, and pedestrian occlusion. After data preprocessing, the model alternates 3D convolution and 3D pooling operations to extract temporal and spatial features from consecutive frames, and then uses a feature pyramid structure to select three feature layers at different scales. It performs deconvolution in bottom-up order and fuses each result with the features of the previous layer; downsampling and high-level feature-layer fusion are then performed sequentially from top to bottom. The newly generated highest-level feature layer is used for abnormal behavior recognition. The proposed feature-fusion C3D network is compared with three state-of-the-art methods (C3D, R3D, and R(2+1)D) on the pedestrian abnormal action recognition (PAAR) dataset under the same parameter settings, and its accuracy is significantly improved.
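The pyramid fusion step can be sketched as combining a coarser feature map with the next finer one after upsampling. This simplified one-direction (top-down) NumPy sketch substitutes nearest-neighbour upsampling for learned deconvolution and assumes matching channel counts; the paper's full model also runs the opposite (downsampling) pass:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(pyramid):
    """Fuse a fine-to-coarse feature pyramid top-down: upsample the
    coarser map and add it to the next finer one (FPN-style sketch)."""
    fused = [pyramid[-1]]                       # start from the coarsest map
    for fmap in reversed(pyramid[:-1]):
        fused.append(fmap + upsample2x(fused[-1]))
    return fused[::-1]                          # finest-first, like the input
```

Each fused layer thus mixes high-level semantics from the coarse maps with the spatial detail of the fine maps before the final recognition head.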