Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally. This is because current neural network architectures require the entire dataset, consisting of all the samples from the old as well as the new classes, to update the model, a requirement that quickly becomes unsustainable as the number of classes grows. We address this issue with an approach that learns deep neural networks incrementally, using new data and only a small exemplar set of samples from the old classes. It is based on a loss composed of a distillation measure, to retain the knowledge acquired from the old classes, and a cross-entropy loss, to learn the new classes. Our incremental training keeps the entire framework end-to-end, i.e., it learns the data representation and the classifier jointly, unlike recent methods that offer no such guarantees. We evaluate our method extensively on the CIFAR-100 and ImageNet (ILSVRC 2012) image classification datasets and show state-of-the-art performance.
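The combined loss described above can be sketched as follows. This is a minimal numpy illustration of the distillation + cross-entropy idea, not the paper's exact formulation: the temperature `T`, the mixing weight `alpha`, and the batch shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def incremental_loss(new_logits, old_logits, labels, n_old, T=2.0, alpha=0.5):
    """Sketch of a class-incremental loss: cross-entropy on all classes
    plus distillation against the frozen old model on the old classes.

    new_logits : (B, n_old + n_new) outputs of the updated model
    old_logits : (B, n_old) outputs of the frozen old model on the same batch
    labels     : (B,) integer class labels (old or new classes)
    """
    # Cross-entropy term: learn the new classes (and rehearse exemplars).
    p_new = softmax(new_logits)
    ce = -np.log(p_new[np.arange(len(labels)), labels] + 1e-12).mean()
    # Distillation term: match the old model's softened distribution
    # on the old classes, to retain previously acquired knowledge.
    q_old = softmax(old_logits, T)
    q_new = softmax(new_logits[:, :n_old], T)
    distill = -(q_old * np.log(q_new + 1e-12)).sum(axis=-1).mean()
    return alpha * distill + (1 - alpha) * ce
```

In an end-to-end setting both terms are backpropagated jointly through the shared representation, so feature extractor and classifier are updated together.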
This work targets people identification in video based on the way they walk (i.e. gait). While classical methods typically derive gait signatures from sequences of binary silhouettes, in this work we explore the use of convolutional neural networks (CNN) for learning high-level descriptors from low-level motion features (i.e. optical flow components). We carry out a thorough experimental evaluation of the proposed CNN architecture on the challenging TUM-GAID dataset. The experimental results indicate that using spatio-temporal cuboids of optical flow as input data for the CNN yields state-of-the-art results on the gait task at an image resolution eight times lower than in previously reported results (i.e. 80 × 60 pixels).
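Assembling a spatio-temporal cuboid of optical flow amounts to stacking the flow components of consecutive frames into one multi-channel input. The sketch below assumes per-frame flow maps of shape (H, W, 2); the 25-frame window and 80 × 60 resolution in the test mirror the setting mentioned above but are otherwise illustrative.

```python
import numpy as np

def build_flow_cuboid(flow_frames, start, length=25):
    """Stack `length` consecutive optical-flow frames into a single
    CNN input volume (a sketch of the cuboid construction).

    flow_frames : sequence of (H, W, 2) arrays (x/y flow components)
    Returns an (H, W, 2 * length) cuboid starting at frame `start`.
    """
    chunk = flow_frames[start:start + length]
    assert len(chunk) == length, "not enough frames for a full cuboid"
    # Channel-wise concatenation: the temporal axis becomes channels.
    return np.concatenate(chunk, axis=-1)
```

The resulting volume is fed to the CNN like an ordinary multi-channel image, letting the first convolutional layers learn spatio-temporal filters directly.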
People identification using gait information (i.e., the way a person walks) obtained from inertial sensors is a robust approach that can be used in multiple situations where vision-based systems are not applicable. Typically, previous methods use hand-crafted features or deep learning approaches with pre-processed features as input. In contrast, we present a new deep learning-based end-to-end approach that employs raw inertial data as input. In this way, our approach is able to automatically learn the best representations without any constraint introduced by pre-processed features. Moreover, we study the fusion of information from multiple inertial sensors and multi-task learning from multiple labels per sample. Our proposal is experimentally validated on the challenging OU-ISIR dataset, which is the largest available dataset for gait recognition using inertial information. After conducting an extensive set of experiments to obtain the best hyper-parameters, our approach achieves state-of-the-art results. Specifically, we improve both the identification accuracy (from 83.8% to 94.8%) and the authentication equal error rate (from 5.6 to 1.1). Our experimental results suggest that: 1) the use of hand-crafted features is not necessary for this task, as deep learning approaches using raw data achieve better results; 2) the fusion of information from multiple sensors improves the results; and 3) multi-task learning is able to produce a single model that achieves similar or even better results on multiple tasks than the corresponding models trained for a single task.
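The two ideas highlighted above, sensor fusion and multi-task learning, can be sketched minimally: raw streams are fused at the input, and a shared representation feeds one linear head per label. Everything here (layer sizes, number of tasks, the `MultiTaskHead` name) is illustrative, not the architecture from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_sensors(acc, gyro):
    """Early fusion of raw inertial streams by channel concatenation,
    one simple fusion strategy among those one could study.

    acc, gyro : (T, 3) raw accelerometer / gyroscope sequences
    Returns a (T, 6) multi-channel input for an end-to-end network.
    """
    assert acc.shape[0] == gyro.shape[0]
    return np.concatenate([acc, gyro], axis=1)

class MultiTaskHead:
    """Sketch of multi-task learning: one shared feature vector feeds
    several linear heads, one per label available for each sample."""

    def __init__(self, feat_dim, n_classes_per_task):
        self.W = [rng.standard_normal((feat_dim, n)) * 0.01
                  for n in n_classes_per_task]

    def __call__(self, features):
        # One logit vector per task from the same shared features.
        return [features @ W for W in self.W]
```

Training minimizes the sum of the per-task losses, so the shared trunk is shaped by every label at once, which is what allows a single model to serve multiple tasks.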
People identification in video based on the way they walk (i.e. gait) is a relevant task in computer vision with a non-invasive approach. Standard and current approaches typically derive gait signatures from sequences of binary energy maps of subjects extracted from images, but this process introduces a large amount of non-stationary noise, thus conditioning their efficacy. In contrast, in this paper we focus on the raw pixels, or simple functions derived from them, letting advanced learning techniques extract the relevant features. Therefore, we present a comparative study of different Convolutional Neural Network (CNN) architectures on three low-level features (i.e. gray pixels, optical flow channels and depth maps) on two widely adopted and challenging datasets: TUM-GAID and CASIA-B. In addition, we perform a comparative study of different early and late fusion methods used to combine the information obtained from each kind of low-level feature. Our experimental results suggest that (i) the use of hand-crafted energy maps (e.g. GEI) is not necessary, since equal or better results can be achieved from the raw pixels; (ii) the combination of multiple modalities (i.e. gray pixels, optical flow and depth maps) from different CNNs yields state-of-the-art results on the gait task at an image resolution several times smaller than in previously reported results; and (iii) the selection of the architecture is a critical point that can make the difference between state-of-the-art and poor results. He et al. [10] proposed a new kind of CNN, named ResNet, which has a large number of convolutional layers and 'residual connections' to avoid the vanishing gradient problem. Although several papers can be found for the task of human action recognition using DL techniques, few works apply DL to the problem of gait recognition.
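Late (score-level) fusion, one of the strategies such a comparative study covers, can be sketched as a weighted average of the per-class scores produced by the per-modality CNNs. The equal-weight default below is an illustrative assumption, not the scheme selected in the text.

```python
import numpy as np

def late_fusion(scores_per_modality, weights=None):
    """Score-level fusion of per-modality classifiers (a sketch).

    scores_per_modality : list of (B, C) softmax score arrays, one per
                          modality (e.g. gray pixels, flow, depth)
    weights             : optional (M,) modality weights; defaults to
                          a uniform average
    Returns the fused (B, C) scores.
    """
    S = np.stack(scores_per_modality)            # (M, B, C)
    if weights is None:
        weights = np.full(len(S), 1.0 / len(S))
    # Weighted sum over the modality axis.
    return np.tensordot(weights, S, axes=1)      # (B, C)
```

Early fusion, by contrast, concatenates the modalities at the input and trains a single network, so the trade-off is between learning cross-modal filters and keeping per-modality models independent.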
In [22], Hossain and Chetty propose the use of Restricted Boltzmann Machines to extract gait features from binary silhouettes, but a very small probe set (i.e. only ten different subjects) was used for validating their approach. A more recent work, [23], uses a random set of binary silhouettes of a sequence to train a CNN that accumulates the calculated features in order to achieve a global representation of the dataset. In [24], raw 2D GEI are employed to train an ensemble of CNNs, where a Multilayer Perceptron (MLP) is used as classifier. Similarly, in [25] a multilayer CNN is trained with GEI data. A novel approach based on GEI is developed in [8], where the CNN is trained with pairs of gallery-probe samples using a distance metric. Castro et al. [26] use optical flow obtained from raw data frames. An in-depth evaluation of different CNN architectures based on optical flow maps is presented in [27]. Finally, in [28] a multi-task CNN with a combined loss function over multiple kinds of labels is presented. Although most CNNs are trained with visual data (e.g. images or videos), there are some works that build CNNs for different kinds of data, like inertial sensors or human skeletons.
The goal of this paper is to identify individuals by analyzing their gait. Instead of using binary silhouettes as input data (as done in many previous works), we propose and evaluate the use of motion descriptors based on densely sampled short-term trajectories. We take advantage of state-of-the-art people detectors to define custom spatial configurations of the descriptors around the target person, thus obtaining a pyramidal representation of the gait motion. The local motion features (described by the Divergence-Curl-Shear descriptor [1]) extracted on the different spatial areas of the person are combined into a single high-level gait descriptor by using the Fisher Vector encoding [2]. The proposed approach, coined Pyramidal Fisher Motion, is experimentally validated on the recent 'AVA Multiview Gait' dataset [3]. The results show that this new approach achieves promising results in the problem of gait recognition.
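The Fisher Vector encoding aggregates a set of local descriptors against a Gaussian mixture model by stacking gradients of the log-likelihood with respect to the GMM parameters. The sketch below computes only the mean-gradient part under a diagonal-covariance GMM; the full encoding used in practice also includes variance gradients and power/L2 normalization, and the GMM parameters here are assumed to be pre-trained.

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Simplified Fisher Vector encoding (mean-gradient part only).

    X       : (N, D) local motion descriptors from one sequence
    weights : (K,)   GMM mixture weights
    means   : (K, D) GMM means
    sigmas  : (K, D) GMM standard deviations (diagonal covariance)
    Returns a (K * D,) fixed-length descriptor.
    """
    N, D = X.shape
    K = len(weights)
    # Posterior responsibilities gamma[i, k] under the diagonal GMM,
    # computed in log space for numerical stability.
    log_p = np.stack([
        -0.5 * (((X - means[k]) / sigmas[k]) ** 2).sum(axis=1)
        - np.log(sigmas[k]).sum() + np.log(weights[k])
        for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Normalized gradient of the log-likelihood w.r.t. each mean.
    fv = np.stack([
        (gamma[:, k:k + 1] * (X - means[k]) / sigmas[k]).sum(axis=0)
        / (N * np.sqrt(weights[k]))
        for k in range(K)])
    return fv.ravel()
```

Because the output length is fixed (K · D) regardless of how many local descriptors a sequence produces, the encodings from the different pyramid cells can simply be concatenated into the final gait descriptor.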
This work targets people identification in video based on the way they walk (i.e. gait) by using deep learning architectures. We explore the use of convolutional neural networks (CNN) for learning high-level descriptors from low-level motion features (i.e. optical flow components). The low number of training samples per subject and the use of a test set containing subjects different from the training ones make the search for a good CNN architecture a challenging task. We carry out a thorough experimental evaluation, deploying and analyzing four distinct CNN models with different depth but similar complexity. We show that even the simplest CNN models greatly improve on the results obtained with shallow classifiers. All our experiments have been carried out on the challenging TUM-GAID dataset, which contains people in different covariate scenarios (i.e. clothing, shoes, bags).