Fully automatic Facial Expression Recognition (FER) from still images is a challenging task: it involves handling large interpersonal morphological differences, and partial occlusions can occasionally occur. Furthermore, labelling expressions is a time-consuming process that is prone to subjectivity, so the variability may not be fully covered by the training data. In this work, we propose to train Random Forests upon spatially defined local subspaces of the face. The output local predictions form a categorical expression-driven high-level representation that we call Local Expression Predictions (LEPs). LEPs can be combined to describe categorical facial expressions as well as Action Units (AUs). Furthermore, LEPs can be weighted by confidence scores provided by an autoencoder network. This network is trained to locally capture the manifold of the non-occluded training data in a hierarchical way. Extensive experiments show that the proposed LEP representation yields high descriptive power for categorical expressions and AU occurrence prediction, and opens interesting perspectives for the design of occlusion-robust and confidence-aware FER systems.
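As an illustration of how local predictions might be weighted by confidence scores, the following minimal sketch computes a confidence-weighted average of per-region class probabilities. The function name, the weighted-average rule, and the toy values are assumptions made for illustration; the abstract does not specify how the combination is performed.

```python
def fuse_local_predictions(local_probs, confidences):
    """Confidence-weighted average of per-region class probabilities.

    local_probs: list of probability vectors, one per face region.
    confidences: list of non-negative weights, one per region
                 (hypothetical stand-ins for autoencoder confidence scores).
    """
    if len(local_probs) != len(confidences):
        raise ValueError("one confidence per local prediction is required")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one confidence must be positive")
    n_classes = len(local_probs[0])
    fused = [0.0] * n_classes
    for probs, weight in zip(local_probs, confidences):
        for k in range(n_classes):
            fused[k] += weight * probs[k]
    # Normalise by the total weight so the result is again a distribution.
    return [value / total for value in fused]


# Three hypothetical regions, two classes; the third region is treated
# as occluded and therefore receives zero confidence.
local_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
confidences = [1.0, 1.0, 0.0]
fused = fuse_local_predictions(local_probs, confidences)
```

In this toy case the zero-confidence region has no influence on the fused prediction, which is the behaviour one would want from an occlusion-aware combination.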
Face alignment is an active computer vision domain that consists of localizing a number of facial landmarks that vary across datasets. State-of-the-art face alignment methods either perform end-to-end regression or refine the shape in a cascaded manner, starting from an initial guess. In this paper, we introduce DeCaFA, an end-to-end deep convolutional cascade architecture for face alignment. DeCaFA uses fully-convolutional stages to keep full spatial resolution throughout the cascade. Between consecutive cascade stages, DeCaFA uses multiple chained transfer layers with spatial softmax to produce landmark-wise attention maps for each of several landmark alignment tasks. Weighted intermediate supervision, together with efficient feature fusion between the stages, allows the network to learn to progressively refine the attention maps in an end-to-end manner. We show experimentally that DeCaFA significantly outperforms existing approaches on the 300W, CelebA and WFLW databases. In addition, we show that DeCaFA can learn fine alignment with reasonable accuracy from very few images using coarsely annotated data.
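To make the spatial softmax operation concrete, here is a minimal sketch that turns a 2D score map into an attention map whose entries sum to 1. The `expected_location` helper, which reads a landmark position off the map as an attention-weighted mean, is an illustrative addition and is not stated in the abstract; the score values are made up.

```python
import math


def spatial_softmax(score_map):
    """Normalise a 2D score map so that its entries are positive and sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in score_map)
    exps = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def expected_location(attention):
    """Return the attention-weighted mean (row, col) position (assumed readout)."""
    row = sum(i * v for i, r in enumerate(attention) for v in r)
    col = sum(j * v for r in attention for j, v in enumerate(r))
    return row, col


# Toy 3x3 score map with a strong response at row 1, column 2.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 10.0],
          [0.0, 0.0, 0.0]]
attention = spatial_softmax(scores)
row, col = expected_location(attention)
```

Because the softmax is taken over all spatial positions at once, a sharper score peak yields an attention map that concentrates more tightly around a single location.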
Facial expression can be seen as the dynamic variation of one's appearance over time. Successful recognition thus involves finding representations of high-dimensional spatio-temporal patterns that can be generalized to unseen facial morphologies and variations of the expression dynamics. In this paper, we propose to learn Random Forests from heterogeneous derivative features (e.g. facial fiducial point movements or texture variations) upon pairs of images. Those forests are conditioned on the expression label of the first frame to reduce the variability of the ongoing expression transitions. When testing on a specific frame of a video, pairs are created between this current frame and the previous ones. Predictions for each previous frame are used to draw trees from Pairwise Conditional Random Forests (PCRF), whose pairwise outputs are averaged over time to produce robust estimates. As such, PCRF appears as a natural extension of Random Forests for learning spatio-temporal patterns, which leads to significant improvements over standard Random Forests as well as state-of-the-art approaches on several facial expression benchmarks.
The production of facial expressions (FEs) is an important skill that allows children to share and adapt emotions with their relatives and peers during social interactions. These skills are impaired in children with Autism Spectrum Disorder. However, the way in which typical children develop and master their production of FEs has still not been clearly assessed. This study aimed to explore factors that could influence the production of FEs in childhood, such as age, gender, emotion subtype (sadness, anger, joy, and neutral), elicitation task (on request, imitation), area of recruitment (French Riviera and Parisian) and emotion multimodality. A total of 157 children aged 6–11 years were enrolled in Nice and Paris, France. We asked them to produce FEs in two different tasks: imitation with an avatar model and production on request without a model. Results from a multivariate analysis revealed that: (1) children performed better with age; (2) positive emotions were easier to produce than negative emotions; (3) children produced better FEs on request (as opposed to imitation); and (4) Riviera children performed better than Parisian children, suggesting regional influences on emotion production. We conclude that facial emotion production is a complex developmental process influenced by several factors that need to be acknowledged in future research.
Background: Computer vision combined with human annotation could offer a novel method for exploring facial expression (FE) dynamics in children with autism spectrum disorder (ASD). Methods: We recruited 157 children with typical development (TD) and 36 children with ASD in Paris and Nice to perform two experimental tasks to produce FEs with emotional valence. FEs were explored by judging ratings and by random forest (RF) classifiers. To do so, we located a set of 49 facial landmarks in the task videos, generated a set of geometric and appearance features, and used RF classifiers to explore how children with ASD differed from TD children when producing FEs. Results: Using multivariate models including other factors known to predict FEs (age, gender, intellectual quotient, emotion subtype, cultural background), ratings from expert raters showed that children with ASD had more difficulty producing FEs than TD children. In addition, when we explored how RF classifiers performed, we found that classification tasks, except for those for sadness, were highly accurate and that RF classifiers needed more facial landmarks to achieve the best classification for children with ASD. Confusion matrices showed that when RF classifiers were tested in children with ASD, anger was often confused with happiness. Limitations: The sample size of the group of children with ASD was smaller than that of the group of TD children. We tried to compensate for this limitation by using several control calculations. Conclusion: Children with ASD have more difficulty producing socially meaningful FEs. The computer vision methods we used to explore FE dynamics also highlight that the production of FEs in children with ASD carries more ambiguity.
Deep learning approaches are now ubiquitously used to tackle computer vision tasks such as semantic segmentation, requiring large datasets and substantial computational power. Continual learning for semantic segmentation (CSS) is an emerging trend that consists of updating an old model by sequentially adding new classes. However, continual learning methods are usually prone to catastrophic forgetting. This issue is further aggravated in CSS where, at each step, old classes from previous iterations are collapsed into the background. In this paper, we propose Local POD, a multi-scale pooling distillation scheme that preserves long- and short-range spatial relationships at feature level. Furthermore, we design an entropy-based pseudo-labelling of the background w.r.t. classes predicted by the old model to deal with background shift and avoid catastrophic forgetting of the old classes. Our approach, called PLOP, significantly outperforms state-of-the-art methods in existing CSS scenarios, as well as in newly proposed challenging benchmarks.
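The idea of entropy-based pseudo-labelling of the background can be sketched as follows: a pixel labelled as background at the current step takes the class predicted by the old model when that prediction is confident, i.e. has low entropy. The single fixed threshold `max_entropy`, the function names, and the toy probabilities are assumptions for illustration only; the abstract does not give the exact rule used in PLOP.

```python
import math

BACKGROUND = 0  # assumed index of the background class


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pseudo_label(labels, old_probs, max_entropy):
    """Replace background labels by confident old-model predictions.

    labels: flat list of current-step labels (BACKGROUND where unlabelled).
    old_probs: per-pixel class probabilities from the old model.
    max_entropy: hypothetical threshold; less certain pixels keep their label.
    """
    result = []
    for label, probs in zip(labels, old_probs):
        if label == BACKGROUND:
            predicted = max(range(len(probs)), key=probs.__getitem__)
            if predicted != BACKGROUND and entropy(probs) < max_entropy:
                label = predicted
        result.append(label)
    return result


# Three toy pixels: a confident old-class prediction, an uncertain one,
# and a pixel that already carries a new-class label (3).
labels = [0, 0, 3]
old_probs = [[0.02, 0.97, 0.01], [0.3, 0.4, 0.3], [0.9, 0.05, 0.05]]
new_labels = pseudo_label(labels, old_probs, max_entropy=0.5)
```

Only the first pixel is relabelled: the second prediction is too uncertain, and the third pixel is not background, so its ground-truth label is left untouched.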
Automatic facial expression classification is a challenging problem for developing intelligent human-computer interaction systems. To take the expression dynamics into account, existing works usually assume that a specific facial expression is displayed with a pre-segmented evolution, i.e. starting from neutral and finishing on an apex frame. In this paper, we propose a method to train a transition classifier from pairs of images. This transition classifier is applied at multiple time gaps and the output probabilities are fused along with a static estimation. Finally, we show that our approach yields state-of-the-art accuracy on popular datasets without exploiting any such prior on the segmentation of the expression, thus effectively bridging the gap towards facial expression recognition in unconstrained environments.