baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.
Depression is a serious mental disorder affecting millions of people worldwide. Traditional clinical diagnosis methods are subjective, complicated, and require extensive involvement of clinicians. Recent advances in automatic depression analysis systems promise a future where these shortcomings are addressed by objective, repeatable, and readily available diagnostic tools that aid health professionals in their work. Yet a number of barriers to the development of such tools remain. One barrier is that existing automatic depression analysis algorithms base their predictions on very brief sequential segments, sometimes as short as a single frame. Another is that existing methods do not take into account the context of the measured behaviour. In this paper, we extract multi-scale video-level features for video-based automatic depression analysis. We propose to use automatically detected human behaviour primitives as a low-dimensional descriptor for each frame. We also propose two novel spectral representations, i.e. spectral heatmaps and spectral vectors, to represent video-level multi-scale temporal dynamics of expressive behaviour. The constructed spectral representations are fed to Convolutional Neural Networks (CNNs) and Artificial Neural Networks (ANNs) for depression analysis. We conducted experiments on the AVEC 2013 and AVEC 2014 benchmark datasets to investigate the influence of interview tasks on depression analysis. In addition to achieving state-of-the-art accuracy in depression severity estimation, we show that the task conducted by the user matters, that fusing a combination of tasks reaches the highest accuracy, and that longer tasks are more informative than shorter ones, up to a point.
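A minimal sketch of how such a spectral representation could be built, assuming per-frame behaviour primitives (e.g., action-unit intensities, head pose, gaze angles) have already been extracted; the function name, bin count, and log scaling are illustrative choices, not the paper's exact pipeline:

```python
import numpy as np

def spectral_heatmap(primitives, num_bins=64):
    """Build a spectral heatmap from per-frame behaviour primitives.

    primitives : (T, D) array -- T frames, D behaviour primitives.
    Returns a (D, num_bins) map of log-amplitude spectra, one row per
    primitive, capturing video-level temporal dynamics.
    """
    T, D = primitives.shape
    heatmap = np.zeros((D, num_bins))
    for d in range(D):
        signal = primitives[:, d] - primitives[:, d].mean()  # remove DC offset
        spectrum = np.abs(np.fft.rfft(signal))               # amplitude spectrum
        # Resample to a fixed number of frequency bins so videos of
        # different lengths yield representations of the same size.
        bins = np.interp(np.linspace(0, len(spectrum) - 1, num_bins),
                         np.arange(len(spectrum)), spectrum)
        heatmap[d] = np.log1p(bins)
    return heatmap

# A spectral vector would then simply be the flattened heatmap, which an
# ANN can consume directly: vec = spectral_heatmap(primitives).ravel()
```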
This paper aims to solve two important issues that frequently occur in existing automatic personality analysis systems: (1) using very short video segments, or even single frames, rather than long-term behaviour to infer personality traits; and (2) the lack of methods to encode person-specific facial dynamics for personality recognition. To address these issues, this paper first proposes a novel Rank Loss that exploits the natural temporal evolution of facial actions, rather than personality labels, for self-supervised learning of facial dynamics. Our approach first trains a generic U-Net-style model that infers general facial dynamics learned from a set of unlabelled face videos. The generic model is then frozen, and a set of intermediate filters is incorporated into the architecture. Self-supervised learning is then resumed using only person-specific videos. This way, the learned filters' weights are person-specific, making them a valuable source for modelling person-specific facial dynamics. We then propose to concatenate the weights of the learned filters into a person-specific representation, which can be used directly to predict personality traits without needing other parts of the network. We evaluate the proposed approach on both self-reported and apparent personality datasets. In addition to achieving promising results in the estimation of personality trait scores from videos, we show that the task conducted by the subject in the video matters, that fusing a combination of tasks reaches the highest accuracy, and that multi-scale dynamics are more informative than single-scale dynamics.
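One plausible instantiation of such a self-supervised rank loss, sketched in PyTorch under the assumption that frames further apart in time should differ more in the learned dynamics space than frames close in time; the frame-sampling scheme, argument names, and margin are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def temporal_rank_loss(model, frame_t, frame_near, frame_far, margin=0.1):
    """Rank-loss sketch using only temporal order, no personality labels.

    frame_t, frame_near, frame_far : batches of face frames sampled at
    times t, t+k1, t+k2 with k1 < k2 (hypothetical sampling scheme).
    """
    z_t = model(frame_t)        # embedding of the anchor frame
    z_near = model(frame_near)  # embedding k1 frames later
    z_far = model(frame_far)    # embedding k2 frames later
    d_near = F.pairwise_distance(z_t, z_near)
    d_far = F.pairwise_distance(z_t, z_far)
    # Hinge: penalise whenever the temporally distant pair is not at
    # least `margin` further apart than the temporally close pair.
    return F.relu(d_near - d_far + margin).mean()
```

Under this scheme, resuming training on a single person's videos with only the intermediate filters unfrozen would leave the person-specific dynamics encoded in those filters' weights, which can then be concatenated into the representation described above.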
The performance of speaker-related systems usually degrades heavily in practical applications, largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noise-related constraints, it selects noise-invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database show that NIFS significantly improves the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM), and i-vector-based speaker verification systems in different unknown noisy environments at different SNRs, in comparison to their baselines. Moreover, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyperparameters) or features, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks.
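A hedged sketch of what noise-invariant frame selection could look like: keep the frames whose spectral features change least when synthetic noise is added. The constraint used here (clean-vs-noisy MFCC distance), the SNR, and the keep ratio are hypothetical stand-ins for the paper's actual constraints:

```python
import numpy as np
import librosa

def select_noise_invariant_frames(wave, sr, snr_db=10, keep_ratio=0.5):
    """Illustrative noise-invariant frame selection (not the paper's
    exact constraints). Returns indices of the retained frames.
    """
    mfcc_clean = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=20)
    # Corrupt the utterance with white noise at the target SNR.
    noise = np.random.randn(len(wave))
    scale = np.sqrt(np.mean(wave**2) /
                    (10 ** (snr_db / 10) * np.mean(noise**2)))
    mfcc_noisy = librosa.feature.mfcc(y=wave + scale * noise, sr=sr, n_mfcc=20)
    # Per-frame distance between clean and noisy features: a small
    # distance means the frame is comparatively invariant to noise.
    dist = np.linalg.norm(mfcc_clean - mfcc_noisy, axis=0)
    keep = np.argsort(dist)[: int(keep_ratio * dist.size)]
    return np.sort(keep)
```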
The EmoPain 2020 Challenge is the first international competition aimed at creating a uniform platform for comparing multimodal machine learning and multimedia processing methods for the assessment of chronic pain from human expressive behaviour, as well as for the identification of pain-related behaviours. The objective of the challenge is to promote research into assistive technologies that improve the quality of life of people with chronic pain via real-time monitoring and feedback, helping them manage their condition and remain physically active. The challenge also aims to encourage the use of the relatively underutilised, albeit vital, bodily expression signals for automatic pain and pain-related emotion recognition. This paper presents a description of the challenge, the competition guidelines, the benchmarking dataset, and the baseline systems' architecture and performance on the challenge's three sub-tasks: pain estimation from facial expressions, pain recognition from multimodal movement, and protective movement behaviour detection.
Objective. Understanding the cognitive load of drivers is crucial for road safety. Brain sensing has the potential to provide an objective measure of driver cognitive load. We aim to develop an advanced machine learning framework for classifying driver cognitive load using functional near-infrared spectroscopy (fNIRS). Approach. We conducted a study using fNIRS in a driving simulator, with the N-back task used as a secondary task to impart structured cognitive load on drivers. To classify different levels of driver cognitive load, we examined a convolutional autoencoder (CAE) and an Echo State Network (ESN) autoencoder for extracting features from the fNIRS data. Main results. Using the CAE, the accuracies for classifying two and four levels of driver cognitive load with a 30 s window were 73.25% and 47.21%, respectively. The proposed ESN autoencoder achieved state-of-the-art classification results for group-level models without window selection, with accuracies of 80.61% and 52.45% for classifying two and four levels of driver cognitive load. Significance. This work builds a foundation for using fNIRS to measure driver cognitive load in real-world applications. The results also suggest that the proposed ESN autoencoder can effectively extract temporal information from fNIRS data and may be useful for other fNIRS classification tasks.
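A minimal sketch of ESN-style feature extraction from a multichannel fNIRS window, assuming the standard echo-state formulation (fixed random reservoir, trained linear readout); the reservoir size, leak rate, and ridge penalty are illustrative hyperparameters, not the paper's:

```python
import numpy as np

def esn_features(X, n_reservoir=200, spectral_radius=0.9, leak=0.3, seed=0):
    """Extract a feature vector from one fNIRS window X of shape (T, C):
    T time steps over C channels. Input and recurrent weights are random
    and fixed; only the linear readout is fitted.
    """
    rng = np.random.default_rng(seed)
    T, C = X.shape
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, C))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # echo state property
    h = np.zeros(n_reservoir)
    states = np.zeros((T, n_reservoir))
    for t in range(T):
        pre = np.tanh(W_in @ X[t] + W @ h)
        h = (1 - leak) * h + leak * pre  # leaky integration
        states[t] = h
    # Autoencoder-style readout: reconstruct X from the reservoir states
    # via ridge regression; the readout weights summarise the window's
    # temporal dynamics and serve as the feature vector.
    ridge = 1e-3
    W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_reservoir),
                            states.T @ X)  # shape (n_reservoir, C)
    return W_out.ravel()  # feed to a downstream classifier
```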