The increasing availability of high‐dimensional, fine‐grained data about human behaviour, gathered from mobile sensing studies and in the form of digital footprints, is poised to drastically alter the way personality psychologists perform research and undertake personality assessment. These new kinds and quantities of data raise important questions about how to analyse the data and interpret the results appropriately. Machine learning models are well suited to these kinds of data, allowing researchers to model highly complex relationships and to evaluate the generalizability and robustness of their results using resampling methods. The correct usage of machine learning models requires specialized methodological training that considers issues specific to this type of modelling. Here, we first provide a brief overview of past studies using machine learning in personality psychology. Second, we illustrate the main challenges that researchers face when building, interpreting, and validating machine learning models. Third, we discuss the evaluation of personality scales, derived using machine learning methods. Fourth, we highlight some key issues that arise from the use of latent variables in the modelling process. We conclude with an outlook on the future role of machine learning models in personality research and assessment.
For decades, day–night patterns in behaviour have been investigated by asking people about their sleep–wake timing, their diurnal activity patterns, and their sleep duration. We demonstrate that the increasing digitalization of lifestyle offers new possibilities for research to investigate day–night patterns and related traits with the help of behavioural data. Using smartphone sensing, we collected in vivo data from 597 participants across several weeks and extracted behavioural day–night pattern indicators. Using these data, we explored three popular research topics. First, we focused on individual differences in day–night patterns by investigating whether ‘morning larks’ and ‘night owls’ manifest in smartphone‐sensed behavioural indicators. Second, we examined whether personality traits are related to day–night patterns. Finally, exploring social jetlag, we investigated whether traits and workweek day–night behaviours influence day–night patterns on weekends. Our findings highlight that behavioural data play an essential role in understanding daily routines and their relations to personality traits. We discuss how psychological research can integrate new behavioural approaches to study personality.
Cognitive reserve (CR) is understood as the capacity to cope with challenging conditions, e.g. after brain injury, in states of brain dysfunction, or during age-related cognitive decline. CR in elderly subjects has attracted much research interest, but differences between healthy older and younger subjects have not hitherto been addressed in detail. Usually, one-time standard individual assessments are used to characterise CR. Here we observe CR as individual improvement in cognitive performance (gain) in a complex testing-the-limits paradigm, the digit symbol substitution test (DSST), with 10 repeated measurements, in 140 younger (20–30 yrs) and 140 older (57–74 yrs) healthy subjects. In addition, we assessed attention, memory and executive function, as well as mood and personality traits, as potential influence factors for CR. We found that both younger and older subjects showed significant gains, which were significantly correlated with speed of information processing, verbal short-term memory, and visual problem solving in the older group only. Gender, personality traits, and mood did not significantly influence gains in either group. Surprisingly, about half of the older subjects performed at the level of the younger group, suggesting that interindividual differences in CR are possibly age-independent. We propose that these findings may also be understood as an indication that one-time standard individual measurements do not allow assessment of CR, and that the use of the DSST in a testing-the-limits paradigm is a valuable assessment method for CR in young and elderly subjects.
Longitudinal panels include several thousand participants and variables. Traditionally, psychologists analyze only a few variables – partly because common unregularized linear models perform poorly when the number of variables (p) approaches the number of observations (N). Predictive modeling methods can be used when such N ≈ p situations arise in psychological research. We illustrate these techniques on exemplary variables from the German GESIS Panel, while describing the choice of preprocessing, model classes, resampling techniques, hyperparameter tuning, and performance measures. In analyses with about 2,000 subjects and variables each, we predict panelists’ gender, sick days, an evaluation of US President Trump, income, life satisfaction, and sleep satisfaction. Elastic net and random forest models were compared to dummy predictions in benchmark experiments. While good performance was achieved, the linear elastic net performed similarly to the nonlinear random forest. Elastic nets were refitted to extract the ten most important predictors. Their interpretation validates our approach, and further modeling options are discussed. Code can be found at https://osf.io/zpse3/
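The benchmark workflow described above can be sketched in a few lines. The following is an illustrative Python/scikit-learn analogue (the paper's own materials use R); the data are synthetic, generated to mimic an N ≈ p setting, and the model settings are assumptions, not the paper's configuration.

```python
# Minimal sketch: elastic net and random forest vs. a dummy baseline,
# compared via cross-validated R^2 in an N ~ p setting (synthetic data).
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

# 200 observations, 150 predictors, only a few of them informative
X, y = make_regression(n_samples=200, n_features=150, n_informative=10,
                       noise=10.0, random_state=1)

models = {
    "dummy": DummyRegressor(strategy="mean"),
    "elastic net": ElasticNetCV(cv=3, random_state=1),  # tunes its own penalty
    "random forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation as the resampling technique, R^2 as the measure
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {results[name]:.2f}")
```

On linear synthetic data like this, the elastic net typically matches or beats the random forest, mirroring the paper's finding that the linear model performed similarly to the nonlinear one.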
Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary-classification, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding), or that reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding), were not as effective in comparison.
Supervised machine learning (ML) is becoming an influential research method in psychology and other social sciences. However, theoretical ML concepts and predictive modeling techniques are not yet widely taught in psychology programs. This tutorial is intended to provide a low-barrier, non-technical entrance to supervised ML for psychologists in four consecutive modules. After introducing the basic idea of supervised ML, Module I covers performance evaluation of ML models with resampling methods (performance measures, bias-variance tradeoff, k-fold cross-validation). Module II introduces nonlinear, tree-based algorithms, focusing on random forests and their components, regression and classification trees. Module III is about performing empirical benchmark experiments (comparing the performance of several ML algorithms on multiple datasets). Finally, Module IV discusses the interpretation of ML models, including permutation variable importance measures, effect plots (partial dependence plots, individual conditional expectation profiles, accumulated local effect plots), and the concept of model fairness. Throughout the tutorial, intuitive descriptions of theoretical concepts (with as few mathematical formulas as possible) are followed by code examples, using the mlr3 and companion packages in R. Key practical analysis steps are demonstrated on the publicly available PhoneStudy dataset (N = 624), which includes over 1800 variables from smartphone sensing to predict Big Five personality trait scores. The manuscript contains a checklist to be used as a reminder on important aspects when performing, reporting, or reviewing ML analyses in psychology. Additional examples and more advanced concepts are demonstrated in extensive online materials (https://osf.io/9273g/).
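The tutorial's key steps can be previewed in compact form. The sketch below is a Python/scikit-learn analogue of Modules I, II, and IV (the tutorial itself uses mlr3 in R): k-fold cross-validation of a random forest, followed by permutation variable importance on a held-out set. The data are synthetic; the PhoneStudy dataset is not reproduced here.

```python
# Module I/II: 10-fold cross-validation of a random forest.
# Module IV: permutation variable importance on a held-out test set.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

rf = RandomForestRegressor(n_estimators=200, random_state=42)

# Performance evaluation with 10-fold cross-validation (R^2 as the measure)
cv_r2 = cross_val_score(
    rf, X, y, scoring="r2",
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
)
print(f"10-fold CV R^2: {cv_r2.mean():.2f}")

# Permutation importance: drop in performance when one feature is shuffled
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
rf.fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
top = imp.importances_mean.argsort()[::-1][:5]
print("Top 5 feature indices by permutation importance:", top)
```

Effect plots (partial dependence, ICE, ALE) follow the same pattern with `sklearn.inspection.PartialDependenceDisplay`, though the tutorial demonstrates them with mlr3 companion packages.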
Psychology has seen an increase in machine learning (ML) methods. In many applications, observations are classified into one of two groups (binary classification). Off-the-shelf classification algorithms assume that the costs of a misclassification (false-positive or false-negative) are equal. Because this is often not reasonable (e.g., in clinical psychology), cost-sensitive learning (CSL) methods can take different cost ratios into account. We present the mathematical foundations and introduce a taxonomy of the most commonly used CSL methods, before demonstrating their application and usefulness on psychological data, i.e., the drug consumption dataset ($N = 1885$) from the UCI Machine Learning Repository. In our example, all demonstrated CSL methods noticeably reduce mean misclassification costs compared to regular ML algorithms. We discuss the necessity for researchers to perform small benchmarks of CSL methods for their own practical application. Thus, our open materials provide R code, demonstrating how CSL methods can be applied within the mlr3 framework (https://osf.io/cvks7/).
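One of the simplest CSL methods in such a taxonomy is thresholding: instead of classifying as positive at a predicted probability of 0.5, the cutoff is moved to the cost-optimal value c_FP / (c_FP + c_FN). The sketch below illustrates this on synthetic data with an assumed 10:1 cost ratio; it is not the paper's analysis of the drug consumption dataset.

```python
# Cost-sensitive thresholding: shift the decision cutoff by the cost ratio.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cost_fn, cost_fp = 10.0, 1.0  # assume a false negative is 10x as costly

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

def mean_cost(y_true, y_pred):
    """Mean misclassification cost over all test observations."""
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

# Default threshold 0.5 vs. cost-optimal threshold c_FP / (c_FP + c_FN)
for thr in (0.5, cost_fp / (cost_fp + cost_fn)):
    pred = (proba >= thr).astype(int)
    print(f"threshold={thr:.3f}: mean cost = {mean_cost(y_te, pred):.3f}")
```

Lowering the threshold trades cheap false positives for expensive false negatives, which is why the cost-optimal cutoff typically reduces mean misclassification cost, in line with the reductions the paper reports for CSL methods generally.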