Determining the number of factors is one of the most crucial decisions a researcher has to face when conducting an exploratory factor analysis. As no common factor retention criterion can be seen as generally superior, a new approach is proposed, combining extensive data simulation with state-of-the-art machine learning algorithms. First, data were simulated under a broad range of realistic conditions, and three algorithms were trained using specially designed features based on the correlation matrices of the simulated data sets. Subsequently, the new approach was compared to four common factor retention criteria with regard to its accuracy in determining the correct number of factors in a large-scale simulation experiment. Sample size, variables per factor, correlations between factors, primary and cross-loadings, as well as the correct number of factors were varied to gain comprehensive knowledge of the efficiency of our new method. A gradient boosting model outperformed all other criteria, so in a second step, we improved this model by tuning several hyperparameters of the algorithm and using common retention criteria as additional features. This model reached an out-of-sample accuracy of 99.3% (the pre-trained model can be obtained from https://osf.io/mvrau/). A great advantage of this approach is the possibility to continuously extend the data basis (e.g., using ordinal data) as well as the set of features to improve the predictive performance and to increase generalizability.
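The pipeline described above can be illustrated with a minimal sketch. This is not the authors' actual model or feature set: the simulation design, the `eigen_features` helper, and all parameter values here are simplified assumptions for illustration only (the paper's pre-trained model is available at the OSF link above).

```python
# Hypothetical sketch: train a gradient-boosting classifier to predict the
# number of factors from eigenvalue features of simulated correlation matrices.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

def simulate_dataset(n_factors, n_obs=250, vars_per_factor=4, loading=0.6):
    """Simulate data from a simple orthogonal factor model (illustrative only)."""
    p = n_factors * vars_per_factor
    L = np.zeros((p, n_factors))
    for f in range(n_factors):
        L[f * vars_per_factor:(f + 1) * vars_per_factor, f] = loading
    F = rng.standard_normal((n_obs, n_factors))
    E = rng.standard_normal((n_obs, p)) * np.sqrt(1 - loading ** 2)
    return F @ L.T + E

def eigen_features(X, max_vars=24):
    """Eigenvalues of the correlation matrix, padded to a fixed length."""
    ev = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    out = np.zeros(max_vars)
    out[:len(ev)] = ev[:max_vars]
    return out

# Simulate training sets for 1 to 4 factors and fit the classifier
X_train, y_train = [], []
for k in (1, 2, 3, 4):
    for _ in range(50):
        X_train.append(eigen_features(simulate_dataset(k)))
        y_train.append(k)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict([eigen_features(simulate_dataset(3))])[0]
```

In the actual study, the feature set was much richer and common retention criteria were later added as extra predictors; this sketch only shows the structure of the approach.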
Exploratory factor analysis is a statistical method commonly used in psychological research to investigate latent variables and to develop questionnaires. Although such self-report questionnaires are prone to missing values, there is not much literature on this topic with regard to exploratory factor analysis, and especially the process of factor retention. Determining the correct number of factors is crucial for the analysis, yet little is known about how to deal with missingness in this process. Therefore, in a simulation study, six missing data methods (an expectation-maximization algorithm, predictive mean matching, Bayesian regression, random forest imputation, complete case analysis, and pairwise complete observations) were compared with respect to the accuracy of parallel analysis, the retention criterion used. Data were simulated for correlated and uncorrelated factor structures with two, four, or six factors; 12, 24, or 48 variables; 250, 500, or 1,000 observations; and three different missing data mechanisms. Two different procedures for combining multiply imputed data sets were tested. The results showed that no missing data method was always superior, yet random forest imputation performed best for the majority of conditions, in particular when parallel analysis was applied to the averaged correlation matrix rather than to each imputed data set separately. Complete case analysis and pairwise complete observations were often inferior to multiple imputation.
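The best-performing pooling strategy above can be sketched as follows. This is an assumed toy illustration, not the study's code: the two-factor data generator, the loading of 0.7, and the simple Horn-style `parallel_analysis` helper are all hypothetical choices made for this example.

```python
# Hypothetical sketch: pool m "imputed" data sets by averaging their
# correlation matrices, then run Horn's parallel analysis on the pooled matrix.
import numpy as np

rng = np.random.default_rng(1)

def parallel_analysis(R, n_obs, n_sims=100):
    """Retain factors whose observed eigenvalue exceeds the mean random eigenvalue."""
    p = R.shape[0]
    obs_ev = np.sort(np.linalg.eigvalsh(R))[::-1]
    rand_ev = np.zeros((n_sims, p))
    for i in range(n_sims):
        Z = rng.standard_normal((n_obs, p))
        rand_ev[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    return int(np.sum(obs_ev > rand_ev.mean(axis=0)))

def toy_data(n_obs=500, loading=0.7):
    """Toy 2-factor, 12-variable data; stands in for one imputed data set."""
    L = np.kron(np.eye(2), np.full((6, 1), loading))
    F = rng.standard_normal((n_obs, 2))
    E = rng.standard_normal((n_obs, 12)) * np.sqrt(1 - loading ** 2)
    return F @ L.T + E

# Average the correlation matrices of 5 "imputed" data sets, then pool
R_avg = np.mean([np.corrcoef(toy_data(), rowvar=False) for _ in range(5)], axis=0)
n_factors = parallel_analysis(R_avg, n_obs=500)
```

The alternative procedure tested in the study, applying parallel analysis to each imputed data set separately and then combining the results, is omitted here for brevity.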
Machine learning (ML) provides a powerful framework for the analysis of high-dimensional datasets by modelling the complex relationships often encountered in modern data with many variables, cases and potentially non-linear effects. The impact of ML methods on research and practical applications in the educational sciences is still limited but continues to grow, as larger and more complex datasets become available through massive open online courses (MOOCs) and large-scale investigations. The educational sciences are at a crucial pivot point because of the anticipated impact ML methods hold for the field. To provide educational researchers with an elaborate introduction to the topic, we provide an instructional summary of the opportunities and challenges of ML for the educational sciences, show how a look at related disciplines can help learning from their experiences, and argue for a philosophical shift in model evaluation. We demonstrate how the overall quality of data analysis in educational research can benefit from these methods and show how ML can play a decisive role in the validation of empirical models. Specifically, we (1) provide an overview of the types of data suitable for ML and (2) give practical advice for the application of ML methods. In each section, we provide analytical examples and reproducible R code. Also, we provide an extensive Appendix on ML-based applications for education. This instructional summary will help educational scientists and practitioners to prepare for the promises and threats that come with the shift towards digitisation and large-scale assessment in education.