Abstract: High dimensional data are rapidly growing in many domains due to technological advances that help collect data with a large number of variables to better understand a given phenomenon of interest. Particular examples appear in genomics, fMRI data analysis, large-scale healthcare analytics, text/image analysis and astronomy. In the last two decades regularisation approaches have become the methods of choice for analysing such high dimensional data. This paper aims to study the performance o…
“…Identification of a best subset of variables is known to be problematic when the number of explanatory variables is large with respect to the number of subjects and when multicollinearity is present within the data [1]. In this situation, despite their widespread use, it is recognised that selection methods based on exploratory or stepwise procedures using P-values or likelihood-based methods have notable deficiencies, including producing inflated coefficient estimates and downward-biased errors [1-4]. This generally results in models that are overfit, with a relatively high number of variables remaining in a 'final' model rather than a sparse model that contains only variables with the greatest association with the outcome [1].…”
Variable selection in inferential modelling is problematic when the number of variables is large relative to the number of data points, especially when multicollinearity is present. A variety of techniques have been described to identify 'important' subsets of variables from within a large parameter space, but these may produce different results, which creates difficulties with inference and reproducibility. Our aim was to evaluate the extent to which variable selection would change depending on statistical approach and whether triangulation across methods could enhance data interpretation. A real dataset containing 408 subjects, 337 explanatory variables and a normally distributed outcome was used. We show that, with model hyperparameters optimised to minimise cross-validation error, ten methods of automated variable selection produced markedly different results; different variables were selected and model sparsity varied greatly. Comparison between multiple methods provided valuable additional insights. Two variables that were consistently selected and stable across all methods accounted for the majority of the explainable variability; these were the most plausible important candidate variables. Further variables of importance were identified by evaluating selection stability across all methods. In conclusion, triangulation of results across methods, including use of covariate stability, can greatly enhance data interpretation and confidence in variable selection.
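The kind of cross-method comparison described above can be sketched in outline. The fragment below is not the authors' code: it uses synthetic data in place of the real 408 × 337 dataset and only two penalised-regression selectors (lasso and elastic net) in place of the ten methods compared, with hyperparameters tuned by cross-validation and selection stability assessed over bootstrap resamples; all parameter values are illustrative assumptions.

# Illustrative sketch only: synthetic data stand in for the real dataset and two
# penalised-regression selectors stand in for the ten methods compared in the paper.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.utils import resample

X, y = make_regression(n_samples=408, n_features=337, n_informative=5,
                       noise=10.0, random_state=0)

methods = {
    "lasso": lambda: LassoCV(cv=10, random_state=0),
    "elastic_net": lambda: ElasticNetCV(cv=10, l1_ratio=[0.1, 0.5, 0.9, 1.0], random_state=0),
}

n_boot = 50
stability = {name: np.zeros(X.shape[1]) for name in methods}

for name, make_model in methods.items():
    for b in range(n_boot):
        Xb, yb = resample(X, y, random_state=b)       # bootstrap resample
        model = make_model().fit(Xb, yb)              # hyperparameters tuned by CV on the resample
        stability[name] += (model.coef_ != 0)         # record which variables were selected
    stability[name] /= n_boot                         # selection frequency per variable

# Variables selected in more than 80% of resamples by every method are the most
# stable candidates; the 80% threshold is an arbitrary illustrative choice.
stable_sets = [set(np.where(freq > 0.8)[0]) for freq in stability.values()]
print("consistently stable variables:", sorted(set.intersection(*stable_sets)))

In practice the same loop would be run for each selection method under comparison, and the per-method selection frequencies, together with their intersection, used for the triangulation the abstract describes.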
“…The main objective of sparse PCA is to force a number of less important loadings to be zero, resulting in sparse eigenvectors. In order to achieve such sparsity in the extracted components, most of the available methods find the PCs of the covariance matrix by adding a constraint or penalty term to the PCA formulation (1). A constrained ℓ0-norm minimisation problem is usually considered as the basic sparse PCA problem, as follows (see also [5]):…”
Section: Formulation of Sparse PCA (mentioning)
confidence: 99%
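The quoted passage truncates before the formulation it refers to. A standard way of writing the ℓ0-constrained sparse PCA problem for the leading component (our notation; the cited paper's exact statement may differ) is, in LaTeX:

\max_{v \in \mathbb{R}^{p}} \; v^{\top} \Sigma \, v
\quad \text{subject to} \quad \lVert v \rVert_{2} = 1, \;\; \lVert v \rVert_{0} \le k,

where \Sigma is the p × p sample covariance matrix, v is the loading vector and k bounds the number of non-zero loadings; setting k = p recovers ordinary PCA.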
“…High dimensional data are rapidly growing in many different disciplines due to the development of technological advances [1]. High dimensional data are particularly common in natural language processing (NLP).…”
Section: Introduction (mentioning)
confidence: 99%
“…Dimensionality reduction techniques are frequently used for the analysis of high dimensional data from NLP. The curse of dimensionality reminds us of the issues that emerge when working with data in higher dimensions which may not exist in lower dimensions (see, e.g., [1]). PCA and SPCA are two powerful data-analysis tools for carrying out dimensionality reduction in large datasets.…”
High dimensional data are rapidly growing in many different disciplines, particularly in natural language processing. Natural language processing requires working with high dimensional matrices of word embeddings obtained from text data. These matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high dimensional data. In this paper, we study and apply sparse principal component analysis to natural language processing, as it can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while offering two major advantages: faster calculations and easier interpretation of the principal components. These advantages are particularly helpful in big data situations.
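As a rough illustration of the pipeline this abstract describes (not the authors' implementation), TF-IDF features can be extracted from raw text and passed to an off-the-shelf sparse PCA routine; the toy corpus, parameter values and choice of scikit-learn below are assumptions made for the example.

# Toy example: compare the sparsity of ordinary PCA and sparse PCA loadings on a small text matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, SparsePCA

docs = [
    "sparse principal component analysis for text data",
    "principal component analysis reduces dimensionality",
    "word embeddings are high dimensional and often sparse",
    "dimensionality reduction helps interpret text features",
]

X = TfidfVectorizer().fit_transform(docs).toarray()   # TF-IDF matrix, densified for SparsePCA

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Sparse PCA drives many loadings to exactly zero, which is what makes the
# components easier to interpret than dense PCA loadings.
print("non-zero loadings, PCA:  ", np.count_nonzero(pca.components_))
print("non-zero loadings, SPCA: ", np.count_nonzero(spca.components_))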
“…Whereas lasso regression can shrink unnecessary regressors to zero and thereby reduce the number of predictors, ridge regression retains all regressors for inclusion in the model. Both lasso and ridge regression techniques have been shown to perform well when dealing with high-dimensional data under various conditions [64]. Combining the two approaches, elastic-net regression allows for adjustment of the lasso-to-ridge ratio (α), providing greater opportunity for better model fits [65].…”
Section: Elastic-net Regression Combines Penalty Features of Lasso and Ridge (mentioning)
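A minimal sketch of the tuning that the passage above describes, using scikit-learn's ElasticNetCV on synthetic data (parameter values are assumptions, not those of the cited study). Note that scikit-learn calls the lasso-to-ridge mixing parameter l1_ratio and reserves alpha for the overall penalty strength, whereas the passage uses α for the mixing ratio.

# Illustrative only: cross-validation jointly tunes the penalty strength and the
# lasso-to-ridge mixing parameter; l1_ratio = 1.0 is pure lasso, values near 0 approach ridge.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=5.0, random_state=0)

model = ElasticNetCV(l1_ratio=[0.05, 0.25, 0.5, 0.75, 1.0], cv=10, random_state=0).fit(X, y)

print("chosen l1_ratio:", model.l1_ratio_)
print("chosen alpha:   ", model.alpha_)
print("non-zero coefficients:", np.count_nonzero(model.coef_))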
Previous research suggests that the proximity of individuals in a social network predicts how similarly their brains respond to naturalistic stimuli. However, the relationship between social connectedness and brain connectivity in the absence of external stimuli has not been examined. To investigate whether neural homophily between friends exists at rest, we collected resting-state functional magnetic resonance imaging (fMRI) data from 68 school-aged girls, along with social network information from all pupils in their year groups (5,066 social dyads in total). Participants were asked to rate the amount of time they voluntarily spent with each person in their year group, and directed social network matrices and community structure were then determined from these data. No statistically significant relationships between social distance, community homogeneity and similarity of global-level resting-state connectivity were observed. Nor were we able to predict social distance using a machine learning technique (i.e. elastic net regression based on the local-level similarities in resting-state whole-brain connectivity between participants). Although neural homophily between friends exists when viewing naturalistic stimuli, this finding did not extend to functional connectivity at rest in our population. Instead, resting-state connectivity may be less susceptible to the influences of a person's social environment.