Abstract: High dimensional data are rapidly growing in many domains due to technological advances that help collect data with a large number of variables to better understand a given phenomenon of interest. Particular examples appear in genomics, fMRI data analysis, large-scale healthcare analytics, text/image analysis and astronomy. In the last two decades regularisation approaches have become the methods of choice for analysing such high dimensional data. This paper aims to study the performance o…
“…Identification of a best subset of variables is known to be problematic when the number of explanatory variables is large with respect to the number of subjects and when multicollinearity is present within the data [1]. In this situation, despite their widespread use, it is recognised that selection methods based on exploratory or stepwise procedures using P-values or likelihood-based methods have notable deficiencies, including producing inflated coefficient estimates and downward-biased errors [1-4]. This generally results in models that are overfit, with a relatively high number of variables remaining in a 'final' model rather than a sparse model that contains only variables with the greatest association with the outcome [1].…”
Variable selection in inferential modelling is problematic when the number of variables is large relative to the number of data points, especially when multicollinearity is present. A variety of techniques have been described to identify 'important' subsets of variables from within a large parameter space, but these may produce different results, which creates difficulties with inference and reproducibility. Our aim was to evaluate the extent to which variable selection would change depending on statistical approach and whether triangulation across methods could enhance data interpretation. A real dataset containing 408 subjects, 337 explanatory variables and a normally distributed outcome was used. We show that, with model hyperparameters optimised to minimise cross-validation error, ten methods of automated variable selection produced markedly different results; different variables were selected and model sparsity varied greatly. Comparison between multiple methods provided valuable additional insights. Two variables that were consistently selected and stable across all methods accounted for the majority of the explainable variability; these were the most plausible important candidate variables. Further variables of importance were identified by evaluating selection stability across all methods. In conclusion, triangulation of results across methods, including use of covariate stability, can greatly enhance data interpretation and confidence in variable selection.
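The kind of cross-method comparison described above can be sketched in outline. The fragment below is not the authors' code: it uses synthetic data in place of the real 408 × 337 dataset and only two penalised-regression selectors (lasso and elastic net) in place of the ten methods compared, with hyperparameters tuned by cross-validation and selection stability assessed over bootstrap resamples; all parameter values are illustrative assumptions.

# Illustrative sketch only: synthetic data stand in for the real dataset and two
# penalised-regression selectors stand in for the ten methods compared in the paper.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.utils import resample

X, y = make_regression(n_samples=408, n_features=337, n_informative=5,
                       noise=10.0, random_state=0)

methods = {
    "lasso": lambda: LassoCV(cv=10, random_state=0),
    "elastic_net": lambda: ElasticNetCV(cv=10, l1_ratio=[0.1, 0.5, 0.9, 1.0], random_state=0),
}

n_boot = 50
stability = {name: np.zeros(X.shape[1]) for name in methods}

for name, make_model in methods.items():
    for b in range(n_boot):
        Xb, yb = resample(X, y, random_state=b)       # bootstrap resample
        model = make_model().fit(Xb, yb)              # hyperparameters tuned by CV on the resample
        stability[name] += (model.coef_ != 0)         # record which variables were selected
    stability[name] /= n_boot                         # selection frequency per variable

# Variables selected in more than 80% of resamples by every method are the most
# stable candidates; the 80% threshold is an arbitrary illustrative choice.
stable_sets = [set(np.where(freq > 0.8)[0]) for freq in stability.values()]
print("consistently stable variables:", sorted(set.intersection(*stable_sets)))

In practice the same loop would be run for each selection method under comparison, and the per-method selection frequencies, together with their intersection, used for the triangulation the abstract describes.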
“…The main objective of sparse PCA is to force a number of less important loadings to be zero, resulting in sparse eigenvectors. In order to achieve such sparsity in the extracted components, most of the available methods find the PCs of the covariance matrix by adding a constraint or penalty term to the PCA formulation (1). A constrained ℓ0-norm minimisation problem is usually considered as the basic sparse PCA problem, as follows (see also [5]):…”
Section: Formulation of Sparse PCA (mentioning)
confidence: 99%
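The quoted passage truncates before the formulation it refers to. A standard way of writing the ℓ0-constrained sparse PCA problem for the leading component (our notation; the cited paper's exact statement may differ) is, in LaTeX:

\max_{v \in \mathbb{R}^{p}} \; v^{\top} \Sigma \, v
\quad \text{subject to} \quad \lVert v \rVert_{2} = 1, \;\; \lVert v \rVert_{0} \le k,

where \Sigma is the p × p sample covariance matrix, v is the loading vector and k bounds the number of non-zero loadings; setting k = p recovers ordinary PCA.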
“…High dimensional data are rapidly growing in many different disciplines due to the development of technological advances [1]. High dimensional data are particularly common in natural language processing (NLP).…”
Section: Introduction (mentioning)
confidence: 99%
“…Dimensionality reduction techniques are frequently used for the analysis of high dimensional data from NLP. The curse of dimensionality reminds us of the issues that emerge when working with data in higher dimensions which may not exist in lower dimensions (see, e.g., [1]). PCA and SPCA are two powerful data-analysis tools for carrying out dimensionality reduction in large datasets.…”
High dimensional data are rapidly growing in many different disciplines, particularly in natural language processing. Natural language processing requires working with high dimensional matrices of word embeddings obtained from text data. These matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high dimensional data. In this paper, we study and apply sparse principal component analysis to natural language processing, as it can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while offering two major advantages: faster calculations and easier interpretation of the principal components. These advantages are particularly helpful in big data situations.
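As a rough illustration of the pipeline this abstract describes (not the authors' implementation), TF-IDF features can be extracted from raw text and passed to an off-the-shelf sparse PCA routine; the toy corpus, parameter values and choice of scikit-learn below are assumptions made for the example.

# Toy example: compare the sparsity of ordinary PCA and sparse PCA loadings on a small text matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, SparsePCA

docs = [
    "sparse principal component analysis for text data",
    "principal component analysis reduces dimensionality",
    "word embeddings are high dimensional and often sparse",
    "dimensionality reduction helps interpret text features",
]

X = TfidfVectorizer().fit_transform(docs).toarray()   # TF-IDF matrix, densified for SparsePCA

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Sparse PCA drives many loadings to exactly zero, which is what makes the
# components easier to interpret than dense PCA loadings.
print("non-zero loadings, PCA:  ", np.count_nonzero(pca.components_))
print("non-zero loadings, SPCA: ", np.count_nonzero(spca.components_))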
“…Whereas lasso regression can shrink unnecessary regressors to zero and thereby reduce the number of predictors, ridge regression retains all regressors for inclusion in the model. Both lasso and ridge regression techniques have been shown to perform well when dealing with high-dimensional data under various conditions [64]. Combining the two approaches, elastic-net regression allows for adjustment of the lasso-to-ridge ratio (α), providing greater opportunity for better model fits [65].…”
Section: Elastic-net Regression Combines Penalty Features of Lasso and Ridge (mentioning)
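A minimal sketch of the tuning that the passage above describes, using scikit-learn's ElasticNetCV on synthetic data (parameter values are assumptions, not those of the cited study). Note that scikit-learn calls the lasso-to-ridge mixing parameter l1_ratio and reserves alpha for the overall penalty strength, whereas the passage uses α for the mixing ratio.

# Illustrative only: cross-validation jointly tunes the penalty strength and the
# lasso-to-ridge mixing parameter; l1_ratio = 1.0 is pure lasso, values near 0 approach ridge.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=5.0, random_state=0)

model = ElasticNetCV(l1_ratio=[0.05, 0.25, 0.5, 0.75, 1.0], cv=10, random_state=0).fit(X, y)

print("chosen l1_ratio:", model.l1_ratio_)
print("chosen alpha:   ", model.alpha_)
print("non-zero coefficients:", np.count_nonzero(model.coef_))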
Previous research suggests that the proximity of individuals in a social network predicts how similarly their brains respond to naturalistic stimuli. However, the relationship between social connectedness and brain connectivity in the absence of external stimuli has not been examined. To investigate whether neural homophily between friends exists at rest, we collected resting-state functional magnetic resonance imaging (fMRI) data from 68 school-aged girls, along with social network information from all pupils in their year groups (5,066 social dyads in total). Participants were asked to rate the amount of time they voluntarily spent with each person in their year group, and directed social network matrices and community structure were then determined from these data. No statistically significant relationships between social distance, community homogeneity and similarity of global-level resting-state connectivity were observed. Nor were we able to predict social distance using a machine learning technique (i.e. elastic net regression based on the local-level similarities in resting-state whole-brain connectivity between participants). Although neural homophily between friends exists when viewing naturalistic stimuli, this finding did not extend to functional connectivity at rest in our population. Instead, resting-state connectivity may be less susceptible to the influences of a person's social environment.