Variable selection for inferential models with relatively high-dimensional data: Between method heterogeneity and covariate stability as adjuncts to robust selection

Lima, Eliana Martins; Davies, Peter N.; Kaler, Jasmeet; Lovatt, Fiona; Green, Martin J.

doi:10.1038/s41598-020-64829-0

Cited by 22 publications

(21 citation statements)

References 30 publications

“…Importantly, our results also confirm the recently highlighted issue that different analytic methods used on same data can yield different results 11 , both in terms of variables selected and coefficient estimates 9 . The simulated datasets used in this study, in which the true underlying relationships were known, were useful to illustrate such differences between methods.…”

Section: Discussion (supporting)

confidence: 84%

“…Over recent years, methods have been proposed in the statistical literature to improve variable selection for inference in high dimensional data, including modifications to AIC/BIC 5 , and a variety of regularisation methods based on functions that penalise model coefficients to balance over- and under-fitting (the variance-bias trade off) [6][7][8] . It has been shown, however, that different methods of variable selection can result in considerable differences in covariates selected 9 and this poses difficult questions for the researcher about which method to choose, as well as presenting wider concerns around variability of results and therefore the reproducibility of science 10,11 .…”

Section: Model Selection For Inferential Models With High Dimensional (mentioning)

confidence: 99%

“…Triangulation of multiple methods has been proposed as an aid to identify important variables 13 ; in this context triangulation refers to conducting a variety of analytic methods on one set of data, on the premise that the most important variables will tend to be identified by most methods. Indeed, recent research has indicated this approach is likely to be beneficial 9 . However, rather than using triangulation to simply compare methods, a route to formally combine results from several statistical approaches would be advantageous to explicitly represent the additional uncertainty arising from variation between methods.…”

Section: Model Selection For Inferential Models With High Dimensional (mentioning)

confidence: 99%

“…used to estimate covariate stability for all analytic approaches, according to methods previously described 9 . In brief, selection stability 14,15,20 was evaluated for each model as the percentage of times that each covariate was selected in the model across bootstrap samples.…”

Section: Estimation Of Selection Stability and Coefficient Distributi (mentioning)

confidence: 99%
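
The selection stability described in the statement above (the percentage of bootstrap samples in which each covariate is selected) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_by_correlation` is a hypothetical stand-in for whichever variable selection method is being assessed, and the threshold, number of bootstrap samples and simulated data are all assumptions made for the example.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_by_correlation(X, y, threshold=0.3):
    """Hypothetical selector: keep columns whose |r| with y exceeds threshold."""
    n_cols = len(X[0])
    return {j for j in range(n_cols)
            if abs(pearson([row[j] for row in X], y)) > threshold}


def selection_stability(X, y, selector, n_boot=200, seed=0):
    """Percentage of bootstrap samples in which each column is selected."""
    rng = random.Random(seed)
    n = len(X)
    counts = [0] * len(X[0])
    for _ in range(n_boot):
        # Resample rows with replacement, then rerun the selector.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for j in selector(Xb, yb):
            counts[j] += 1
    return [100.0 * c / n_boot for c in counts]


# Simulated data (assumption): column 0 drives the outcome, column 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
y = [2 * row[0] + data_rng.gauss(0, 1) for row in X]
stability = selection_stability(X, y, select_by_correlation)
print(stability)
```

Any selector that takes the resampled data and returns a set of column indices can be passed in place of the correlation filter, so the same loop applies to stepwise or regularised methods.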

Model selection for inferential models with high dimensional data: synthesis and graphical representation of multiple techniques

Lima, Hyde, Green. 2021. Sci Rep. (Self-citation)

Inferential research commonly involves identification of causal factors from within high dimensional data but selection of the ‘correct’ variables can be problematic. One specific problem is that results vary depending on statistical method employed and it has been argued that triangulation of multiple methods is advantageous to safely identify the correct, important variables. To date, no formal method of triangulation has been reported that incorporates both model stability and coefficient estimates; in this paper we develop an adaptable, straightforward method to achieve this. Six methods of variable selection were evaluated using simulated datasets of different dimensions with known underlying relationships. We used a bootstrap methodology to combine stability matrices across methods and estimate aggregated coefficient distributions. Novel graphical approaches provided a transparent route to visualise and compare results between methods. The proposed aggregated method provides a flexible route to formally triangulate results across any chosen number of variable selection methods and provides a combined result that incorporates uncertainty arising from between-method variability. In these simulated datasets, the combined method generally performed as well as or better than the individual methods, with low error rates and clearer demarcation of the true causal variables than for the individual methods.
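
The abstract above does not spell out how stability matrices are combined across methods. One plausible reading, shown here purely as an assumption, is to pool the per-bootstrap selection sets from every method and recompute the selection percentage on the pooled set. The method names, covariate names and selection sets below are made up for illustration.

```python
def stability(selections, covariates):
    """Percentage of selection sets that contain each covariate."""
    n = len(selections)
    return {c: 100.0 * sum(c in s for s in selections) / n for c in covariates}


def pooled_stability(by_method, covariates):
    """Pool selection sets from all methods, then compute stability.

    Assumption: each method contributes one set of selected covariates
    per bootstrap sample, and all methods are weighted by their number
    of bootstrap samples.
    """
    pooled = [s for sels in by_method.values() for s in sels]
    return stability(pooled, covariates)


# Hypothetical output of two methods on two bootstrap samples each.
by_method = {
    "method_a": [{"x1"}, {"x1", "x2"}],
    "method_b": [{"x1"}, set()],
}
covariates = ["x1", "x2"]

for name, sels in by_method.items():
    print(name, stability(sels, covariates))
print("pooled", pooled_stability(by_method, covariates))
```

Pooling in this way lets disagreement between methods lower a covariate's combined stability, which is one simple way of representing between-method variability.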

“…As results are expected to be variable due to the high dimensionality and the comparably low number of available years, we additionally assessed the stability of suitable datasets using different variable selection approaches in a prediction context. We applied two different methods in this study, as a comparison of different approaches is suggested if modeling is performed with high dimensional data 89 . One approach, which is methodologically comparable to our correlation analysis, is the calculation of models with single stepwise forward selection based on Pearson’s correlation and 100 repeated, threefold cross-validation cf.…”

Section: Methods (mentioning)

confidence: 99%

Reanalysis datasets outperform other gridded climate products in vegetation change analysis in peripheral conservation areas of Central Asia

Zandler, Senftl, Vanselow. 2020. Sci Rep.

Global environmental research requires long-term climate data. Yet, meteorological infrastructure is missing in the vast majority of the world’s protected areas. Therefore, gridded products are frequently used as the only available climate data source in peripheral regions. However, associated evaluations are commonly biased towards well observed areas and consequently, station-based datasets. As evaluations on vegetation monitoring abilities are lacking for regions with poor data availability, we analyzed the potential of several state-of-the-art climate datasets (CHIRPS, CRU, ERA5-Land, GPCC-Monitoring-Product, IMERG-GPM, MERRA-2, MODIS-MOD10A1) for assessing NDVI anomalies (MODIS-MOD13Q1) in two particularly suitable remote conservation areas. We calculated anomalies of 156 climate variables and seasonal periods during 2001–2018, correlated these with vegetation anomalies while taking the multiple comparison problem into consideration, and computed their spatial performance to derive suitable parameters. Our results showed that four datasets (MERRA-2, ERA5-Land, MOD10A1, CRU) were suitable for vegetation analysis in both regions, by showing significant correlations controlled at a false discovery rate < 5% and in more than half of the analyzed areas. Cross-validated variable selection and importance assessment based on the Boruta algorithm indicated high importance of the reanalysis datasets ERA5-Land and MERRA-2 in both areas but higher differences and variability between the regions with all other products. CHIRPS, GPCC and the bias-corrected version of MERRA-2 were unsuitable and not important in both regions. We provide evidence that reanalysis datasets are most suitable for spatiotemporally consistent environmental analysis whereas gauge- or satellite-based products and their combinations are highly variable and may not be applicable in peripheral areas.
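
The abstract above reports correlations controlled at a false discovery rate below 5% but does not name the procedure used. As an assumption, the sketch below uses the Benjamini-Hochberg step-up procedure, a common choice for this purpose; the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= k * q / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Invented p-values standing in for many climate-vegetation correlation tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

With many correlated climate variables, a per-test threshold of 0.05 would admit many false positives; controlling the false discovery rate bounds the expected share of false discoveries among the correlations declared significant.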

Drought, psychosocial stress, and ecogeographical patterning: Tibial growth and body shape in Samburu (Kenyan) pastoralist children

Straight, Hilton, Naugle, et al. 2022. American Journal of Biological Anthropology.

Objectives: This study of Samburu pastoralists (Kenya) employs a same-sex sibling design to test the hypothesis that exposure in utero to severe drought and maternal psychosocial stress negatively influence children's growth and adiposity. As a comparison, we also hypothesized that regional climate contrasts would influence children's growth and adiposity based on ecogeographical patterning. Materials and Methods: Anthropometric measurements were taken on Samburu children ages 1.8-9.6 years exposed to severe drought in utero and younger same-sex siblings (drought-exposed, n = 104; unexposed, n = 109) in two regions (highland, n = 128; lowland, n = 85). Mothers were interviewed to assess lifetime and pregnancy-timed stress. Results: Drought exposure associated to lower weight-for-age and higher adiposity. Drought did not associate to tibial growth on its own but the interaction between drought and region negatively associated to tibial growth in girls. In addition, drought exposure and historically low rainfall associated to tibial growth in sensitivity models. A hotter climate positively associated to adiposity and tibial growth. Culturally specific stressors (being forced to work too hard, being denied food by male kin) associated to stature and tibial growth for age. Significant covariates for child outcomes included lifetime reported trauma, wife status, and livestock. Discussion: Children exposed in utero to severe drought, a hotter climate, and psychosocial stress exhibited growth differences in our study. Our results demonstrate that climate change may deepen adverse health outcomes in populations already psychosocially and nutritionally stressed. Our results also highlight the value of ethnography to identifying meaningful stressors.
