Biocomputing 2015 (2014)
DOI: 10.1142/9789814644730_0020

Variable Selection Method for the Identification of Epistatic Models

Abstract: Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced …

Cited by 16 publications (24 citation statements)
References 14 publications
“…There is no gold standard method for determining the threshold that best differentiates signal from noise (Holzinger et al., 2015). Expert consensus (Strobl et al., 2009) suggests that it is best not to interpret or compare importance scores but to rely on the relative rankings of the predictors.…”
Section: Methods
confidence: 99%
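The practice described above — comparing predictors by rank rather than by raw importance score — can be sketched as follows. The predictor names and score values are illustrative stand-ins, not output from the paper's actual analysis:

```python
# Hypothetical importance scores from a single Random Forest run.
# Following the consensus cited above (Strobl et al., 2009), predictors
# are compared by their relative ranking, not by the raw scores.
importances = {"SNP1": 0.042, "SNP2": 0.311, "SNP3": 0.007, "SNP4": 0.198}

# Sort predictors from most to least important.
ranked = sorted(importances, key=importances.get, reverse=True)
print(ranked)  # ['SNP2', 'SNP4', 'SNP1', 'SNP3']
```

Only the ordering is interpreted; the absolute magnitudes are treated as uncalibrated.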
“…As such, our approach provides an interpretive framework by identifying the set of predictors whose importance scores were consistently (i.e. in 100% of 5000 runs) above a standard threshold (Strobl et al., 2009; Holzinger et al., 2015) used for filtering out noise, thereby identifying the predictors that most consistently influence the outcome under study. As noted in Holzinger et al .…”
Section: Methods
confidence: 99%
“…Consequently, it was shown that repeating the machine learning analysis several times with different random number seeds is more reliable than a single run [10, 11]. Specifically, running a machine learning algorithm multiple times with different seeds generates a distribution of VIMr values across runs.…”
Section: Methods
confidence: 99%
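The multi-seed strategy these citing papers describe — re-running the analysis under different random seeds and keeping only predictors that clear a noise threshold in every run — can be sketched in miniature. Here `run_model`, the SNP names, the simulated scores, and the threshold of 0.05 are all illustrative assumptions, not the authors' actual pipeline:

```python
import random

# Toy sketch: repeat a "model" fit with different seeds and retain only
# predictors whose importance exceeds a noise threshold in all runs.
def run_model(seed):
    rng = random.Random(seed)
    # Simulated importance scores: SNP2 carries signal, SNP3 is noise.
    return {"SNP2": 0.30 + rng.uniform(-0.02, 0.02),
            "SNP3": 0.01 + rng.uniform(-0.02, 0.02)}

threshold = 0.05
runs = [run_model(seed) for seed in range(100)]

# Keep predictors above the threshold in 100% of runs.
consistent = [p for p in runs[0]
              if all(r[p] > threshold for r in runs)]
print(consistent)  # ['SNP2']
```

Across runs, the per-predictor scores form a distribution; filtering on consistency over that distribution is more stable than trusting any single seed.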
“…For tree-based machine learning methods such as RF, overfitting generally occurs if the trees are allowed to continue splitting to purity [10, 11]. In other words, if the trees are allowed to become very complex, they are likely to “overreact” to noise in the data.…”
Section: Methods
confidence: 99%
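The overfitting point above comes down to stopping rules that keep a tree from splitting all the way to purity. A minimal sketch of such rules follows; the parameter names (`max_depth`, `min_samples_leaf`) echo common RF implementations but are illustrative here, not a specific library's API:

```python
# Stopping rules that prevent a tree from growing to purity.
def should_stop(depth, n_samples, labels, max_depth=5, min_samples_leaf=10):
    if depth >= max_depth:                 # cap on tree complexity
        return True
    if n_samples < 2 * min_samples_leaf:   # too few samples to split further
        return True
    if len(set(labels)) == 1:              # node is already pure
        return True
    return False

print(should_stop(5, 100, [0, 1]))  # True: depth limit reached
print(should_stop(2, 100, [0, 1]))  # False: node may still be split
```

Without such limits a tree keeps splitting until every leaf is pure, memorizing noise in the training data.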