2004
DOI: 10.1002/gepi.20041

Identifying SNPs predictive of phenotype using random forests

Abstract: There has been a great interest and a few successes in the identification of complex disease susceptibility genes in recent years. Association studies, where a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneou…

Cited by 326 publications (273 citation statements)
References: 19 publications
“…[31] There has been great interest in the random forest classification procedure in studies with a large number of SNPs. [34] In unbalanced association studies the misclassification rate can be high, and therefore we use the random forest ranking only in combination with other criteria. Use of a 2-step procedure, with model selection based on AIC and a ranking function for the SNPs for validation, reduces the effect of selecting a model that by chance produces the largest difference between cases and control subjects.…”
Section: Discussion (mentioning)
confidence: 99%
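
The AIC-based model-selection step mentioned in the statement above can be illustrated with a minimal Python sketch. The citing paper's actual models are not given here, so the two candidate models (one shared case rate versus one case rate per genotype), the function names, and the genotype counts are all assumptions made for illustration. AIC is computed as 2k - 2 ln L, where k is the number of fitted parameters and L is the maximized likelihood; the model with the lower AIC is preferred.

```python
import math


def binom_loglik(cases, total, p):
    """Bernoulli log-likelihood of `cases` cases among `total` subjects.

    The binomial coefficient is dropped; it is identical for every model
    compared on the same data, so it does not affect AIC differences.
    """
    ll = 0.0
    if cases:
        ll += cases * math.log(p)
    if total - cases:
        ll += (total - cases) * math.log(1 - p)
    return ll


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik


def compare_models(counts):
    """Compare two hypothetical models for a single SNP.

    counts maps genotype (0, 1, 2) to (n_cases, n_subjects).
    "null":     one case rate shared by all genotypes (k = 1).
    "genotype": a separate case rate per genotype (k = number of genotypes).
    """
    all_cases = sum(c for c, _ in counts.values())
    all_subjects = sum(n for _, n in counts.values())
    null_ll = binom_loglik(all_cases, all_subjects, all_cases / all_subjects)
    geno_ll = sum(binom_loglik(c, n, c / n) for c, n in counts.values() if n)
    return {"null": aic(null_ll, 1), "genotype": aic(geno_ll, len(counts))}


# Made-up counts: case rate rises with the number of minor alleles.
assoc = compare_models({0: (30, 100), 1: (50, 100), 2: (70, 100)})
# Made-up counts: case rate is the same for every genotype.
noise = compare_models({0: (50, 100), 1: (50, 100), 2: (50, 100)})
print(assoc)
print(noise)
```

In the first example the per-genotype model has the lower AIC, while in the second the extra parameters are penalized and the null model is preferred. A second, independent ranking of the SNPs, as the statement describes, would then serve as a check on the selected model.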
“…Use of the importance measure precludes the need to explicitly model every possible interaction terms; and makes interaction analysis of many variables less strenuous. RF was shown to perform well by simulation, [13] and in genetic studies with moderate number of variables, including microarray data analysis [14,15] and association analyses with no more than hundreds of SNPs. [13,16,17] It can effectively select the few important variables out from a large number of irrelevant ones (noise), and be used when the number of variables is much larger than the number of observations.…”
Section: Introduction (mentioning)
confidence: 99%
“…RF was shown to perform well by simulation, [13] and in genetic studies with moderate number of variables, including microarray data analysis [14,15] and association analyses with no more than hundreds of SNPs. [13,16,17] It can effectively select the few important variables out from a large number of irrelevant ones (noise), and be used when the number of variables is much larger than the number of observations. Recent advances such as Random Jungle (RJ) [18] have made it possible to construct large RFs from genome-wide data.…”
Section: Introduction (mentioning)
confidence: 99%
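
The idea in the statements above, that a random forest importance measure can pick a few relevant SNPs out of many noise SNPs, can be sketched in plain Python. This is a toy illustration under stated assumptions, not the method of the cited or citing papers: the statements do not say which importance measure was used, so out-of-bag permutation importance is assumed here, and the simulated data, the disease model (SNP 0 is the only causal SNP), the tree depth, and all function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels):
    return 1 if 2 * sum(labels) > len(labels) else 0


def grow(X, y, rows, mtry, rng, depth=3):
    """Grow one tree; each node tries only `mtry` randomly chosen SNPs."""
    labels = [y[i] for i in rows]
    if depth == 0 or len(set(labels)) < 2:
        return majority(labels)
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        for t in (0, 1):  # genotypes are coded 0/1/2, so two cut points
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return majority(labels)
    _, j, t, left, right = best
    return (j, t, grow(X, y, left, mtry, rng, depth - 1),
            grow(X, y, right, mtry, rng, depth - 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if x[j] <= t else high
    return tree


def permutation_importance(X, y, n_trees=50, mtry=3, seed=0):
    """Mean drop in out-of-bag accuracy when one SNP's genotypes are shuffled."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        oob = sorted(set(range(n)) - set(boot))      # subjects left out
        if not oob:
            continue
        tree = grow(X, y, boot, mtry, rng)
        base = sum(predict(tree, X[i]) == y[i] for i in oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)
            hits = 0
            for i, g in zip(oob, shuffled):
                x = list(X[i])
                x[j] = g
                hits += predict(tree, x) == y[i]
            importance[j] += (base - hits) / len(oob) / n_trees
    return importance


def simulate(n=200, p=10, seed=1):
    """Hypothetical data: carriers of SNP 0's minor allele are usually cases."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    y = [int(rng.random() < (0.9 if row[0] >= 1 else 0.1)) for row in X]
    return X, y


X, y = simulate()
scores = permutation_importance(X, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
print("SNPs ranked by importance:", ranking)
```

Shuffling a SNP that the trees rely on degrades out-of-bag accuracy, while shuffling a noise SNP changes little, so the ranking separates the two without any interaction terms being specified in advance. For real data, an established random forest implementation would be the practical choice over this toy version.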
“…[9][10][11] Overall, random forests is among the best approaches for analyzing survival time using gene expression data. [12][13][14] In this article, we introduce one of the first methods to correlate SNP with survival outcomes.…”
Section: Introduction (mentioning)
confidence: 99%