There has been great interest, and a few successes, in identifying complex disease susceptibility genes in recent years. Association studies, in which a large number of single-nucleotide polymorphisms (SNPs) are typed in a sample of cases and controls to determine which genes are associated with a specific disease, provide a powerful approach for complex disease gene mapping. Genes of interest in those studies may contain large numbers of SNPs that classical statistical methods cannot handle simultaneously without requiring prohibitively large sample sizes. By contrast, high-dimensional nonparametric methods thrive on large numbers of predictors. This work explores the application of one such method, random forests, to the problem of identifying SNPs predictive of the phenotype in the case-control study design. A random forest is a collection of classification trees grown on bootstrap samples of observations, using a random subset of predictors to define the best split at each node. The observations left out of each bootstrap sample are used to estimate prediction error. The importance of a predictor is quantified by the increase in misclassification that occurs when the values of the predictor are randomly permuted. We extend the concept of importance to pairs of predictors, to capture joint effects, and we explore the behavior of importance measures over a range of two-locus disease models in the presence of a varying number of SNPs unassociated with the phenotype. We illustrate the application of random forests with a data set of asthma cases and unaffected controls genotyped at 42 SNPs in ADAM33, a previously identified asthma susceptibility gene. SNPs and SNP pairs highly associated with asthma tend to have the highest importance index values, but predictive importance and association do not always coincide.
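The procedure described in this abstract (bootstrap samples, a random subset of predictors at each node, out-of-bag error, and permutation importance for single predictors and pairs) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`grow_tree`, `grow_forest`, `importance`), the simulated genotypes, the two-locus model, the Gini split criterion, the depth limit, and the choice to permute both SNPs of a pair with the same shuffle are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, rows, mtry, rng, depth=0, max_depth=4):
    """Grow a classification tree; a leaf is a label, a node is a tuple."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    # Only a random subset of mtry predictors is considered at each node.
    for j in rng.sample(range(len(X[0])), mtry):
        for g in (0, 1):  # candidate split: genotype <= g versus > g
            left = [i for i in rows if X[i][j] <= g]
            right = [i for i in rows if X[i][j] > g]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, g, left, right)
    if best is None:
        return majority
    _, j, g, left, right = best
    return (j, g,
            grow_tree(X, y, left, mtry, rng, depth + 1, max_depth),
            grow_tree(X, y, right, mtry, rng, depth + 1, max_depth))

def predict(tree, x):
    """Follow splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, g, low, high = tree
        tree = low if x[j] <= g else high
    return tree

def grow_forest(X, y, n_trees, mtry, seed=0):
    """Each tree is grown on a bootstrap sample; out-of-bag rows are kept."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        forest.append((grow_tree(X, y, boot, mtry, rng), oob))
    return forest

def importance(forest, X, y, cols, seed=1):
    """Mean increase in out-of-bag misclassification rate after permuting
    the predictors in cols (one column, or a pair shuffled together)."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for tree, oob in forest:
        if not oob:
            continue
        base = sum(predict(tree, X[i]) != y[i] for i in oob)
        donors = oob[:]
        rng.shuffle(donors)
        permuted = 0
        for i, d in zip(oob, donors):
            x = list(X[i])
            for j in cols:
                x[j] = X[d][j]  # take the value from a shuffled OOB row
            permuted += predict(tree, x) != y[i]
        total += (permuted - base) / len(oob)
        used += 1
    return total / used if used else 0.0

# Simulated data (hypothetical): 6 SNPs coded 0/1/2; the phenotype depends
# on carrying a minor allele at both SNP 0 and SNP 1, with 5% label noise.
rng = random.Random(42)
n, p = 300, 6
X = [[rng.choices((0, 1, 2), weights=(49, 42, 9))[0] for _ in range(p)]
     for _ in range(n)]
y = []
for row in X:
    label = int(row[0] > 0 and row[1] > 0)
    if rng.random() < 0.05:
        label = 1 - label
    y.append(label)

forest = grow_forest(X, y, n_trees=50, mtry=3)
single = {j: importance(forest, X, y, [j]) for j in range(p)}
pair01 = importance(forest, X, y, [0, 1])
print("single-SNP importance:", single)
print("pair (0, 1) importance:", pair01)
```

In this simulated setting the two causal SNPs should stand out from the four unassociated ones, which mirrors the abstract's observation that associated SNPs tend to rank highest. The pair measure here is only one plausible way to extend importance to joint effects.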
An autosomal recessive syndrome of nonprogressive cerebellar ataxia and mental retardation is associated with inferior cerebellar hypoplasia and mild cerebral gyral simplification in the Hutterite population. An identity-by-descent mapping approach using eight patients from three interrelated Hutterite families localized the gene for this syndrome to chromosome region 9p24. Haplotype analysis identified familial and ancestral recombination events and refined the minimal region to a 2-Mb interval between markers D9S129 and D9S1871. A 199-kb homozygous deletion encompassing the entire very low density lipoprotein receptor (VLDLR) gene was present in all affected individuals. VLDLR is part of the reelin signaling pathway, which guides neuroblast migration in the cerebral cortex and cerebellum. To our knowledge, this syndrome represents the first human lipoprotein receptor malformation syndrome and the second human disease associated with a reelin pathway defect.