Data mining, neural nets, trees — Problems 2 and 3 of Genetic Analysis Workshop 15

Ziegler, Andreas; DeStefano, Anita L.; König, Inke R.; Bardel, Claire; Brinza, Dumitru; Bull, Shelley B.; Cai, Zhaohui; Glaser, Beate; Jiang, Wei; Lee, Kristine E.; Li, Chuang Xing; Li, Jing; Li, Xin; Majoram, Paul; Meng, Yan; Nicodemus, Kristin K.; Platt, Alexander; Schwarz, Dániel; Shi, Weilang; Shugart, Yin Yao; Stassen, Hans H.; Sun, Yan V.; Won, Sungho; Wang, Wenyi; Wahba, Grace; Zagaar, Usumah A; Zhao, Zhenming

doi:10.1002/gepi.20280

Cited by 28 publications (31 citation statements)

References 39 publications (66 reference statements)

“…Brinza et al [personal communication] applied a complimentary greedy search algorithm, a machine learning approach developed by his group [Brinza and Zelikovsky, 2006], to identify SNPs associated with RA, and also to predict susceptibility of the tested genotype to RA subjects using the Problem 2 data set. Their paper is discussed in more detail in the Group 6 summary paper [Ziegler et al, 2007]. Qin et al [2007] developed a graphical display tool called SIMLA-PLOT for visualizing different ways in which continuous covariates may influence the genotype-specific risk for complex human diseases.…”

Section: Single SNP and Haplotype Analyses (mentioning)

confidence: 99%

Summary of contributions to GAW15 Group 13: candidate gene association studies. Andrade, Allen. 2007. Genet. Epidemiol.

Here we summarize the contributions to Group 13 of the Genetic Analysis Workshop 15 held in St. Pete Beach, Florida, on November 12-14, 2006. The focus of this group was to identify candidate genes associated with rheumatoid arthritis or surrogate outcomes. The association methods proposed in this group were diverse, from better known approaches, such as logistic regression for single nucleotide polymorphism (SNP) analysis and haplotype sharing tests, to methods less familiar to genetic epidemiologists, such as machine learning and visualization methods. The majority of papers analyzed Genetic Analysis Workshop 15 Problems 2 (rheumatoid arthritis data) and 3 (simulated data). The highlighted points of this group's analyses were: (1) haplotype-based statistics can be more powerful than single SNP analysis for risk-locus localization; (2) considering linkage disequilibrium block structure in haplotype analysis may reduce the likelihood of false-positive results; and (3) visual representation of genetic models for continuous covariates may help identify SNPs associated with the underlying quantitative trait loci.

“…Applications of random forests in medical research have mostly focused on the classification of genetic data (e.g., Schwarz et al, 2007; Schwender et al, 2004; for an overview, see Ziegler et al (2007)). As the name implies, the basic units of this method are trees, and it utilises a combination of manipulating the training cases together with introducing an additional element of randomness.…”

Section: Random Forests (mentioning)

confidence: 99%
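
The two sources of randomness named in the statement above (resampling the training cases and restricting each split to a random subset of features) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited paper: each "tree" is reduced to a single split (a stump), genotypes are assumed to be coded 0/1/2, the toy data are invented, and the names `fit_stump`, `fit_forest` and `predict` are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Best single split among the candidate features.

    Returns (feature, threshold, label_if_leq, label_if_greater).
    """
    best = None
    for j in features:
        # Every observed value except the largest is a usable threshold.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority class
        m = majority(y)
        return (features[0], float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_candidates=2, seed=0):
    """Grow stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap the cases
        feats = rng.sample(range(p), n_candidates)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[j] <= t else right for j, t, left, right in forest]
    return majority(votes)


# Invented toy data: columns 0 and 2 carry the signal, column 1 is noise.
X = [[0, n, 0] for n in (0, 1, 2, 0, 1, 2)] + \
    [[g, n, g] for g, n in zip((1, 2, 1, 2, 1, 2), (0, 1, 2, 0, 1, 2))]
y = [0] * 6 + [1] * 6
forest = fit_forest(X, y, n_trees=25, n_candidates=2, seed=1)
print(predict(forest, [0, 1, 0]), predict(forest, [2, 1, 2]))
```

A full random forest grows deep trees and redraws the feature subset at every node; the stump version keeps only the resampling and the random feature choice so that both remain visible.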

“…The second importance measure is a generalisation of the Gini index from a single tree to a forest. The basic idea of this importance measure is to contrast the impurity of a tree with and without the feature of interest being included in the tree; for details, see Ziegler et al (2007). If the estimated importance of all features can be assumed to be independent from tree to tree, a standard error of the importance can be computed in a usual way so that asymptotic confidence intervals assuming normality can be calculated (Lin et al, 2004).…”

Section: Random Forests (mentioning)

confidence: 99%
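
The importance measure and the interval described in the statement above can be illustrated as follows. This sketch assumes one common formulation, in which a feature's importance in a tree is the summed Gini decrease over the splits that use it; the per-tree values passed to `importance_ci` are invented, and the function names are hypothetical. The interval is only meaningful under the independence assumption the statement itself makes.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of a node minus the size-weighted impurity of its children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)


def importance_ci(per_tree, z=1.96):
    """Mean importance over trees with a normal-approximation interval.

    Assumes the per-tree importance values are independent.
    """
    b = len(per_tree)
    mean = sum(per_tree) / b
    var = sum((v - mean) ** 2 for v in per_tree) / (b - 1)  # sample variance
    se = math.sqrt(var / b)                                  # standard error
    return mean, (mean - z * se, mean + z * se)


# A split that separates the two classes perfectly removes all impurity.
print(gini_decrease([0, 0, 1, 1], [0, 0], [1, 1]))

# Invented per-tree importance values for one feature.
print(importance_ci([0.1, 0.2, 0.3]))
```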

“…For example, one might use random forest on a full set of features to identify the most important variables and then send this small list to a logistic regression model which then lends itself readily to interpretability (Schwarz et al, 2007). In detail, the following strategy seems to be reasonable (see, e.g., Ziegler et al, 2007):…”

Section: Clinical Interpretability (mentioning)

confidence: 99%
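
The screen-then-model idea in the first sentence of the statement above can be sketched as two stages. The quoted strategy list is truncated in the source and is not reconstructed here. As a labeled stand-in for forest importance, stage 1 ranks features by a bootstrap-averaged single-split Gini decrease; stage 2 fits a plain logistic regression by gradient ascent. The data, the 0/1/2 coding and all function names are invented for illustration.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_decrease(col, y):
    """Largest Gini decrease achievable by one threshold split on a feature."""
    best = 0.0
    for t in sorted(set(col))[:-1]:
        left = [yi for v, yi in zip(col, y) if v <= t]
        right = [yi for v, yi in zip(col, y) if v > t]
        children = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        best = max(best, gini(y) - children)
    return best


def screen(X, y, k, n_boot=50, seed=0):
    """Stage 1: keep the k features with the highest bootstrap-averaged score.

    Stand-in for random forest importance, not the real thing.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    score = [0.0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        for j in range(p):
            score[j] += best_decrease([X[i][j] for i in idx], yb) / n_boot
    return sorted(range(p), key=lambda j: -score[j])[:k]


def fit_logistic(X, y, cols, lr=0.1, epochs=2000):
    """Stage 2: logistic regression on the selected columns.

    Returns [intercept, coefficient per selected column].
    """
    w = [0.0] * (len(cols) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, yi in zip(X, y):
            x = [1.0] + [float(row[j]) for j in cols]
            prob = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for m, xm in enumerate(x):
                grad[m] += (yi - prob) * xm
        w = [wi + lr * g / len(y) for wi, g in zip(w, grad)]
    return w


# Invented data: column 0 is associated with case status, columns 1-2 are noise.
controls = [[0, 0, 2], [0, 1, 1], [0, 2, 0], [0, 0, 0], [1, 1, 1], [0, 2, 2]]
cases = [[1, 0, 2], [2, 1, 1], [1, 2, 0], [2, 0, 0], [2, 1, 1], [0, 2, 2]]
X, y = controls + cases, [0] * 6 + [1] * 6

selected = screen(X, y, k=1)
weights = fit_logistic(X, y, selected)
print(selected, [math.exp(b) for b in weights[1:]])  # kept columns, odds ratios
```

The exponentiated coefficients from the second stage are per-allele odds ratios, which is what makes the reduced model readable. Selecting features and estimating effects on the same data tends to overstate those effects, so a real analysis would separate the two stages, for example by sample splitting.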

Patient-centered yes/no prognosis using learning machines. König, Malley, Pajevic, et al. 2008. IJDMB.

In the last 15 years several machine learning approaches have been developed for classification and regression. In an intuitive manner we introduce the main ideas of classification and regression trees, support vector machines, bagging, boosting and random forests. We discuss differences in the use of machine learning in the biomedical community and the computer sciences. We propose methods for comparing machines on a sound statistical basis. Data from the German Stroke Study Collaboration is used for illustration. We compare the results from learning machines to those obtained by a published logistic regression and discuss similarities and differences.

“…Although the approaches taken and goals proposed are very different, there are common themes in the approaches as well as a remarkable level of confirmation of some results. Although such important techniques as random forest [Breiman, 2001], boosting [Schapire, 1990] and ensemble approaches [Dietterich, 2000] [were] used extensively in the data mining analyses for the other problems [Ziegler et al, 2007], these techniques were not applied in this group. Table I summarizes the 13 papers, indicating common themes among many of them.…”

mentioning

confidence: 99%

Data mining of RNA expression and DNA genotype data: Presentation Group 5 contributions to Genetic Analysis Workshop 15

et al. 2007

The complexity of data available in human genetics continues to grow at an explosive rate. With that growth, the challenges to understanding the meaning of the underlying information also grow. A currently popular approach to dissecting such information falls under the broad category of data mining. This can apply to any approach that tries to extract relevant information from large amounts of data, but often refers to methods that deal, in a non-linear fashion, with very large numbers of variables that cannot be simultaneously handled by more conventional statistical methods. To explore the usefulness of some of these approaches, 13 groups applied a variety of strategies to the first dataset provided to GAW 15 participants. With the extensive microarray and SNP data provided for 14 CEPH families, these groups explored multistage analyses, machine learning methods, network construction, and other techniques to try to answer questions about gene-gene interaction, functional similarities, co-regulated gene expression and the mapping of gene expression determinants, among others. In general, the methods offered strategies to provide a better understanding of the complex pathways involved in gene expression and function. These are still "works in progress," often exploratory in nature, but they provide insights into ways in which the data might be interpreted. Despite the still preliminary nature of some of these methods and the diversity of the approaches, some common themes emerged. The collection of papers and methods offer a starting point for further exploration of complex interactions in human genetic data now readily available.
