Damian Gola scite author profile

In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated by two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.

A roadmap to multifactor dimensionality reduction methods

Gola, John, Steen, et al. 2015. Brief Bioinform

Complex diseases are defined to be determined by multiple genetic and environmental factors alone as well as in interactions. To analyze interactions in genetic data, many statistical methods have been suggested, with most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. Most recent approaches offer to deal with large-scale data sets and rare variants, which is why we expect these methods to even gain in popularity.
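The abstract does not spell out the MDR procedure itself. As a rough sketch of the core idea as it is commonly described (pool multilocus genotype cells into "high risk" and "low risk" by comparing each cell's case:control ratio with the overall ratio), the following Python example may help. The function names and the toy data are hypothetical and are not taken from the paper or from any MDR implementation.

```python
from collections import Counter


def mdr_high_risk_cells(genotypes, status):
    """Return the set of multilocus genotype cells labelled high risk.

    genotypes: list of tuples, one genotype combination per individual.
    status:    list of 0/1 labels (1 = case, 0 = control).
    """
    cases, controls = Counter(), Counter()
    for cell, y in zip(genotypes, status):
        (cases if y == 1 else controls)[cell] += 1

    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    # Assumption: the threshold is the overall case:control ratio.
    threshold = n_cases / n_controls
    # Cross-multiplied comparison avoids dividing by zero in empty cells.
    return {
        cell
        for cell in set(cases) | set(controls)
        if cases[cell] > threshold * controls[cell]
    }


def mdr_accuracy(genotypes, status, high_risk):
    """Classify individuals in high-risk cells as cases and score accuracy."""
    predictions = [1 if cell in high_risk else 0 for cell in genotypes]
    return sum(p == y for p, y in zip(predictions, status)) / len(status)


# Toy data with a pure interaction: risk only when the two loci differ.
toy_genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
toy_status = [0, 0, 1, 1, 1, 1, 0, 0]
cells = mdr_high_risk_cells(toy_genotypes, toy_status)
print(sorted(cells), mdr_accuracy(toy_genotypes, toy_status, cells))
```

The sketch covers a single pair of loci only. Searching over locus combinations and cross-validation, which published MDR variants add on top, are left out.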

Polygenic risk scores outperform machine learning methods in predicting coronary artery disease status

Gola, Erdmann, Müller‐Myhsok, et al. 2020. Genetic Epidemiology

Coronary artery disease (CAD) is the leading global cause of mortality and has substantial heritability with a polygenic architecture. Recent approaches of risk prediction were based on polygenic risk scores (PRS) not taking possible nonlinear effects into account and restricted in that they focused on genetic loci associated with CAD, only. We benchmarked PRS, (penalized) logistic regression, naïve Bayes (NB), random forests (RF), support vector machines (SVM), and gradient boosting (GB) on a data set of 7,736 CAD cases and 6,774 controls from Germany to identify the algorithms for most accurate classification of CAD status. The final models were tested on an independent data set from Germany (527 CAD cases and 473 controls). We found PRS to be the best algorithm, yielding an area under the receiver operating characteristic curve (AUC) of 0.92 (95% CI [0.90, 0.95], 50,633 loci) in the German test data. NB and SVM (AUC ~ 0.81) performed better than RF and GB (AUC ~ 0.75). We conclude that using PRS to predict CAD is superior to machine learning methods.
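To make the two quantities in this abstract concrete, here is a minimal Python sketch. It assumes the usual definition of a PRS as a weighted sum of risk-allele dosages and computes the AUC as the share of case/control pairs in which the case scores higher. The weights, dosages and function names are hypothetical; the paper's actual pipeline is not described in the abstract.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2 per locus)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties count as half a correct pair, which matches the
    Mann-Whitney formulation of the AUC.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical per-locus weights and four individuals with three loci each.
weights = [0.2, 0.1, 0.3]
individuals = [[2, 1, 2], [1, 2, 1], [0, 1, 0], [1, 0, 0]]
labels = [1, 1, 0, 0]  # 1 = CAD case, 0 = control
scores = [polygenic_risk_score(d, weights) for d in individuals]
print(scores, auc(scores, labels))
```

The pairwise loop is quadratic in the sample size and is meant for illustration only; a rank-based implementation would be preferable for data sets of the size reported above.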

Population Bias in Polygenic Risk Prediction Models for Coronary Artery Disease

Gola, Erdmann, Läll, et al. 2020. Circ: Genomic and Precision Medicine

Background - Individual risk prediction based on genome-wide polygenic risk scores (PRS) utilizing millions of genetic variants has attracted much attention. It is under debate whether PRS models can be applied - without loss of precision - to populations of similar ethnic but different geographical background than the one the scores were trained on. Here, we examine how PRS trained in population-specific but European data sets perform in other European subpopulations in distinguishing between coronary artery disease patients and healthy individuals. Methods - We use data from UK and Estonian biobanks (UKB, EB) as well as case-control data from the German population (DE) to develop and evaluate PRS in the same and different populations. Results - PRS have the highest performance in their corresponding population testing data sets, whereas their performance significantly drops if applied to testing data sets from different European populations. Models trained on DE data revealed AUCs in independent testing sets in DE: 0.6752, EB: 0.6156, and UKB: 0.5989; trained on EB and tested on EB: 0.6565, DE: 0.5407, and UKB: 0.6043; trained on UKB and tested on UKB: 0.6133, DE: 0.5143 and EB: 0.6049. Conclusions - This result has direct impact on the clinical usability of PRS for risk prediction models utilizing PRS: a population effect must be kept in mind when applying risk estimation models which are based on additional genetic information even for individuals from different European populations of the same ethnicity.
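The AUC values reported in this abstract form a small training-by-testing matrix. As an illustration only, the Python sketch below arranges the stated numbers in that form and computes how far each cross-population AUC falls below the within-population AUC. The variable and function names are hypothetical; the values are copied from the abstract.

```python
# AUCs from the abstract: outer key = training population,
# inner key = testing population.
auc_by_train_test = {
    "DE": {"DE": 0.6752, "EB": 0.6156, "UKB": 0.5989},
    "EB": {"DE": 0.5407, "EB": 0.6565, "UKB": 0.6043},
    "UKB": {"DE": 0.5143, "EB": 0.6049, "UKB": 0.6133},
}


def auc_drops(table):
    """Within-population AUC minus cross-population AUC for each pair."""
    drops = {}
    for train, row in table.items():
        within = row[train]
        for test, value in row.items():
            if test != train:
                drops[(train, test)] = round(within - value, 4)
    return drops


drops = auc_drops(auc_by_train_test)
for (train, test), drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"trained on {train}, tested on {test}: AUC lower by {drop:.4f}")
```

Every difference is positive, which is the pattern the Results sentence describes. Whether each individual drop is statistically significant cannot be read from these point estimates alone.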

Incidence of pemphigoid diseases in Northern Germany in 2016 – first data from the Schleswig‐Holstein Registry of Autoimmune Bullous Diseases

Beek, Weidinger, Schneider, et al. 2021. Acad Dermatol Venereol

Background: Autoimmune bullous diseases (AIBD) are rare disorders characterized by autoantibody formation against components of adhesion molecules; in pemphigoid diseases (PD), these are proteins of hemidesmosomes and basement membrane, important for cell‐matrix adhesion in skin and/or mucous membranes. Incidences of these diseases vary considerably between different populations. Objectives: To establish a registry prospectively recruiting all AIBD patients in a geographically well‐defined region in Northern Germany (Schleswig‐Holstein). Methods: Only patients with verified disease (by clinical presentation, histology, direct and/or indirect immunofluorescence and/or ELISA) living in Schleswig‐Holstein were included. Incidences of PD were estimated based on the total number of inhabitants in Schleswig‐Holstein, stratified by birth year and sex. Results: Of 67 patients with PD [35 male, 32 female, mean age 75 (standard deviation 14.3 years)], 83% were patients with bullous pemphigoid [n = 56, 28 male, 28 female, mean age 78 (SD 9.9)]. The resulting crude incidences were 23.4 patients/million/year for all pemphigoid patients, 19.6 patients/million/year for bullous pemphigoid (age‐standardized 16.9 patients/million/year) with a strong increase in bullous pemphigoid patients in the age group of 85–90 years with 262 patients/million/year. Incidences for bullous pemphigoid were higher in urban compared to rural areas. Other PD (mucous membrane pemphigoid, linear IgA disease, anti‐p200 pemphigoid) were less frequent with crude incidences of 2.1, 1.0 and 0.7 patients/million/year, respectively. Conclusions: This study prospectively analyses the incidence of PD in a carefully defined geographical area. The highest incidence among PD patients was found for bullous pemphigoid. The incidence of bullous pemphigoid is considerably increased compared to previous reports and reveals regional differences. Further studies are needed in order to clarify these findings.
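The crude incidences above follow the standard form "cases divided by population at risk, per million, per year". The abstract does not state the population figure it used, so the Python sketch below only back-calculates the population size implied by the reported counts and rates. The function names are hypothetical, and the implied population is a derived value, not a number from the paper.

```python
def crude_incidence_per_million(cases, population, years=1.0):
    """Crude incidence in patients per million inhabitants per year."""
    return cases / population / years * 1_000_000


def implied_population(cases, incidence_per_million, years=1.0):
    """Population size that would produce the given crude incidence."""
    return cases / (incidence_per_million * years) * 1_000_000


# Counts and rates as reported in the abstract (one year of recruitment).
all_pd = implied_population(67, 23.4)   # all pemphigoid diseases
bullous = implied_population(56, 19.6)  # bullous pemphigoid only
print(f"implied population: {all_pd:,.0f} and {bullous:,.0f}")
```

Both calculations give roughly 2.86 million inhabitants, so the two reported crude rates are consistent with a single denominator. The age-standardized rate cannot be reproduced this way, because it requires the age distribution of the population.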

Hepatitis C infection in an Italian population not selected for risk factors

Maggi, Armitano, Brambilla, et al. 1999. Liver

Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform

Wright, Gola, Ziegler. 2017

The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates, and contaminations, alignment to a reference genome and the postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, emphasizing the great importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets are provided for illustrating all necessary steps, along with a brief description of the tools and underlying methods.
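As a toy illustration of two of the preprocessing steps named in this abstract (adapter removal and duplicate removal), here is a short Python sketch on plain read strings. It is not the chapter's workflow: real pipelines use dedicated tools, and duplicates are normally identified from alignment positions rather than from identical sequences. The adapter string, the minimum length and all function names are hypothetical.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first full occurrence of the adapter sequence."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]


def remove_duplicates(reads):
    """Keep the first copy of each sequence, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def preprocess(reads, adapter, min_length=5):
    """Trim adapters, drop reads that became too short, then deduplicate."""
    trimmed = [trim_adapter(read, adapter) for read in reads]
    long_enough = [read for read in trimmed if len(read) >= min_length]
    return remove_duplicates(long_enough)


# Hypothetical reads that all end in the same made-up adapter sequence.
raw_reads = ["ACGTACGTAGATCGG", "ACGTACGTAGATCGG", "TTGCAAGATCGG", "GGAGATCGG"]
print(preprocess(raw_reads, adapter="AGATCGG"))
```

Alignment to a reference genome, contamination screening and post-alignment processing, which the abstract also lists, are beyond what a few lines of standard-library Python can show.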

Identification of two novel bullous pemphigoid-associated alleles, HLA-DQA1*05:05 and -DRB1*07:01, in Germans

Schwarm, Gola, Holtsche, et al. 2021. Orphanet J Rare Dis

Bullous pemphigoid (BP) is the most common autoimmune skin blistering disease characterized by autoimmunity against the hemidesmosomal proteins BP180 (type XVII collagen) and BP230. To elucidate the genetic basis of susceptibility to BP, we performed the first genome-wide association study (GWAS) in Germans. This GWAS was combined with HLA locus targeted sequencing in an additional independent BP cohort. The strongest association with BP in Germans tested in this study was observed for the two HLA alleles HLA-DQA1*05:05 and HLA-DRB1*07:01. Further studies with increased sample sizes and complex studies integrating multiple pathogenic drivers will be conducted.

Damian Gola

Machine learning and data mining in complex genomic data—a review on the lessons learned in Genetic Analysis Workshop 19