In the analysis of current genomic data, the application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated by two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this structure and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to deal efficiently with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in their application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Complex diseases are, by definition, determined by multiple genetic and environmental factors, acting alone as well as in interaction. To analyze interactions in genetic data, many statistical methods have been suggested, most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods has emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. The most recent approaches can deal with large-scale data sets and rare variants, which is why we expect these methods to gain even more in popularity.
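As an illustration of the core idea behind MDR, the following minimal Python sketch pools multilocus genotype cells into a high-risk and a low-risk group and then picks the SNP combination that classifies best. It is a simplified sketch, not the method of any specific MDR variant: the function names and the toy data are hypothetical, and it scores models by training accuracy, whereas published MDR implementations typically add cross-validation and permutation testing.

```python
from collections import defaultdict
from itertools import combinations


def mdr_high_risk_cells(genotypes, status, snps):
    """Map each multilocus genotype cell to True (high risk) or False (low risk).

    A cell is high risk if its case:control ratio exceeds the overall
    case:control ratio in the data. Assumes at least one control.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, y in zip(genotypes, status):
        cell = tuple(row[j] for j in snps)
        counts[cell][y] += 1
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    threshold = n_cases / n_controls
    # Compare cases > threshold * controls to avoid dividing by zero controls.
    return {cell: cases > threshold * controls
            for cell, (controls, cases) in counts.items()}


def mdr_accuracy(genotypes, status, snps):
    """Training accuracy of the one-dimensional high/low-risk classifier."""
    high_risk = mdr_high_risk_cells(genotypes, status, snps)
    correct = sum(
        high_risk[tuple(row[j] for j in snps)] == bool(y)
        for row, y in zip(genotypes, status)
    )
    return correct / len(status)


def best_mdr_model(genotypes, status, order=2):
    """Exhaustively search all SNP combinations of the given order."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), order),
               key=lambda snps: mdr_accuracy(genotypes, status, snps))


# Hypothetical toy data: status depends on an XOR-like interaction
# between SNP 0 and SNP 1; SNP 2 is noise.
genotypes = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1),
             (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
status = [0, 1, 1, 0, 0, 1, 1, 0]  # 1 = case, 0 = control

best = best_mdr_model(genotypes, status, order=2)
print(best, mdr_accuracy(genotypes, status, best))
```

In this toy example neither SNP 0 nor SNP 1 is informative on its own, which is the kind of pure interaction effect that regression models without interaction terms would miss.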
Coronary artery disease (CAD) is the leading global cause of mortality and has substantial heritability with a polygenic architecture. Recent approaches to risk prediction were based on polygenic risk scores (PRS), which do not take possible nonlinear effects into account and are restricted in that they focus only on genetic loci associated with CAD. We benchmarked PRS, (penalized) logistic regression, naïve Bayes (NB), random forests (RF), support vector machines (SVM), and gradient boosting (GB) on a data set of 7,736 CAD cases and 6,774 controls from Germany to identify the algorithms that classify CAD status most accurately. The final models were tested on an independent data set from Germany (527 CAD cases and 473 controls). We found PRS to be the best algorithm, yielding an area under the receiver operating characteristic curve (AUC) of 0.92 (95% CI [0.90, 0.95], 50,633 loci) in the German test data. NB and SVM (AUC ~ 0.81) performed better than RF and GB (AUC ~ 0.75). We conclude that using PRS to predict CAD is superior to machine learning methods.
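For readers unfamiliar with the two quantities used above, the following Python sketch shows a PRS as a weighted sum of allele dosages and the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. It is a generic illustration under stated assumptions, not the pipeline used in the study: the function names, weights, and dosages are hypothetical, and real PRS weights would come from external association results.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 per locus)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-locus weights (e.g. log odds ratios) and dosages.
weights = [0.2, -0.1, 0.4]
cases = [(2, 0, 1), (1, 1, 2)]
controls = [(0, 2, 0), (1, 0, 1)]

case_scores = [polygenic_risk_score(d, weights) for d in cases]
control_scores = [polygenic_risk_score(d, weights) for d in controls]
print(auc(case_scores, control_scores))
```

The pairwise AUC is quadratic in sample size; for data sets of the size described here a rank-based computation would be the more practical choice.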
Background - Individual risk prediction based on genome-wide polygenic risk scores (PRS) utilizing millions of genetic variants has attracted much attention. It is under debate whether PRS models can be applied - without loss of precision - to populations of similar ethnic but different geographical background from the one the scores were trained on. Here, we examine how PRS trained in population-specific but European data sets perform in other European subpopulations in distinguishing between coronary artery disease patients and healthy individuals. Methods - We use data from UK and Estonian biobanks (UKB, EB) as well as case-control data from the German population (DE) to develop and evaluate PRS in the same and different populations. Results - PRS have the highest performance in their corresponding population testing data sets, whereas their performance drops significantly if applied to testing data sets from different European populations. Models trained on DE data yielded AUCs in independent testing sets of DE: 0.6752, EB: 0.6156, and UKB: 0.5989; trained on EB and tested on EB: 0.6565, DE: 0.5407, and UKB: 0.6043; trained on UKB and tested on UKB: 0.6133, DE: 0.5143 and EB: 0.6049. Conclusions - This result has direct impact on the clinical usability of risk prediction models utilizing PRS: a population effect must be kept in mind when applying risk estimation models which are based on additional genetic information, even for individuals from different European populations of the same ethnicity.
Background Autoimmune bullous diseases (AIBD) are rare disorders characterized by autoantibody formation against components of adhesion molecules; in pemphigoid diseases (PD), these are proteins of hemidesmosomes and basement membrane, important for cell‐matrix adhesion in skin and/or mucous membranes. Incidences of these diseases vary considerably between different populations. Objectives To establish a registry prospectively recruiting all AIBD patients in a geographically well‐defined region in Northern Germany (Schleswig‐Holstein). Methods Only patients with verified disease (by clinical presentation, histology, direct and/or indirect immunofluorescence and/or ELISA) living in Schleswig‐Holstein were included. Incidences of PD were estimated based on the total number of inhabitants in Schleswig‐Holstein, stratified by birth year and sex. Results Of 67 patients with PD [35 male, 32 female, mean age 75 years (standard deviation 14.3)], 83% were patients with bullous pemphigoid [n = 56, 28 male, 28 female, mean age 78 years (SD 9.9)]. The resulting crude incidences were 23.4 patients/million/year for all pemphigoid patients and 19.6 patients/million/year for bullous pemphigoid (age‐standardized 16.9 patients/million/year), with a strong increase in bullous pemphigoid patients in the age group of 85–90 years with 262 patients/million/year. Incidences for bullous pemphigoid were higher in urban compared to rural areas. Other PD (mucous membrane pemphigoid, linear IgA disease, anti‐p200 pemphigoid) were less frequent with crude incidences of 2.1, 1.0 and 0.7 patients/million/year, respectively. Conclusions This study prospectively analyses the incidence of PD in a carefully defined geographical area. The highest incidence among PD patients was found for bullous pemphigoid. The incidence of bullous pemphigoid is considerably increased compared to previous reports and reveals regional differences. Further studies are needed in order to clarify these findings.
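The crude incidences reported above follow the usual form of cases divided by person-time. The Python sketch below shows that arithmetic. Note that the population size and the one-year observation period are assumptions made for illustration only; neither is stated in the abstract, and the function name is hypothetical.

```python
def crude_incidence_per_million(cases, population, years=1.0):
    """Crude incidence as new cases per million inhabitants per year."""
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return cases / (population * years) * 1_000_000


# ASSUMPTION: roughly 2.86 million inhabitants and one year of recruitment.
# These values are not given in the abstract and are used only as an example.
ASSUMED_POPULATION = 2_860_000

all_pd = crude_incidence_per_million(67, ASSUMED_POPULATION)  # all PD patients
bp = crude_incidence_per_million(56, ASSUMED_POPULATION)      # bullous pemphigoid
print(round(all_pd, 1), round(bp, 1))
```

Under these assumed inputs the results are close to the reported crude rates. Age-standardized rates would additionally require age-specific case counts and a reference population, which the abstract does not provide.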
The prevalence of HCVAb in this general population was 4.8%. About 3% were HCVRNA positive, and of these, genotype 2a/2c was present in 81.6%.
The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates, and contaminants; alignment to a reference genome; and postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, which underlines the importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets are provided to illustrate all necessary steps, along with a brief description of the tools and underlying methods.
Bullous pemphigoid (BP) is the most common autoimmune skin blistering disease and is characterized by autoimmunity against the hemidesmosomal proteins BP180 (type XVII collagen) and BP230. To elucidate the genetic basis of susceptibility to BP, we performed the first genome-wide association study (GWAS) in Germans. This GWAS was combined with HLA locus targeted sequencing in an additional independent BP cohort. The strongest association with BP in Germans tested in this study was observed in the two HLA loci, HLA-DQA1*05:05 and HLA-DRB1*07:01. Further studies with increased sample sizes and complex studies integrating multiple pathogenic drivers will be conducted.