The genetic analysis of complex traits does not escape the current excitement around artificial intelligence, including a renewed interest in "deep learning" (DL) techniques such as Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). However, the performance of DL for genomic prediction of complex human traits has not been comprehensively tested. To provide an evaluation of MLPs and CNNs, we used data from distantly related white Caucasian individuals (n ~100k individuals, m ~500k SNPs, and k = 1000) of the interim release of the UK Biobank. We analyzed a total of five phenotypes: height, bone heel mineral density, body mass index, systolic blood pressure, and waist-hip ratio, with genomic heritabilities ranging from 0.20 to 0.70. After hyperparameter optimization using a genetic algorithm, we considered several configurations, from shallow to deep learners, and compared the predictive performance of MLPs and CNNs with that of Bayesian linear regressions across sets of SNPs (from 10k to 50k) that were preselected using single-marker regression analyses. For height, a highly heritable phenotype, all methods performed similarly, although CNNs were slightly but consistently worse. For the rest of the phenotypes, the performance of some CNNs was comparable to or slightly better than that of linear methods. Performance of MLPs was highly dependent on SNP set and phenotype. In all, over the range of traits evaluated in this study, CNN performance was competitive with that of linear models, but we did not find any case where DL outperformed the linear model by a sizable margin. We suggest that more research is needed to adapt CNN methodology, originally motivated by image analysis, to genetic-based problems in order for CNNs to be competitive with linear models.
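The SNP preselection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that ranking SNPs by squared marginal correlation with the phenotype stands in for single-marker regression (for a fixed sample size, the two rankings agree). The function names, the toy data, and the choice of `top_k` are hypothetical and are not taken from the study.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def preselect_snps(genotypes, phenotype, top_k):
    """Rank SNP columns by squared marginal correlation; return the top_k column indices."""
    n_snps = len(genotypes[0])
    scores = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        scores.append((pearson(column, phenotype) ** 2, j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:top_k])

# Toy data (hypothetical): 200 individuals, 50 SNPs coded 0/1/2; SNPs 3 and 17 are causal.
rng = random.Random(42)
genotypes = [[rng.choice([0, 1, 2]) for _ in range(50)] for _ in range(200)]
phenotype = [2.0 * g[3] - 2.0 * g[17] + rng.gauss(0.0, 0.5) for g in genotypes]

selected = preselect_snps(genotypes, phenotype, top_k=2)
print(selected)
```

The preselected columns would then be passed to whichever predictor is being compared (linear or DL). In a real evaluation the ranking would normally be computed on training data only, so that the test set does not influence which SNPs are kept.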
The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
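As a rough illustration of an ensemble prediction, the sketch below takes the unweighted mean of the per-line predictions from several algorithms. The abstract does not say how the results were combined, so the averaging rule, the algorithm labels, and the numbers are all assumptions made for this example.

```python
from statistics import fmean

def ensemble_predict(predictions_by_algorithm):
    """Average per-line predictions across algorithms (unweighted mean, an assumption)."""
    columns = zip(*predictions_by_algorithm.values())
    return [fmean(col) for col in columns]

# Hypothetical predicted trait values for three breeding lines from three algorithms.
preds = {
    "linear_model": [10.0, 12.0, 9.0],
    "tree_boosting": [11.0, 11.0, 10.0],
    "neural_network": [12.0, 13.0, 8.0],
}

ensemble = ensemble_predict(preds)
print(ensemble)
```

Averaging tends to dampen the trait-to-trait variability of any single algorithm, which is one plausible reason an ensemble could perform consistently well.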
We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). The constructed predictors explain, respectively, ~40, 20, and 9% of total variance for the three traits, in data not used for training. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few centimeters of the prediction. The proportion of variance explained for height is comparable to the estimated common SNP heritability from genome-wide complex trait analysis (GCTA), and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier genome-wide association studies (GWAS) for out-of-sample validation of our results.
The ability to predict traits from genome-wide sequence information (Genomic Prediction, GP), has improved our understanding of the genetic basis of complex traits and ... for trait prediction. However, GP-based approaches that trained on the entire transcriptome data have not been used to better understand the genetic mechanisms for a trait. In addition, it is not known the degree to which transcriptomes obtained at a particular developmental stage can be informative for predicting phenotypes scored at a different stage. To address these questions, we used transcriptome data derived from maize whole seedling 22 to predict phenotypes (flowering time, height, and grain yield) at much later developmental stages. In addition to comparing prediction performance between genetic marker and transcriptome-based models, we also looked at whether transcripts and genetic marker features important for the prediction models were located in the same or adjacent regions. Finally, we determined how well our models were able to identify a benchmark set of flowering time genes to explore the potential of using GP to better understand the mechanistic basis of complex traits.
Results and Discussion
Relationships between transcript levels, kinship, and phenotypes among maize lines
Before using the transcriptome data for GP, we first assessed properties of the transcriptome data in three areas: (1) the quantity and distribution of transcript information across the genome, (2) the amount of variation in transcript levels, and (3) the similarity in the transcriptome profile between maize lines, with an emphasis on how these properties compared to those based on the genotype data. After filtering out 16,898 transcripts that did not map to the B73 reference genome or had zero or near zero variance across lines (see Methods), we had 31,238 transcripts. While the number of transcripts was <10% of the number of genetic markers used in this study (332,178), the distribution of transcripts along the genome was similar to the genetic marker distribution (Fig. S1). The log2-transformed median transcript level across lines ranged from 0 to 12.4 (median = 2.2) and the variance ranged from 3 × 10^-30 to 14.5 (median = 0.13), highlighting that a subset of transcripts had relatively high variation in transcript levels across maize lines at the seedling stage. To determine how similar transcript levels were between lines, we calculated the expression Correlation (eCor) between all pairs of lines using Pearson's Correlation Coefficient (PCC). The eCor values ranged from 0.84 to 0.99 (mean = 0.93). As expected, lines with similar transcriptome profiles were also genetically similar as there was a significant correlation between eCor values with values in the kinship matrix generated from the genetic marker data (Spearman's Rank ρ = 0.27, p < 2.2 × 10^-16; Fig. 1A). As a result, we were able to find clust...
In most crops, genetic and environmental factors interact in complex ways giving rise to substantial genotype-by-environment interactions (G×E). We propose that computer simulations leveraging field trial data, DNA sequences, and historical weather records can be used to tackle the longstanding problem of predicting cultivars’ future performances under largely uncertain weather conditions. We present a computer simulation platform that uses Monte Carlo methods to integrate uncertainty about future weather conditions and model parameters. We use extensive experimental wheat yield data (n = 25,841) to learn G×E patterns and validate, using left-trial-out cross-validation, the predictive performance of the model. Subsequently, we use the fitted model to generate circa 143 million grain yield data points for 28 wheat genotypes in 16 locations in France, over 16 years of historical weather records. The phenotypes generated by the simulation platform have multiple downstream uses; we illustrate this by predicting the distribution of expected yield at 448 cultivar-location combinations and performing means-stability analyses.
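The Monte Carlo idea in this abstract, integrating over uncertainty in both future weather and model parameters, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the linear yield response, the single standardized weather index, the Gaussian parameter uncertainty, and all names and numbers. The platform described in the abstract is far richer.

```python
import random
from statistics import fmean, stdev

def simulate_yields(base_yield, sensitivity_mean, sensitivity_sd,
                    weather_records, n_draws, seed=0):
    """Draw yields under weather and parameter uncertainty (toy linear response)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Weather uncertainty: resample one year from the historical records.
        weather = rng.choice(weather_records)
        # Parameter uncertainty: sample the genotype's weather sensitivity.
        sensitivity = rng.gauss(sensitivity_mean, sensitivity_sd)
        draws.append(base_yield + sensitivity * weather)
    return draws

# Hypothetical standardized weather index, one value per historical year.
weather_records = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]

draws = simulate_yields(base_yield=7.0, sensitivity_mean=0.8, sensitivity_sd=0.2,
                        weather_records=weather_records, n_draws=1000)
print(round(fmean(draws), 2), round(stdev(draws), 2))
```

Running the function once per cultivar-location combination would give a distribution of expected yield for each, from which a mean and a stability measure such as the standard deviation can be read off.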
The ability to predict traits from genome-wide sequence information (i.e., genomic prediction) has improved our understanding of the genetic basis of complex traits and transformed breeding practices. Transcriptome data may also be useful for genomic prediction. However, it remains unclear how well transcript levels can predict traits, particularly when traits are scored at different development stages. Using maize (Zea mays) genetic markers and transcript levels from seedlings to predict mature plant traits, we found that transcript and genetic marker models have similar performance. When the transcripts and genetic markers with the greatest weights (i.e., the most important) in those models were used in one joint model, performance increased. Furthermore, genetic markers important for predictions were not close to or identified as regulatory variants for important transcripts. These findings demonstrate that transcript levels are useful for predicting traits and that their predictive power is not simply due to genetic variation in the transcribed genomic regions. Finally, genetic marker models identified only 1 of 14 benchmark flowering-time genes, while transcript models identified 5. These data highlight that, in addition to being useful for genomic prediction, transcriptome data can provide a link between traits and variation that cannot be readily captured at the sequence level.
Modern biobanks that collect genotype-phenotype information from hundreds of thousands of individuals bring unprecedented opportunities for genomic...
The concept of haplotype blocks has been shown to be useful in genetics. Fields of application range from the detection of regions under positive selection to statistical methods that make use of dimension reduction...