An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat

Grinberg, Nastasiya F.; Orhobor, Oghenejokpeme I.; King, Ross D.

doi:10.1007/s10994-019-05848-5

Cited by 104 publications

(61 citation statements)

References 86 publications

“…GS is fundamentally different from GWAS, as it involves use a full-genome information, regardless of its significance, in relation to a specific trait, rather than a few markers as in GWAS. This genotypic information, collected from training and validation population, is used in conjunction with corresponding phenotypic data, collected from training population, to develop a predictive model [12,14]. In forest tree breeding programs, GWAS and GS could substantially reduce the length of breeding cycles and increase genetic gain per unit time through early selection of superior genotypes during the juvenile phase.…”

Section: Introduction (mentioning)

confidence: 99%

“…Since these statistical methods cannot explicitly account for interactions among single nucleotide polymorphisms (SNPs), application of Machine Learning in GS studies has been proposed. Machine Learning is being increasingly applied in GS studies because it does not require any assumptions about the underlying traits, it is easy to use, and it can both capture complex non-linear relationships and efficiently increase prediction accuracy [14]. Popular Machine Learning methods include Random Forest (RF), Extreme Gradient Boosting (XgBoost) and Bayesian Additive Regression Tree (BART) modelling.…”

Section: Introduction (mentioning)

confidence: 99%

Potential of Genome-Wide Association Studies and Genomic Selection to Improve Productivity and Quality of Commercial Timber Species in Tropical Rainforest, a Case Study of Shorea platyclados

Sawitri, Tani, Na’iem, et al. 2020. Forests

Shorea platyclados (Dark Red Meranti) is a commercially important timber tree species in Southeast Asia. However, its stocks have dramatically declined due, inter alia, to excessive logging, insufficient natural regeneration and a slow recovery rate. Thus, there is a need to promote enrichment planting and develop effective technique to support its rehabilitation and improve timber production through implementation of Genome-Wide Association Studies (GWAS) and Genomic Selection (GS). To assist such efforts, plant materials were collected from a half-sib progeny population in Sari Bumi Kusuma forest concession, Kalimantan, Indonesia. Using 5900 markers in sequences obtained from 356 individuals, we detected high linkage disequilibrium (LD) extending up to >145 kb, suggesting that associations between phenotypic traits and markers in LD can be more easily and feasibly detected with GWAS than with analysis of quantitative trait loci (QTLs). However, the detection power of GWAS seems low, since few single nucleotide polymorphisms linked to any focal traits were detected with a stringent false discovery rate, indicating that the species’ phenotypic traits are mostly under polygenic quantitative control. Furthermore, Machine Learning provided higher prediction accuracies than Bayesian methods. We also found that stem diameter, branch diameter ratio and wood density were more predictable than height, clear bole, branch angle and wood stiffness traits. Our study suggests that GS has potential for improving the productivity and quality of S. platyclados, and our genomic heritability estimates may improve the selection of traits to target in future breeding of this species.

“…Previously proposed approaches for learning the imputation rules are based on regularized linear models [11][12][13][14], polygenic risk scores [11] and using the top SNP to predict expression levels [12]. However, the machine learning literature has shown that alternative approaches such as random forests (RF), which allow naturally for non-linear and non-additive effects, can produce more accurate predictions in model organisms [15,16]. We set out to explore whether using RF could also lead to better gene expression predictions in humans and, if so, whether that could be translated into a more powerful TWAS.…”

Section: Introduction (mentioning)

confidence: 99%

“…We also sought to take advantage of the fact that expression levels of a given gene in different cell types can be correlated by considering expression values across multiple cell types simultaneously in a multi-task framework. This has been shown to improve multi-trait predictions in yeast [16] and in applications to real and simulated data in marker-assisted selection for several related traits [17][18][19] or populations [20]. Multi-trait approaches have also been used to analyse eQTL datasets [21,22].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-tissue transcriptome-wide association studies

Grinberg, Wallace. 2020. Preprint. Self Cite

Many genetic mutations affecting phenotypes are presumed to do so via altering gene expression in particular cells or tissues, but identifying the specific genes involved has been challenging. A transcriptome-wide association study (TWAS) attempts to identify disease associated genes by first learning a predictive model on an eQTL dataset and then imputing gene expression levels into a larger genome-wide association study (GWAS). Finally, associations between predicted gene expressions and GWAS phenotype are identified. Here, we compared tree-based machine learning (ML) method of random forests (RF) with more widely used linear methods of lasso, ridge, and elastic net regression, for prediction of gene expression. We also developed a multi-task learning extension to RF which simultaneously makes use of information from multiple tissues (RF-MTL) and compared it to a multi-dataset version of lasso, the joint lasso, and to a single tissue RF. We found that for prediction of gene expression, RF, in general, outperformed linear approaches on our chosen eQTL dataset and that multi-tissue methods generally outperformed their single-tissue counterparts, with RF-MTL performing the best. Simulations showed that these benefits generally propagated to the next steps of the analysis, although highlighted that joint lasso had a tendency to erroneously identify genes in one tissue if there existed a disease signal for that gene in another. We tested all four methods on type 1 diabetes (T1D) GWAS and expression data for several immune cells and found that 46 genes were identified by at least one method, though only 7 by all methods. Joint lasso discovered the most T1D-associated genes, including 15 unique to that method, but this may reflect its higher false positive rate due to “overborrowing” information across tissues. RF-MTL found more unique associated genes than RF for 3 out 5 tissues. Compared to lasso-based analysis, the RF gene list was more likely to relate to T1D in an analysis of independent data types. We conclude that RF, both single- and multi-task version, is competitive and, for some cell types, superior to linear models conventionally used in the TWAS studies.

“…Machine learning algorithms are increasingly being adapted for the prediction of plant phenotypes (Grinberg et al. 2016, 2019). This task is most commonly regression based as most agronomic phenotypes are quantitative.…”

Section: Introduction (mentioning)

confidence: 99%

Predicting rice phenotypes with meta and multi-target learning

2020. Self Cite

The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case.
