Mitchell Gill scite author profile

Genomic prediction tools support crop breeding based on statistical methods, such as genomic best linear unbiased prediction (GBLUP). However, these tools are not designed to capture non-linear relationships within multi-dimensional datasets, or to deal with high-dimensional datasets such as imagery collected by unmanned aerial vehicles. Machine learning (ML) algorithms have the potential to surpass the prediction accuracy of current tools used for genotype-to-phenotype prediction, due to their capacity to autonomously extract data features and represent their relationships at multiple levels of abstraction. This review addresses the challenges of applying statistical and machine learning methods for predicting phenotypic traits based on genetic markers, environment data, and imagery for crop breeding. We present the advantages and disadvantages of explainable model structures, discuss the potential of machine learning models for genotype-to-phenotype prediction in crop breeding, and outline the challenges, including the scarcity of high-quality datasets, inconsistent metadata annotation, and the requirements of ML models.

Pangenomes as a Resource to Accelerate Breeding of Under-Utilised Crop Species

Fernandez, Nestor, Danilevicz et al. 2022. IJMS

Pangenomes are a rich resource to examine the genomic variation observed within a species or genera, supporting population genetics studies, with applications for the improvement of crop traits. Major crop species such as maize (Zea mays), rice (Oryza sativa), Brassica (Brassica spp.), and soybean (Glycine max) have had pangenomes constructed and released, and this has led to the discovery of valuable genes associated with disease resistance and yield components. However, pangenome data are not available for many less prominent crop species that are currently under-utilised. Despite many under-utilised species being important food sources in regional populations, the scarcity of genomic data for these species hinders their improvement. Here, we assess several under-utilised crops and review the pangenome approaches that could be used to build resources for their improvement. Many of these under-utilised crops are cultivated in arid or semi-arid environments, suggesting that novel genes related to drought tolerance may be identified and used for introgression into related major crop species. In addition, we discuss how previously collected data could be used to enrich pangenome functional analysis in genome-wide association studies (GWAS) based on studies in major crops. Considering the technological advances in genome sequencing, pangenome references for under-utilised species are becoming more obtainable, offering the opportunity to identify novel genes related to agro-morphological traits in these species.

Machine learning models outperform deep learning models, provide interpretation and facilitate feature selection for soybean trait prediction

Gill, Anderson et al. 2022. BMC Plant Biol

Recent growth in crop genomic and trait data has opened opportunities for the application of novel approaches to accelerate crop improvement. Machine learning and deep learning are at the forefront of prediction-based data analysis. However, few approaches for genotype-to-phenotype prediction compare machine learning with deep learning and further interpret the models that support the predictions. This study uses genome-wide molecular markers and traits across 1110 soybean individuals to develop accurate prediction models. For 13 of 14 sets of predictions, XGBoost or random forest outperformed deep learning models in prediction performance. The top-ranked SNPs by F-score were identified from XGBoost and, on further investigation, were found to overlap with significantly associated loci identified from GWAS and in previous literature. Feature importance rankings were used to reduce marker input by up to 90%, and subsequent models maintained or improved their prediction performance. These findings support interpretable machine learning as an approach for genomic-based prediction of traits in soybean and other crops.
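The marker-reduction step described in this abstract (rank markers by importance, then keep only the top fraction) can be illustrated with a minimal sketch. This is not the study's code: the study ranked SNPs by XGBoost F-score, whereas the sketch below substitutes the absolute Pearson correlation between each marker and the trait as a simple stand-in importance score so that it runs with the standard library alone. The function names, the 0/1/2 genotype coding, and the toy data are all assumptions made for illustration.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def select_top_markers(genotypes, trait, keep_fraction=0.1):
    """Return the column indices of the highest-scoring markers.

    genotypes: one row per individual, one column per marker
               (hypothetical 0/1/2 allele-count coding).
    trait:     one phenotype value per individual.
    keep_fraction=0.1 corresponds to a 90% reduction in marker input.
    """
    n_markers = len(genotypes[0])
    # Stand-in importance score (assumption): |Pearson r| per marker.
    scores = [
        abs_pearson([row[j] for row in genotypes], trait)
        for j in range(n_markers)
    ]
    n_keep = max(1, round(n_markers * keep_fraction))
    ranked = sorted(range(n_markers), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def subset_markers(genotypes, kept):
    """Keep only the selected marker columns for model retraining."""
    return [[row[j] for j in kept] for row in genotypes]


# Toy data (made up): 6 individuals, 10 markers; marker 3 tracks the trait.
toy_genotypes = [
    [0, 0, 1, 0, 2, 0, 1, 0, 0, 2],
    [1, 0, 1, 1, 2, 0, 0, 0, 2, 2],
    [0, 2, 1, 2, 0, 0, 1, 0, 0, 2],
    [2, 0, 1, 0, 1, 0, 0, 0, 1, 2],
    [0, 1, 1, 1, 0, 0, 1, 0, 2, 2],
    [1, 0, 1, 2, 1, 0, 0, 0, 0, 2],
]
toy_trait = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

kept = select_top_markers(toy_genotypes, toy_trait, keep_fraction=0.1)
reduced = subset_markers(toy_genotypes, kept)
```

In practice the importance score would come from the fitted model itself, and the reduced marker matrix would be used to retrain and re-evaluate the model on held-out data.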

An SGSGeneloss-Based Method for Constructing a Gene Presence–Absence Table Using Mosdepth

Fernandez, Marsh, Nestor et al. 2022

DNABERT-based explainable lncRNA identification in plant genome assemblies

Danilevicz, Gill, Fernandez et al. 2022. Preprint

Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, being involved in both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning and deep learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, these models often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, and that genome assembly quality affects the accuracy of lncRNA identification. Furthermore, we demonstrated that the NLP models are applicable for cross-species prediction, as they could predict lncRNAs from a species not used to train the model with an average accuracy of 61%. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important for lncRNA prediction, and that these motifs were frequently present flanking the lncRNA sequence.
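To make "NLP models on genomic sequences" concrete: DNABERT-style models treat a DNA sequence as a sentence of overlapping k-mer "words". The sketch below shows only that tokenization idea. The abstract does not state the preprocessing used in this preprint, so the function name, the default k=6, and the upper-casing step are assumptions for illustration.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper: the default k=6 is an assumption, not a value
    taken from the preprint.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Made-up example sequence, tokenized with k=3 for readability.
tokens = kmer_tokens("atgcgt", k=3)
sentence = " ".join(tokens)
```

The space-joined string is the kind of "sentence" that would then be passed to a language-model tokenizer and classifier.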

Producing High-Quality Single Nucleotide Polymorphism Data for Genome-Wide Association Studies

Bayer, Gill, Danilevicz et al. 2022

Viral and Host Genetic Factors

McCombe, Acharjee, Gill et al. 2011

Crop breeding for a changing climate: integrating phenomics and genomics with bioinformatics

Plant Genotype to Phenotype Prediction Using Machine Learning
