Summary: Plant genomes demonstrate significant presence/absence variation (PAV) within a species; however, the factors that lead to this variation have not been studied systematically in Brassica across diploids and polyploids. Here, we developed pangenomes of polyploid Brassica napus and its two diploid progenitor genomes B. rapa and B. oleracea to infer how PAV may differ between diploids and polyploids. Modelling of gene loss suggests that loss propensity is primarily associated with transposable elements in the diploids while in B. napus, gene loss propensity is associated with homoeologous recombination. We use these results to gain insights into the different causes of gene loss, both in diploids and following polyploidization, and pave the way for the application of machine learning methods to understanding the underlying biological and physical causes of gene presence/absence.
Genomic selection approaches have increased the speed of plant breeding, leading to growing crop yields over the last decade. However, climate change is impacting current and future yields, resulting in the need to further accelerate breeding efforts to cope with these changing conditions. Here we present approaches to accelerate plant breeding and incorporate nonadditive effects in genomic selection by applying state-of-the-art machine learning approaches. These approaches are made more powerful by the inclusion of pangenomes, which represent the entire genome content of a species. Understanding the strengths and limitations of machine learning methods, compared with more traditional genomic selection efforts, is paramount to the successful application of these methods in crop breeding. We describe examples of genomic selection and pangenome-based approaches in crop breeding, discuss machine learning-specific challenges, and highlight the potential for the application of machine learning in genomic selection. We believe that careful implementation of machine learning approaches will support crop improvement to help counter the adverse outcomes of climate change on crop production.
Genomic prediction tools support crop breeding based on statistical methods, such as the genomic best linear unbiased prediction (GBLUP). However, these tools are not designed to capture non-linear relationships within multi-dimensional datasets, or to deal with high-dimensional datasets such as imagery collected by unmanned aerial vehicles. Machine learning (ML) algorithms have the potential to surpass the prediction accuracy of current tools used for genotype-to-phenotype prediction, due to their capacity to autonomously extract data features and represent their relationships at multiple levels of abstraction. This review addresses the challenges of applying statistical and machine learning methods for predicting phenotypic traits based on genetic markers, environment data, and imagery for crop breeding. We present the advantages and disadvantages of explainable model structures, and discuss both the potential of machine learning models for genotype-to-phenotype prediction in crop breeding and the challenges, including the scarcity of high-quality datasets, inconsistent metadata annotation and the requirements of ML models.
Key message: Quantitative resistance (QR) loci discovered through genetic and genomic analyses are abundant in the Brassica napus genome, providing an opportunity for their utilization in enhancing blackleg resistance.
Abstract: Quantitative resistance (QR) has long been utilized to manage blackleg in Brassica napus (canola, oilseed rape), even before major resistance genes (R-genes) were extensively explored in breeding programmes. In contrast to R-gene-mediated qualitative resistance, QR reduces blackleg symptoms rather than completely eliminating the disease. As a polygenic trait, QR is controlled by numerous genes with modest effects, which exerts less pressure on the pathogen to evolve; hence, its effectiveness is more durable compared to R-gene-mediated resistance. Furthermore, combining QR with major R-genes has been shown to enhance resistance against diseases in important crops, including oilseed rape. For these reasons, there has been a renewed interest among breeders in utilizing QR in crop improvement. However, the mechanisms governing QR are largely unknown, limiting its deployment. Advances in genomics are facilitating the dissection of the genetic and molecular underpinnings of QR, resulting in the discovery of several loci and genes that can be potentially deployed to enhance blackleg resistance. Here, we summarize the efforts undertaken to identify blackleg QR loci in oilseed rape using linkage and association analysis. We update the knowledge on the possible mechanisms governing QR and the advances in searching for the underlying genes. Lastly, we lay out strategies to accelerate the genetic improvement of blackleg QR in oilseed rape using improved phenotyping approaches and genomic prediction tools.
Recent growth in crop genomic and trait data has opened opportunities for the application of novel approaches to accelerate crop improvement. Machine learning and deep learning are at the forefront of prediction-based data analysis. However, few approaches for genotype-to-phenotype prediction compare machine learning with deep learning and further interpret the models that support the predictions. This study uses genome-wide molecular markers and traits across 1110 soybean individuals to develop accurate prediction models. For 13 of 14 sets of predictions, XGBoost or random forest outperformed deep learning models in prediction performance. The top-ranked SNPs by F-score were identified from XGBoost and, on further investigation, were found to overlap with significantly associated loci identified from GWAS and previous literature. Feature importance rankings were used to reduce marker input by up to 90%, and subsequent models maintained or improved their prediction performance. These findings support interpretable machine learning as an approach for genomic-based prediction of traits in soybean and other crops.
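The marker-reduction step described above can be illustrated with a minimal sketch. It assumes importance scores have already been obtained from a trained model (for example XGBoost F-scores); the function name `select_top_markers`, the `keep_fraction` parameter, and the example scores are hypothetical and not taken from the study.

```python
import math


def select_top_markers(importances, keep_fraction=0.1):
    """Return the names of the highest-ranked markers.

    importances: dict mapping marker name -> importance score
                 (hypothetical values; in practice these would come
                 from a trained model such as XGBoost or random forest).
    keep_fraction: share of markers to retain; 0.1 corresponds to a
                   90% reduction in marker input.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Keep at least one marker, rounding the count up.
    n_keep = max(1, math.ceil(len(importances) * keep_fraction))
    # Rank marker names by descending importance score.
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:n_keep]


# Illustrative scores only: 20 markers with increasing importance.
example_scores = {f"snp{i}": float(i) for i in range(20)}
kept = select_top_markers(example_scores, keep_fraction=0.1)
```

A model would then be retrained on the retained markers only and its prediction performance compared with the full-marker model.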
Cheap genome sequencing technology has made it possible to search for genomic variants called single nucleotide polymorphisms (SNPs) for hundreds of individuals. Linking these genomic variants to phenotypes is the main goal in running genome‐wide association studies (GWAS). SNPs can be discovered and called using different technologies and methods, and subsequent quality control must be performed taking into account the species of study and genotyping techniques. GWAS can be performed using different mathematical approaches, demonstrated within the current range of software packages, which are used to perform the GWAS and interpret the subsequent results.
Key Concepts: GWAS is a powerful tool to associate genomic variants with phenotypes. Quality control is core to any good GWAS. Numerous powerful tools now exist to make running a GWAS straightforward. Interpreting the output is still a challenge, especially in the presence of hidden confounding factors. GWAS can be performed using different types of genotyping data, each with its own advantages and disadvantages.
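As a rough illustration of the quality-control and association steps mentioned above, the sketch below filters SNPs by minor allele frequency and then fits a simple per-SNP linear regression of phenotype on genotype. It assumes diploid genotypes coded as 0/1/2 alternate-allele counts with no missing data; the function names, the 0.05 threshold, and the toy data are assumptions for illustration. Established GWAS software typically uses mixed models that correct for population structure and reports p-values, none of which is attempted here.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP, given 0/1/2 alternate-allele counts (diploid)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def slope_and_r2(x, y):
    """Least-squares slope and R^2 of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # No variation in genotype or phenotype: nothing to estimate.
        return 0.0, 0.0
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def single_marker_scan(snps, phenotype, min_maf=0.05):
    """Return {snp_name: (slope, r2)} for SNPs passing the MAF filter."""
    results = {}
    for name, genotypes in snps.items():
        if minor_allele_frequency(genotypes) < min_maf:
            continue  # quality control: drop rare or monomorphic SNPs
        results[name] = slope_and_r2(genotypes, phenotype)
    return results


# Toy data for six individuals (illustrative only).
toy_phenotype = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
toy_snps = {
    "snp_a": [0, 1, 2, 0, 1, 2],  # tracks the phenotype
    "snp_b": [0, 0, 0, 0, 0, 0],  # monomorphic, removed by the MAF filter
    "snp_c": [2, 1, 0, 2, 1, 0],  # inversely related to the phenotype
}
scan = single_marker_scan(toy_snps, toy_phenotype)
```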
Recent advances in long-read sequencing have the potential to produce more complete genome assemblies using sequence reads which can span repetitive regions. However, overlap based assembly methods routinely used for this data require significant computing time and resources. Here, we have developed RefKA, a reference-based approach for long read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step. During benchmarking, we assembled the wheat Chinese Spring (CS) genome using publicly available PacBio reads in parallel in 168 wall hours on a 250 CPU system. The maximum RAM used was 300 Gb and the computing time was 42,000 CPU hours. The approach opens applications for the assembly of other large and complex genomes with much-reduced computing requirements. The RefKA pipeline is
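
The binning and k-mer assignment idea described for RefKA can be sketched in a few lines. This is a toy illustration under stated assumptions, not the RefKA implementation: the function names are hypothetical, bins are fixed-size slices of a single reference string, reverse complements and k-mers spanning bin boundaries are ignored, and each read is assigned to the bin that contributes the most bin-unique k-mers. The per-bin assembly and bin-stitching steps are not shown.

```python
from collections import defaultdict


def split_into_bins(reference, bin_size):
    """Cut the reference sequence into consecutive fixed-size bins."""
    return [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]


def kmers(seq, k):
    """Set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_unique_kmers(bins, k):
    """Map each k-mer that occurs in exactly one bin to that bin's index."""
    owners = defaultdict(set)
    for idx, bin_seq in enumerate(bins):
        for kmer in kmers(bin_seq, k):
            owners[kmer].add(idx)
    return {kmer: next(iter(ids)) for kmer, ids in owners.items() if len(ids) == 1}


def assign_reads(reads, unique, k):
    """Assign each read to the bin with the most matching unique k-mers.

    Reads that contain no bin-unique k-mer are left unassigned.
    """
    assignment = {}
    for name, seq in reads.items():
        votes = defaultdict(int)
        for kmer in kmers(seq, k):
            if kmer in unique:
                votes[unique[kmer]] += 1
        if votes:
            assignment[name] = max(votes, key=votes.get)
    return assignment


# Toy reference and reads (illustrative only).
toy_bins = split_into_bins("AAAACCCCGGGGTTTT", bin_size=8)
toy_unique = bin_unique_kmers(toy_bins, k=4)
toy_reads = {"r1": "AAACCC", "r2": "GGGTTT", "r3": "CCGG"}
toy_assignment = assign_reads(toy_reads, toy_unique, k=4)
```

Once reads are grouped this way, each bin's reads can be assembled independently, which is what allows the work to be spread across many CPUs.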