Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks

Wrightsman, Travis; Marand, Alexandre P.; Crisp, Peter A.; Springer, Nathan M.; Buckler, Edward S.

doi:10.1002/tpg2.20249

Cited by 5 publications (4 citation statements)

References 101 publications (125 reference statements)


“…DanQ is a hybrid convolutional and recurrent neural network specifically designed for predicting the function of DNA sequences. It has demonstrated impressive performance in predicting chromatin states in plant species, making it a suitable choice for our comparative analysis 49 . For each task, the CNN+LSTM model was trained from scratch using one-hot encoded DNA sequences as input.…”

Section: Methods (mentioning)

confidence: 99%

Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model

Zhai, Gokaslan, Schiff et al. 2024

Preprint

Self Cite


Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction through fine-tuning on limited labeled data compared to supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
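
The citation statement above mentions one-hot encoded DNA sequences as the input to the CNN+LSTM baseline. As a minimal sketch of what such an encoding can look like, the snippet below maps each nucleotide to a four-element row. The helper name `one_hot_encode`, the channel order A, C, G, T, and the all-zero treatment of ambiguous bases are assumptions made for illustration; they are not taken from the cited papers.

```python
# Assumed channel order; the cited work may use a different one.
BASES = "ACGT"


def one_hot_encode(seq):
    """Return one 4-element row per nucleotide of `seq`.

    Bases outside A/C/G/T (for example N) become all-zero rows,
    which is one common convention and an assumption here.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGN", one_hot_encode("ACGN")):
        print(base, row)
```

A sequence of length L becomes an L x 4 matrix, which is the shape a convolutional first layer typically consumes.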


“…Though tools like AlphaFold2 (Jumper et al, 2021) have dramatically improved our ability to study coding sequence, similarly performing tools do not yet exist for non-coding regions. Nevertheless, over the last decade deep learning models have rapidly improved performance in predicting non-coding genomic features such as chromatin accessibility (Kelley, 2020; Wrightsman et al, 2022), transcription factor binding (Žiga Avsec, Weilert, et al, 2021; Mejía-Guerra & Buckler, 2019), and RNA abundance (Žiga Avsec, Agarwal, et al, 2021; Linder et al, 2023) directly from DNA sequence. These models can then be queried to highlight functional non-coding sites, which can be useful for filtering large sets of variants down to promising genome editing targets.…”

Section: Introduction (mentioning)

confidence: 99%

Current genomic deep learning architectures generalize across grass species but not alleles

Wrightsman, Ferebee, Romay et al. 2024

Preprint

Self Cite


Non-coding regions of the genome are just as important as coding regions for understanding the mapping from genotype to phenotype. Interpreting deep learning models trained on RNA-seq is an emerging method to highlight functional sites within non-coding regions. Most of the work on RNA abundance models has been done within humans and mice, with little attention paid to plants. Here, we benchmark four genomic deep learning model architectures with genomes and RNA-seq data from 18 species closely related to maize and sorghum within the Andropogoneae. The Andropogoneae are a tribe of C4 grasses that have adapted to a wide range of environments worldwide since diverging 18 million years ago. Hundreds of millions of years of evolution across these species has produced a large, diverse pool of training alleles across species sharing a common physiology. As model input, we extracted 1,026 base pairs upstream of each gene’s translation start site. We held out maize as our test set and two closely related species as our validation set, training each architecture on the remaining Andropogoneae genomes. Within a panel of 26 maize lines, all architectures predict expression across genes moderately well but poorly across alleles. DanQ consistently ranked highest or second highest among all architectures yet performance was generally very similar across architectures despite orders of magnitude differences in size. This suggests that state-of-the-art supervised genomic deep learning models are able to generalize moderately well across related species but not sensitively separate alleles within species, the latter of which agrees with recent work within humans. We are releasing the preprocessed data and code for this work as a community benchmark to evaluate new architectures on our across-species and across-allele tasks.
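
The abstract above describes extracting 1,026 base pairs upstream of each gene's translation start site as model input. The following is a rough sketch of how such a window could be cut from a chromosome string while respecting strand. The function names, the 0-based coordinate convention, and the choice to exclude the start codon itself are assumptions for illustration, not the authors' released code.

```python
# Translation table for complementing bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_window(chrom_seq, start, strand, length=1026):
    """Return up to `length` bases immediately upstream of a start site.

    `start` is the 0-based forward-strand coordinate of the first
    translated base (an assumed convention). The window excludes that
    base and is returned in the gene's own 5'-to-3' orientation.
    It is shorter than `length` if the chromosome end is reached.
    """
    if strand == "+":
        return chrom_seq[max(0, start - length):start]
    if strand == "-":
        window = chrom_seq[start + 1:start + 1 + length]
        return reverse_complement(window)
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # Forward-strand gene: ATG begins at index 6.
    print(upstream_window("AAACCCATGGG", 6, "+", length=3))
    # Reverse-strand gene: start codon appears as CAT at indices 2-4.
    print(upstream_window("GGCATTTGAA", 4, "-", length=3))
```

Returning the window in the gene's own orientation keeps promoter sequences comparable across strands, so a model always sees the start site at the same end of its input.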


“…Generally, deep learning refers to computational methods that aim to learn a hierarchical representation of data by functionally relating the data in many layers [24]. In plant data, these models have learned to predict chromatin state from sequence [25], identify and classify key stress-responsive genes [26], and detect seasonal changes across fields [27]. In addition to these successes, many state-of-the-art models take in not only numerical data but also the structure of the relationships between the data.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize

Ferebee, Buckler 2023

Preprint

Self Cite


Genomic selection and gene editing in crops could be enhanced by multi-species, mechanistic models predicting effects of changes in gene regulation. Current expression abundance prediction models require extensive computational resources, hard-to-measure species-specific training data, and often fail to incorporate data from multiple species. We hypothesize that gene expression prediction models that harness the regulatory network structure of Arabidopsis thaliana transcription factor-target gene interactions will improve on the present maize models. To this end, we collect 147 Oryza sativa and 99 Sorghum bicolor gene expression assays and assign them to maize family-based orthologous groups. Using three popular graph-based machine learning frameworks, including a shallow graph convolutional autoencoder, a deep graph convolutional autoencoder, and the inductive GraphSage strategy, we encode an Arabidopsis thaliana integrated gene regulatory network (iGRN) structure and TF gene expression values to predict gene expression both within and between species. We then evaluate the network methods against a partial least-squares baseline. We find that the baseline gives the best predictions within species, with Spearman correlations averaging between 0.74 and 0.78. The graph autoencoder methods were more variable with correlations between -0.1 and 0.65. In particular, the GraphSage and deep autoencoders performed the worst, and the shallow autoencoders performed the best. In the most challenging prediction context, where predictions were in new species and on genes that were not seen, we found that the shallow graph autoencoder framework averaged around 0.65. Unlike initial thoughts about preserved network structure improving gene expression predictions, this study shows that within-species predictions only need simple models, such as partial least squares, to capture expression variations. In cross-species predictions, the best model is often a more complex strategy utilizing regulatory network structure and other studies' expressions.
