Abstract: Parallel reporter assays provide rich data to decipher gene regulatory regions with deep learning. Here we introduce LegNet, a convolutional network architecture that secured first place for our autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. To construct LegNet, we drew inspiration from EfficientNetV2 and reformulated the sequence-to-expression regression problem as a soft-classification task. Here, with published data, we demonstrate that…
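The soft-classification reformulation mentioned in the abstract can be sketched as follows: instead of regressing a single scalar, the expression value is spread over a grid of bins, with probability mass split between the two nearest bin centers. This is a minimal illustrative sketch; the bin count and value range below are hypothetical, not LegNet's exact settings.

```python
import numpy as np

def soft_targets(y, n_bins=10, lo=0.0, hi=18.0):
    """Convert scalar expression values into soft class targets.

    Each value is mapped onto a grid of n_bins bin centers; probability
    mass is split between the two nearest bins in proportion to distance,
    so the expectation over bins recovers the original value.
    (Illustrative sketch only; bin count and range are hypothetical.)
    """
    centers = np.linspace(lo, hi, n_bins)
    y = np.clip(np.asarray(y, dtype=float), lo, hi)
    # fractional position of each value on the bin grid
    pos = (y - lo) / (hi - lo) * (n_bins - 1)
    left = np.floor(pos).astype(int)
    right = np.minimum(left + 1, n_bins - 1)
    frac = pos - left
    targets = np.zeros((y.size, n_bins))
    targets[np.arange(y.size), left] += 1.0 - frac
    targets[np.arange(y.size), right] += frac
    return centers, targets

centers, t = soft_targets([9.0], n_bins=10)
# the expectation over bins reconstructs the input value
assert abs((t[0] * centers).sum() - 9.0) < 1e-9
```

Training then uses a cross-entropy loss against these soft targets, and a scalar prediction is read out as the expectation over bin centers.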
“…Our results also support the notion that epigenetic divergence is primarily driven by sequence divergence. While neural networks have shown promise in predicting epigenetic features and gene expression levels from DNA sequence 39,41,49, there is still a gap between current approaches and experiment-level predictions. While recent advances have been considerable, work in neural network scaling suggests improvements in model accuracy grow following a power law, requiring an exponential increase in both model and dataset size 50.…”
Section: Discussion
“…8a). We adapt LegNet 39, which has achieved state-of-the-art prediction accuracy for short-sequence MPRA activity, to this task. We trained our model on three species and evaluated on a fourth, unseen species (Fig.…”
Section: Deep Learning Models Predict Cell-type Specific Chromatin Ac...
“…We trained a deep learning model to predict open chromatin based on the architecture of LegNet 39. In short, our model takes as input a 512 base-pair bin of DNA sequence and predicts the log2(RPKM+1)-normalized chromatin accessibility within that 512 base-pair bin, as well as binary peak calls across all cell types.…”
Section: Cross-species Open Chromatin Legnet
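The target transform described in the quoted methods can be written out explicitly. This is a sketch of standard RPKM normalization followed by a log2(x+1) transform, assuming simple per-bin read counts and a known library size; the quoted text does not spell out the read-counting details, and the default library size below is hypothetical.

```python
import numpy as np

def accessibility_targets(counts, bin_bp=512, total_mapped=30e6):
    """log2(RPKM + 1) regression targets for fixed-size accessibility bins.

    counts: reads overlapping each bin; bin_bp: bin length (512 in the
    quoted setup); total_mapped: total mapped reads in the library (the
    30e6 default is a hypothetical value, not from the paper).
    RPKM = reads / (bin length in kb) / (library size in millions).
    """
    counts = np.asarray(counts, dtype=float)
    rpkm = counts / (bin_bp / 1e3) / (total_mapped / 1e6)
    return np.log2(rpkm + 1.0)

# an empty bin maps to exactly 0 on the log scale
assert float(accessibility_targets([0.0])[0]) == 0.0
```

The +1 pseudocount keeps empty bins at zero and compresses the dynamic range, which is a common choice for count-derived regression targets.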
Sequence divergence of cis-regulatory elements drives species-specific traits, but how this manifests in the evolution of the neocortex at the molecular and cellular level remains to be elucidated. We investigated the gene regulatory programs in the primary motor cortex of human, macaque, marmoset, and mouse with single-cell multiomics assays, generating gene expression, chromatin accessibility, DNA methylome, and chromosomal conformation profiles from a total of over 180,000 cells. For each modality, we determined species-specific, divergent, and conserved gene expression and epigenetic features at multiple levels. We find that cell type-specific gene expression evolves more rapidly than broadly expressed genes and that epigenetic status at distal candidate cis-regulatory elements (cCREs) evolves faster than promoters. Strikingly, transposable elements (TEs) contribute to nearly 80% of the human-specific cCREs in cortical cells. Through machine learning, we develop sequence-based predictors of cCREs in different species and demonstrate that the genomic regulatory syntax is highly preserved from rodents to primates. Lastly, we show that epigenetic conservation combined with sequence similarity helps uncover functional cis-regulatory elements and enhances our ability to interpret genetic variants contributing to neurological disease and traits.
“…LegNets (Penzar et al., 2022): As mentioned in Section 2, LegNets were the best predictors of PE in yeast in the DREAM challenge. We benchmark two LegNets: one with the same structure as the model that won the challenge, and a larger one with more filters in every convolutional layer.…”
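"More filters in every convolutional layer" translates directly into parameter count. A small sketch (with hypothetical channel widths and kernel size, not the actual benchmark configurations) shows how doubling every width roughly quadruples the convolutional parameters:

```python
def conv1d_param_count(channels, kernel=7, in_ch=4):
    """Parameters of a plain 1-D conv stack over one-hot DNA (4 channels).

    channels: output width of each successive conv layer. Widths and
    kernel size here are illustrative placeholders.
    """
    total = 0
    for out_ch in channels:
        total += in_ch * out_ch * kernel + out_ch  # weights + biases
        in_ch = out_ch
    return total

base = [64, 128, 128]          # hypothetical baseline widths
wide = [2 * c for c in base]   # "more filters in every convolutional layer"
ratio = conv1d_param_count(wide) / conv1d_param_count(base)
assert 3.5 < ratio < 4.1  # doubling all widths ~quadruples conv parameters
```

The near-quadratic growth comes from the weight tensor of each inner layer scaling with both its input and output widths.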
Advances in gene delivery technologies are enabling rapid progress in molecular medicine, but require precise expression of genetic cargo in desired cell types, which is predominantly achieved via a regulatory DNA sequence called a promoter; however, only a handful of cell type-specific promoters are known. Efficiently designing compact promoter sequences with a high density of regulatory information by leveraging machine learning models would therefore be broadly impactful for fundamental research and direct therapeutic applications. However, models of expression from such compact promoter sequences are lacking, despite the recent success of deep learning in modelling expression from endogenous regulatory sequences. Despite the lack of large datasets measuring promoter-driven expression in many cell types, data from a few well-studied cell types or from endogenous gene expression may provide relevant information for transfer learning, which has not yet been explored in this setting. Here, we evaluate a variety of pretraining tasks and transfer strategies for modelling cell type-specific expression from compact promoters and demonstrate the effectiveness of pretraining on existing promoter-driven expression datasets from other cell types. Our approach is broadly applicable for modelling promoter-driven expression in any data-limited cell type of interest, and will enable the use of model-based optimization techniques for promoter design for gene delivery applications. Our code and data are available at https://github.com/anikethjr/promoter_models.
“…The advent of next-generation sequencing and additional high-throughput technologies has catalyzed the accumulation and public deposition of extensive databases, rich with functional genomic elements, enabling the broad application of computational methods to large-scale genomic data analysis [2]. We, along with others [3], have successfully employed machine-learning methods, including ensemble learning [4] and convolutional neural networks [5, 6], for this purpose. However, while potent, these approaches encounter constraints in identifying long-range dependencies within DNA sequences, a common phenomenon in human and other eukaryotic genomes [7].…”
The field of genomics has seen substantial advancements through the application of artificial intelligence (AI), with machine learning revealing the potential to interpret genomic sequences without necessitating an exhaustive experimental analysis of all the intricate and interconnected molecular processes involved in DNA functioning. However, precise decoding of genomic sequences demands the comprehension of rich contextual information spread over thousands of nucleotides. Presently, only a few architectures exist that can process such extensive inputs, and they require exceptional computational resources. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 thousand base pairs. We offer pre-trained versions of GENA-LM and demonstrate their capacity for fine-tuning to address complex biological questions with modest computational requirements. We also illustrate diverse applications of GENA-LM for various downstream genomic tasks, showcasing its performance in either matching or exceeding that of prior models, whether task-specific or universal. All models are publicly accessible on GitHub https://github.com/AIRI-Institute/GENA_LM and as pre-trained models with the gena-lm- prefix on HuggingFace https://huggingface.co/AIRI-Institute.