Abstract: Parallel reporter assays provide rich data to decipher gene regulatory regions with deep learning. Here we introduce LegNet, a convolutional network architecture that secured first place for our autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. To construct LegNet, we drew inspiration from EfficientNetV2 and reformulated the sequence-to-expression regression problem as a soft-classification task. Here, with published data, we demonstrate that…
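The soft-classification reformulation mentioned in the abstract can be sketched as follows: instead of regressing a single scalar, the expression value is spread over a grid of bins, with probability mass split between the two nearest bin centers. This is a minimal illustrative sketch; the bin count and value range below are hypothetical, not LegNet's exact settings.

```python
import numpy as np

def soft_targets(y, n_bins=10, lo=0.0, hi=18.0):
    """Convert scalar expression values into soft class targets.

    Each value is mapped onto a grid of n_bins bin centers; probability
    mass is split between the two nearest bins in proportion to distance,
    so the expectation over bins recovers the original value.
    (Illustrative sketch only; bin count and range are hypothetical.)
    """
    centers = np.linspace(lo, hi, n_bins)
    y = np.clip(np.asarray(y, dtype=float), lo, hi)
    # fractional position of each value on the bin grid
    pos = (y - lo) / (hi - lo) * (n_bins - 1)
    left = np.floor(pos).astype(int)
    right = np.minimum(left + 1, n_bins - 1)
    frac = pos - left
    targets = np.zeros((y.size, n_bins))
    targets[np.arange(y.size), left] += 1.0 - frac
    targets[np.arange(y.size), right] += frac
    return centers, targets

centers, t = soft_targets([9.0], n_bins=10)
# the expectation over bins reconstructs the input value
assert abs((t[0] * centers).sum() - 9.0) < 1e-9
```

Training then uses a cross-entropy loss against these soft targets, and a scalar prediction is read out as the expectation over bin centers.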
“…Our results also support the notion that epigenetic divergence is primarily driven by sequence divergence. While neural networks have shown promise in predicting epigenetic features and gene expression levels from DNA sequence 39,41,49, there is still a gap between current approaches and experiment-level predictions. While recent advances have been considerable, work in neural network scaling suggests improvements in model accuracy grow following a power law, requiring an exponential increase in both model and dataset size 50.…”
Section: Discussion
“…8a). We adapt LegNet 39, which has achieved state-of-the-art prediction accuracy for short-sequence MPRA activity, to this task. We trained our model on three species and evaluated on a fourth, unseen species (Fig.…”
Section: Deep Learning Models Predict Cell-type Specific Chromatin Ac...
“…We trained a deep learning model to predict open chromatin based on the architecture of LegNet 39. In short, our model takes as input a 512 base-pair bin of DNA sequence and predicts the log2(RPKM+1)-normalized chromatin accessibility within that 512 base-pair bin, as well as binary peak calls across all cell types.…”
Section: Cross-species Open Chromatin Legnet
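The target transform described in the quoted methods can be written out explicitly. This is a sketch of standard RPKM normalization followed by a log2(x+1) transform, assuming simple per-bin read counts and a known library size; the quoted text does not spell out the read-counting details, and the default library size below is hypothetical.

```python
import numpy as np

def accessibility_targets(counts, bin_bp=512, total_mapped=30e6):
    """log2(RPKM + 1) regression targets for fixed-size accessibility bins.

    counts: reads overlapping each bin; bin_bp: bin length (512 in the
    quoted setup); total_mapped: total mapped reads in the library (the
    30e6 default is a hypothetical value, not from the paper).
    RPKM = reads / (bin length in kb) / (library size in millions).
    """
    counts = np.asarray(counts, dtype=float)
    rpkm = counts / (bin_bp / 1e3) / (total_mapped / 1e6)
    return np.log2(rpkm + 1.0)

# an empty bin maps to exactly 0 on the log scale
assert float(accessibility_targets([0.0])[0]) == 0.0
```

The +1 pseudocount keeps empty bins at zero and compresses the dynamic range, which is a common choice for count-derived regression targets.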
Sequence divergence of cis-regulatory elements drives species-specific traits, but how this manifests in the evolution of the neocortex at the molecular and cellular level remains to be elucidated. We investigated the gene regulatory programs in the primary motor cortex of human, macaque, marmoset, and mouse with single-cell multiomics assays, generating gene expression, chromatin accessibility, DNA methylome, and chromosomal conformation profiles from a total of over 180,000 cells. For each modality, we determined species-specific, divergent, and conserved gene expression and epigenetic features at multiple levels. We find that cell type-specific gene expression evolves more rapidly than broadly expressed genes and that epigenetic status at distal candidate cis-regulatory elements (cCREs) evolves faster than promoters. Strikingly, transposable elements (TEs) contribute to nearly 80% of the human-specific cCREs in cortical cells. Through machine learning, we develop sequence-based predictors of cCREs in different species and demonstrate that the genomic regulatory syntax is highly preserved from rodents to primates. Lastly, we show that epigenetic conservation combined with sequence similarity helps uncover functional cis-regulatory elements and enhances our ability to interpret genetic variants contributing to neurological disease and traits.
“…LegNets (Penzar et al., 2022): As mentioned in Section 2, LegNets were the best predictors of PE in yeast in the DREAM challenge. We benchmark two LegNets: one with the same structure as the model that won the challenge, and a larger one with more filters in every convolutional layer.…”
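"More filters in every convolutional layer" translates directly into parameter count. A small sketch (with hypothetical channel widths and kernel size, not the actual benchmark configurations) shows how doubling every width roughly quadruples the convolutional parameters:

```python
def conv1d_param_count(channels, kernel=7, in_ch=4):
    """Parameters of a plain 1-D conv stack over one-hot DNA (4 channels).

    channels: output width of each successive conv layer. Widths and
    kernel size here are illustrative placeholders.
    """
    total = 0
    for out_ch in channels:
        total += in_ch * out_ch * kernel + out_ch  # weights + biases
        in_ch = out_ch
    return total

base = [64, 128, 128]          # hypothetical baseline widths
wide = [2 * c for c in base]   # "more filters in every convolutional layer"
ratio = conv1d_param_count(wide) / conv1d_param_count(base)
assert 3.5 < ratio < 4.1  # doubling all widths ~quadruples conv parameters
```

The near-quadratic growth comes from the weight tensor of each inner layer scaling with both its input and output widths.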
Advances in gene delivery technologies are enabling rapid progress in molecular medicine, but require precise expression of genetic cargo in desired cell types, which is predominantly achieved via a regulatory DNA sequence called a promoter; however, only a handful of cell type-specific promoters are known. Efficiently designing compact promoter sequences with a high density of regulatory information by leveraging machine learning models would therefore be broadly impactful for fundamental research and direct therapeutic applications. However, models of expression from such compact promoter sequences are lacking, despite the recent success of deep learning in modelling expression from endogenous regulatory sequences. Despite the lack of large datasets measuring promoter-driven expression in many cell types, data from a few well-studied cell types or from endogenous gene expression may provide relevant information for transfer learning, which has not yet been explored in this setting. Here, we evaluate a variety of pretraining tasks and transfer strategies for modelling cell type-specific expression from compact promoters and demonstrate the effectiveness of pretraining on existing promoter-driven expression datasets from other cell types. Our approach is broadly applicable for modelling promoter-driven expression in any data-limited cell type of interest, and will enable the use of model-based optimization techniques for promoter design for gene delivery applications. Our code and data are available at https://github.com/anikethjr/promoter_models.
“…The advent of next-generation sequencing and additional high-throughput technologies has catalyzed the accumulation and public deposition of extensive databases, rich with functional genomic elements, enabling the broad application of computational methods to large-scale genomic data analysis [2]. We, along with others [3], have successfully employed machine-learning methods, including ensemble learning [4] and convolutional neural networks [5, 6], for this purpose. However, while potent, these approaches encounter constraints in identifying long-range dependencies within DNA sequences, a common phenomenon in human and other eukaryotic genomes [7].…”
The field of genomics has seen substantial advancements through the application of artificial intelligence (AI), with machine learning revealing the potential to interpret genomic sequences without necessitating an exhaustive experimental analysis of all the intricate and interconnected molecular processes involved in DNA functioning. However, precise decoding of genomic sequences demands the comprehension of rich contextual information spread over thousands of nucleotides. Presently, only a few architectures exist that can process such extensive inputs, and they require exceptional computational resources. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 thousand base pairs. We offer pre-trained versions of GENA-LM and demonstrate their capacity for fine-tuning to address complex biological questions with modest computational requirements. We also illustrate diverse applications of GENA-LM for various downstream genomic tasks, showcasing its performance in either matching or exceeding that of prior models, whether task-specific or universal. All models are publicly accessible on GitHub https://github.com/AIRI-Institute/GENA_LM and as pre-trained models with the gena-lm- prefix on HuggingFace https://huggingface.co/AIRI-Institute.