Abstract: Neural networks have proven to be an immensely powerful tool in predicting functional genomic regions, in particular with many recent successes in deciphering gene regulatory logic. However, how model architecture and training strategy choices affect model performance has not been systematically evaluated for genomics models. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding experimentally determined expression…
“…Finally, having standardized datasets and modeling competitions (e.g. DREAM Challenges) will facilitate the continued improvement of model efficiency and accuracy [160][161][162].…”
Gene expression is regulated by transcription factors (TFs) that work together to read cis-regulatory DNA sequences. The "cis-regulatory code" (how cells interpret DNA sequences to determine when, where, and how much genes should be expressed) has proven to be exceedingly complex 1,2. Recently, advances in the scale and resolution of functional genomics assays and machine learning (ML) have enabled significant progress towards deciphering this code 3-6. However, the cis-regulatory code will likely never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. Since the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models 7,8. Here, we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of ML and massively parallel assays using synthetic DNA.
Contributions: CGD and JT conceptualized the paper. CGD produced the first draft, analyzed the data, and made the figures with advice from JT. CGD and JT edited the manuscript.
“…Several studies have embedded such models into algorithms for discovery of new variants using optimization methods [8] and techniques from generative models [14,15,16,17]. Although the current literature has a strong focus on improvements to model architectures that can deliver greater predictive power [18], with the size of sequence-to-expression datasets growing into thousands up to millions of variants, it is becoming increasingly clear that off-the-shelf deep learning architectures such as convolutional neural networks, recurrent neural networks or transformers can readily provide high predictive accuracy [19].…”
The increasing demand for biological products drives many efforts to engineer cells that produce heterologous proteins at maximal yield. Recent advances in massively parallel reporter assays can deliver data suitable for training machine learning models and support the design of microbial strains with optimized protein expression phenotypes. The best performing sequence-to-expression models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Despite their excellent local predictive power, however, such models suffer from a limited ability to generalize predictions far away from the training data. Here, we show that libraries of genetic constructs can have substantially different cluster structure depending on the chosen sequence representation, and demonstrate that such differences can be leveraged to improve generalization performance. Using a large sequence-to-expression dataset from Escherichia coli, we show that non-deep regressors and convolutional neural networks trained on one-hot encodings fail to generalize predictions, and that learned representations using state-of-the-art large language models also struggle with out-of-domain accuracy. In contrast, we show that despite their poorer local performance, mechanistic sequence features such as codon bias, nucleotide content or mRNA stability provide promising gains on model generalization. We explore several strategies to integrate different feature sets into a single predictive model, including feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work suggests that integration of domain-agnostic and domain-aware sequence features offers an unexplored route for improving the quality of sequence-to-expression models and facilitating their adoption in the biotechnology and pharmaceutical sectors.
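To make the contrast between representations concrete, here is a minimal sketch of the two kinds of features the abstract distinguishes: a mechanism-agnostic one-hot encoding versus a simple mechanistic scalar feature (GC content). This is an illustrative example only; the paper's actual feature pipeline and mRNA-stability or codon-bias computations are not reproduced here.

```python
import numpy as np

def one_hot(seq):
    """Mechanism-agnostic representation: a 4 x L matrix with one
    row per nucleotide (A, C, G, T) and one column per position."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        mat[alphabet.index(base), i] = 1.0
    return mat

def gc_content(seq):
    """Mechanistic scalar feature: fraction of G/C nucleotides,
    one of many sequence-level descriptors (alongside codon bias,
    mRNA stability, etc.)."""
    return sum(base in "GC" for base in seq) / len(seq)

seq = "ATGCGC"
print(one_hot(seq).shape)  # (4, 6)
print(gc_content(seq))     # 4 of 6 bases are G/C -> ~0.667
```

One-hot matrices preserve full positional detail (good for local prediction), while scalar mechanistic features discard position but capture global biochemical properties that may transfer better out of domain.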
Parallel reporter assays provide rich data to decipher gene regulatory regions with deep learning. Here we introduce LegNet, a convolutional network architecture that secured the first place for our autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. To construct LegNet, we drew inspiration from EfficientNetV2 and reformulated the sequence-to-expression regression problem as a soft-classification task. Here, with published data, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of sequence alterations, such as single-nucleotide variants.
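The soft-classification reformulation mentioned above can be sketched as follows: a scalar expression measurement is spread over the two nearest expression bins, weighted by proximity, so the network predicts a distribution rather than a single value. The bin count (18) and the linear-interpolation scheme here are illustrative assumptions, not LegNet's exact binning.

```python
import numpy as np

def soft_targets(value, n_bins=18, lo=0.0, hi=17.0):
    """Convert a scalar expression value into a soft label vector:
    probability mass is split between the two adjacent bins in
    proportion to how close the value lies to each."""
    # Map the value to a fractional bin position, clipped to range.
    pos = np.clip((value - lo) / (hi - lo) * (n_bins - 1), 0, n_bins - 1)
    lower = int(np.floor(pos))
    upper = min(lower + 1, n_bins - 1)
    frac = pos - lower
    target = np.zeros(n_bins, dtype=np.float32)
    target[lower] += 1.0 - frac
    target[upper] += frac
    return target

print(soft_targets(4.25))  # mass 0.75 on bin 4, 0.25 on bin 5
```

Training against such soft labels with a cross-entropy-style loss gives the model graded supervision near bin boundaries, which is one way to recover regression-like behavior from a classification head.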