Abstract: Neural networks have proven to be an immensely powerful tool in predicting functional genomic regions, in particular with many recent successes in deciphering gene regulatory logic. However, how model architecture and training strategy choices affect model performance has not been systematically evaluated for genomics models. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding experimentally determined expression…
“…Finally, having standardized datasets and modeling competitions (e.g. DREAM Challenges) will facilitate the continued improvement of model efficiency and accuracy [160][161][162].…”
Gene expression is regulated by transcription factors (TFs) that work together to read cis-regulatory DNA sequences. The "cis-regulatory code" (how cells interpret DNA sequences to determine when, where, and how much genes should be expressed) has proven to be exceedingly complex 1,2. Recently, advances in the scale and resolution of functional genomics assays and machine learning (ML) have enabled significant progress towards deciphering this code 3-6. However, the cis-regulatory code will likely never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. Since the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models 7,8. Here, we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of ML and massively parallel assays using synthetic DNA.
Contributions: CGD and JT conceptualized the paper. CGD produced the first draft, analyzed the data, and made the figures with advice from JT. CGD and JT edited the manuscript.
“…Several studies have embedded such models into algorithms for discovery of new variants using optimization methods [8] and techniques from generative models [14,15,16,17]. Although the current literature has a strong focus on improvements to model architectures that can deliver greater predictive power [18], with the size of sequence-to-expression datasets growing into thousands up to millions of variants, it is becoming increasingly clear that off-the-shelf deep learning architectures such as convolutional neural networks, recurrent neural networks or transformers can readily provide high predictive accuracy [19].…”
The increasing demand for biological products drives many efforts to engineer cells that produce heterologous proteins at maximal yield. Recent advances in massively parallel reporter assays can deliver data suitable for training machine learning models and support the design of microbial strains with optimized protein expression phenotypes. The best performing sequence-to-expression models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Despite their excellent local predictive power, however, such models suffer from a limited ability to generalize predictions far away from the training data. Here, we show that libraries of genetic constructs can have substantially different cluster structure depending on the chosen sequence representation, and demonstrate that such differences can be leveraged to improve generalization performance. Using a large sequence-to-expression dataset from Escherichia coli, we show that non-deep regressors and convolutional neural networks trained on one-hot encodings fail to generalize predictions, and that learned representations using state-of-the-art large language models also struggle with out-of-domain accuracy. In contrast, we show that despite their poorer local performance, mechanistic sequence features such as codon bias, nucleotide content or mRNA stability provide promising gains on model generalization. We explore several strategies to integrate different feature sets into a single predictive model, including feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work suggests that integration of domain-agnostic and domain-aware sequence features offers an unexplored route for improving the quality of sequence-to-expression models and facilitating their adoption in the biotechnology and pharmaceutical sectors.
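To make the contrast between representations concrete, here is a minimal sketch of the two kinds of features the abstract distinguishes: a mechanism-agnostic one-hot encoding versus a simple mechanistic scalar feature (GC content). This is an illustrative example only; the paper's actual feature pipeline and mRNA-stability or codon-bias computations are not reproduced here.

```python
import numpy as np

def one_hot(seq):
    """Mechanism-agnostic representation: a 4 x L matrix with one
    row per nucleotide (A, C, G, T) and one column per position."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        mat[alphabet.index(base), i] = 1.0
    return mat

def gc_content(seq):
    """Mechanistic scalar feature: fraction of G/C nucleotides,
    one of many sequence-level descriptors (alongside codon bias,
    mRNA stability, etc.)."""
    return sum(base in "GC" for base in seq) / len(seq)

seq = "ATGCGC"
print(one_hot(seq).shape)  # (4, 6)
print(gc_content(seq))     # 4 of 6 bases are G/C -> ~0.667
```

One-hot matrices preserve full positional detail (good for local prediction), while scalar mechanistic features discard position but capture global biochemical properties that may transfer better out of domain.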
Parallel reporter assays provide rich data to decipher gene regulatory regions with deep learning. Here we introduce LegNet, a convolutional network architecture that secured the first place for our autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. To construct LegNet, we drew inspiration from EfficientNetV2 and reformulated the sequence-to-expression regression problem as a soft-classification task. Here, with published data, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of sequence alterations, such as single-nucleotide variants.
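The soft-classification reformulation mentioned above can be sketched as follows: a scalar expression measurement is spread over the two nearest expression bins, weighted by proximity, so the network predicts a distribution rather than a single value. The bin count (18) and the linear-interpolation scheme here are illustrative assumptions, not LegNet's exact binning.

```python
import numpy as np

def soft_targets(value, n_bins=18, lo=0.0, hi=17.0):
    """Convert a scalar expression value into a soft label vector:
    probability mass is split between the two adjacent bins in
    proportion to how close the value lies to each."""
    # Map the value to a fractional bin position, clipped to range.
    pos = np.clip((value - lo) / (hi - lo) * (n_bins - 1), 0, n_bins - 1)
    lower = int(np.floor(pos))
    upper = min(lower + 1, n_bins - 1)
    frac = pos - lower
    target = np.zeros(n_bins, dtype=np.float32)
    target[lower] += 1.0 - frac
    target[upper] += frac
    return target

print(soft_targets(4.25))  # mass 0.75 on bin 4, 0.25 on bin 5
```

Training against such soft labels with a cross-entropy-style loss gives the model graded supervision near bin boundaries, which is one way to recover regression-like behavior from a classification head.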