2023
DOI: 10.1101/2023.04.20.537701
Preprint

Hold out the genome: A roadmap to solving the cis-regulatory code

Abstract: Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The “cis-regulatory code” - the rules that cells use to determine when, where, and how much genes should be expressed - has proven to be exceedingly complex, but recent advances in the scale and resolution of functional genomics assays and Machine Learning have enabled significant progress towards deciphering this code. However, we will likely never solve the cis-regulatory code if we restrict ourselv…


Cited by 8 publications (13 citation statements)
References 160 publications
“…This significantly increases the number of training examples that the model encounters. After training, the models are fine-tuned on the original genomic sequences to further improve performance [15].…”
Section: Phylogenetic Augmentation: A Method for Augmenting Genomic Data
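The two-stage scheme in the quote above (pre-train on an ortholog-augmented set, then fine-tune on the original genome) can be sketched as follows. This is a minimal illustration, not the cited paper's code; `train_model`, `fine_tune`, and the (sequence, label) data shape are all hypothetical names introduced here.

```python
import random

# Illustrative sketch of phylogenetic augmentation, assuming the workflow
# described in the quote: enlarge the training set with orthologous sequences
# from related species, then fine-tune on the original genomic sequences only.
# `train_model` and `fine_tune` are hypothetical callables, not a real API.
def phylogenetic_augmentation(genomic_pairs, ortholog_pairs, train_model, fine_tune):
    """genomic_pairs / ortholog_pairs: lists of (sequence, label) tuples."""
    # Stage 1: pre-train on the augmented set that includes orthologs.
    augmented = list(genomic_pairs) + list(ortholog_pairs)
    random.shuffle(augmented)
    model = train_model(augmented)
    # Stage 2: fine-tune on the original genomic sequences alone.
    return fine_tune(model, genomic_pairs)
```

The design choice is that orthologs only ever enter the first stage, so the final model's distribution matches the genome it will be evaluated on.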
“…A major challenge in the field is determining how to train more complex deep learning models for applications outside of the most data-rich systems. A proposed solution is to substantially increase data volume by performing assays on randomly generated synthetic sequences, and then evaluating models trained on these sequences using true genomic sequences [13,15]. The reasoning behind this approach is that the genome does not contain sufficient variation to learn all aspects of the cis-regulatory code.…”
Section: Introduction
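The synthetic-sequence strategy described above amounts to sampling random DNA for training while holding out true genomic sequences for evaluation. A minimal sketch, with illustrative (assumed) sequence length and count:

```python
import random

# Sketch of generating a random synthetic DNA training pool, as in the
# strategy quoted above. Lengths, counts, and the seeding scheme are
# illustrative assumptions, not values from the cited papers.
def random_dna(length: int, rng: random.Random) -> str:
    """Sample one uniform-random DNA sequence over the alphabet ACGT."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def make_training_pool(n_sequences: int, length: int = 80, seed: int = 0):
    """Build a reproducible pool of random sequences for assay/training."""
    rng = random.Random(seed)
    return [random_dna(length, rng) for _ in range(n_sequences)]
```

In this setup the genome itself is never trained on, which is exactly what makes genomic sequences a clean held-out test set.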
“…phylogenetically related or constrained promoters, and similar G/C gradient patterning across genes [73]). Experimental techniques such as MPRAs or oligonucleotide assembly offer the means to perturb regulatory element motif grammars and add sequence diversity [78, 17] to ensure that genomic AI models learn causal determinants of gene expression rather than simple sequence features or correlates. The tasks below benefit from substantial size and sequence diversity, though we note that because the experiments are conducted in yeast with exogenous sequences, the performance of models designed for the human genome will be affected by the distribution shift between human and yeast.…”
Section: Tasks
“…All T5 models and pretrained hgT5 models were finetuned on each task separately, with a batch size of 2^16 tokens and a learning rate of 1e-4 for 2^18 steps. We used the checkpoint corresponding to the best development set performance for testing.…”
Section: E T5 and hgT5 (Pre)training
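The hyperparameters quoted above imply a fixed fine-tuning token budget, which is worth making explicit. A small sketch, using only the values stated in the quote (the variable names are ours, not from any released training script):

```python
# Fine-tuning budget implied by the quoted hyperparameters:
# batch size of 2^16 tokens, run for 2^18 optimization steps.
batch_tokens = 2 ** 16      # tokens processed per step (65,536)
num_steps = 2 ** 18         # total optimization steps (262,144)
learning_rate = 1e-4        # constant learning rate from the quote

# Total tokens seen during fine-tuning: 2^16 * 2^18 = 2^34 (~1.7e10 tokens).
total_tokens = batch_tokens * num_steps
```

Expressing the budget as a power of two (2^34 tokens) makes it easy to compare against other fine-tuning runs that report token counts rather than steps.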
“…Such higher-order interactions may be responsible for the functional non-equivalence of identical instances of TF binding sites. However, because the number of possible arrangements of binding sites within CREs grows exponentially with the number of sites, a major obstacle to learning the effects of higher-order interactions is that the number of genomic examples of active CREs in any particular cell type is small relative to the scale of the training data needed to learn the interactions among binding sites (42,43). As a consequence, current deep learning models of cis-regulatory activity typically uncover TF motifs with large independent effects on gene expression, which tend to be the same motifs identified by traditional motif-finding algorithms.…”
Section: Introduction
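The exponential growth the quote describes can be made concrete with a back-of-envelope count. Under one simple (assumed) model of a CRE as an ordered set of slots, each holding one of n motifs or nothing, the number of arrangements is (n + 1)^slots:

```python
# Back-of-envelope sketch of the combinatorial argument in the quote above.
# Modeling assumption (ours, not the cited paper's): a CRE has `slots`
# ordered positions, each of which holds any one of `n_motifs` TF motifs
# or is empty, giving (n_motifs + 1) ** slots arrangements.
def n_arrangements(n_motifs: int, slots: int) -> int:
    return (n_motifs + 1) ** slots
```

Even 10 motifs over 8 slots yields over 2 x 10^8 arrangements, far more than the number of active CREs observed in any one cell type, which is the data-scarcity obstacle the quote points to.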