2021
DOI: 10.1101/2021.12.26.474224
Preprint

Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics

Abstract: Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of act…

Cited by 9 publications (7 citation statements)
References 31 publications (51 reference statements)
“…Simulation allows the generation of synthetic complete-knowledge ground-truth datasets (i.e., datasets whose generation rules are known and therefore contain validated properties to be learned) containing desired levels of signal and noise that reflect experimental settings and biological mechanisms [58][59][60] . Simulated datasets have been used in methodological development and calibration before large-scale datasets become available, to disentangle machine learning hypotheses and to prioritize the design of future experiments 61,62 . For antibody-antigen binding prediction, simulations may help precisely and meaningfully define different real-world antibody-antigen binding problems, which requires levels of annotation that are not yet available in experimental data.…”
Section: Introduction
Mentioning confidence: 99%
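
The excerpt above describes complete-knowledge simulated datasets only in general terms. As a purely illustrative sketch (not code from the cited preprint), the snippet below generates a toy regulatory-sequence dataset in which the ground-truth rule is fully known: positive sequences contain a hypothetical motif (here an arbitrary AP-1-like 7-mer) embedded in a uniform random background, so any feature a model learns can be checked against the known generation rule. The motif, sequence length, and class balance are all assumptions chosen for illustration.

import random

ALPHABET = "ACGT"
MOTIF = "TGACTCA"  # hypothetical ground-truth motif; an assumption for illustration

def random_sequence(length):
    # Uniform i.i.d. background; real simulations often use genomic background models instead.
    return "".join(random.choice(ALPHABET) for _ in range(length))

def simulate_example(length=200, positive=True):
    # Returns (sequence, label); the embedded motif is the only signal by construction.
    seq = list(random_sequence(length))
    if positive:
        start = random.randrange(0, length - len(MOTIF) + 1)
        seq[start:start + len(MOTIF)] = MOTIF  # embed the known motif at a random position
    return "".join(seq), int(positive)

# Example: a balanced toy dataset of 1,000 labeled sequences with known ground truth.
dataset = [simulate_example(positive=(i % 2 == 0)) for i in range(1000)]

Varying how often the motif is inserted, or corrupting individual bases after insertion, gives direct control over the "desired levels of signal and noise" mentioned in the excerpt.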
“…Further development of protein structure/affinity prediction models that can predict protein-protein 156 and protein-DNA complexes 157 that form on DNA in a sequence-dependent manner will also enable us to encode priors for TF-TF interactions and other cis-regulatory parameters, increasing the models' predictive power. Since determining what the models have learned is critical to many applications, developing and benchmarking model interpretation frameworks will be important 158,159. Finally, having standardized datasets and modeling competitions (e.g.…”
Section: Moving Forwards
Mentioning confidence: 99%
“…Moreover, decision-makers often defer to algorithmic decision support systems [15] and struggle to use the algorithms effectively, often underperforming compared to both humans who are not assisted and the algorithms themselves [20]. Model-agnostic methods are not alone here, as Grad-CAM [31] has been shown to perform quite poorly on tasks it was specifically designed to excel at [27] under benchmark conditions that are not sufficiently realistic.…”
Section: Human Factors In Explanations
Mentioning confidence: 99%