2021
DOI: 10.1101/2021.12.26.474224
Preprint

Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics

Abstract: Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of act…

Cited by 9 publications (7 citation statements)
References 31 publications (51 reference statements)
“…Simulation allows the generation of synthetic complete-knowledge ground-truth datasets (i.e., datasets whose generation rules are known and therefore contain validated properties to be learned) containing desired levels of signal and noise that reflect experimental settings and biological mechanisms [58][59][60] . Simulated datasets have been used in methodological development and calibration before large-scale datasets become available, to disentangle machine learning hypotheses and to prioritize the design of future experiments 61,62 . For antibody-antigen binding prediction, simulations may help precisely and meaningfully define different real-world antibody-antigen binding problems, which requires levels of annotation that are not yet available in experimental data.…”
Section: Introduction
Mentioning confidence: 99%
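
The excerpt above describes complete-knowledge simulated datasets only in general terms. As a purely illustrative sketch (not code from the cited preprint), the snippet below generates a toy regulatory-sequence dataset in which the ground-truth rule is fully known: positive sequences contain a hypothetical motif (here an arbitrary AP-1-like 7-mer) embedded in a uniform random background, so any feature a model learns can be checked against the known generation rule. The motif, sequence length, and class balance are all assumptions chosen for illustration.

import random

ALPHABET = "ACGT"
MOTIF = "TGACTCA"  # hypothetical ground-truth motif; an assumption for illustration

def random_sequence(length):
    # Uniform i.i.d. background; real simulations often use genomic background models instead.
    return "".join(random.choice(ALPHABET) for _ in range(length))

def simulate_example(length=200, positive=True):
    # Returns (sequence, label); the embedded motif is the only signal by construction.
    seq = list(random_sequence(length))
    if positive:
        start = random.randrange(0, length - len(MOTIF) + 1)
        seq[start:start + len(MOTIF)] = MOTIF  # embed the known motif at a random position
    return "".join(seq), int(positive)

# Example: a balanced toy dataset of 1,000 labeled sequences with known ground truth.
dataset = [simulate_example(positive=(i % 2 == 0)) for i in range(1000)]

Varying how often the motif is inserted, or corrupting individual bases after insertion, gives direct control over the "desired levels of signal and noise" mentioned in the excerpt.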
“…Further development of protein structure/affinity prediction models that can predict protein-protein 156 and protein-DNA complexes 157 that form on DNA in a sequence-dependent manner will also enable us to encode priors for TF-TF interactions and other cis-regulatory parameters, increasing the models' predictive power. Since determining what the models have learned is critical to many applications, developing and benchmarking model interpretation frameworks will be important 158,159. Finally, having standardized datasets and modeling competitions (e.g.…”
Section: Moving Forwards
Mentioning confidence: 99%
“…Moreover, decision-makers often defer to algorithmic decision support systems [15] and struggle to use the algorithms effectively, often underperforming compared to both humans who are not assisted and the algorithms themselves [20]. Model-agnostic methods are not alone here, as Grad-CAM [31] has been shown to perform quite poorly on tasks it was specifically designed to excel at [27] under benchmark conditions that are not sufficiently realistic.…”
Section: Human Factors In Explanations
Mentioning confidence: 99%