Optimal trade-off control in machine learning-based library design, with application to adeno-associated virus (AAV) for gene therapy

Zhu, Danqing; Brookes, David H.; Busia, Akosua; Carneiro, A.; Fannjiang, Clara; Попова, Галина; Shin, David; Donohue, Kevin C.; Chang, Edward F.; Nowakowski, Tomasz J.; Schaffer, David V.

doi:10.1101/2021.11.02.467003

Cited by 15 publications

(39 citation statements)

References 51 publications

(108 reference statements)

“…The design problem is a unique setting in which we have control over the data-dependent test input distribution, P_{X;D}, since we choose the procedure used to design an input. In the simplest case, some design procedures sample from a distribution whose form is explicitly chosen, such as an energy-based model whose energy function is proportional to a trained regression model's predictions [10], or whose parameters are set by solving an optimization problem (e.g., to train a generative model) [50,29,12,17,53,70,24,55,74]. In either setting, we know the exact form of the test input distribution, which also absolves the need for density estimation.…”

Section: Algorithm 1 Pseudocode For Approximately Computing (mentioning)

confidence: 99%

“…The training input distribution, P_X, is also often explicitly known. In protein design problems, for example, training sequences are often generated by introducing random substitutions to a single wild type sequence [12,10,14], by recombining segments of several "parent" sequences [35,52,9,22], or by independently sampling the amino acid at each position from a known distribution [74,67]. Conveniently, we can then compute the weights in Eq.…”

Section: Algorithm 1 Pseudocode For Approximately Computing (mentioning)

confidence: 99%
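
The statement above notes that the training input distribution is often known in closed form, for example when each position is sampled independently. As a minimal sketch of what that buys, the Python below evaluates the log-probability of a sequence under a site-independent distribution. The function names and the toy probabilities are hypothetical, and treating the truncated "weights" as a test-to-training density ratio is an assumption here, not something the quoted text confirms.

```python
import math


def site_independent_logprob(seq, site_probs):
    """Log-probability of seq when each position is drawn independently.

    site_probs[i] maps residue -> probability at position i (toy values here).
    """
    if len(seq) != len(site_probs):
        raise ValueError("sequence length must match number of sites")
    total = 0.0
    for residue, probs in zip(seq, site_probs):
        p = probs.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")  # residue impossible at this site
        total += math.log(p)
    return total


def log_density_ratio(seq, test_logprob, train_site_probs):
    """ASSUMPTION: weight = test density / training density, in log space.

    test_logprob is any callable returning the log-density of the design
    distribution; it is a placeholder, not an API from the cited work.
    """
    return test_logprob(seq) - site_independent_logprob(seq, train_site_probs)


# Toy two-site training distribution (illustrative numbers only).
train_probs = [{"A": 0.5, "C": 0.5}, {"A": 0.25, "C": 0.75}]
```

Because both densities are available exactly in this toy setting, no density estimation step is needed, which is the convenience the quoted statements point to.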

“…Recently, efforts have been made to augment such approaches with machine learning-based strategies; see reviews by Yang et al [72], Sinai and Kelsic [56], Hie and Yang [26], and Wu et al [71] and references therein. For example, one might train a regression model on protein sequences with experimentally measured fitnesses, then use an optimization algorithm or fit a generative model that leverages that regression model to propose promising new proteins [18,12,52,9,69,5,10,36,22,68,74]. Special attention has been given to the single-shot case, where the goal is to design fitter proteins given just a single batch of training data, due to its obvious practical convenience.…”

Section: Experiments With Protein Design (mentioning)

confidence: 99%

“…In the experiments presented here, our goal will be as follows. Given training data consisting of protein sequences labeled with experimental measurements of their fitnesses, we will fit a regression model, then sample test sequences (representing designed proteins) according to design procedures used in recent work [10,74] (Fig. 2).…”

Section: Experiments With Protein Design (mentioning)

confidence: 99%

“…Sampling designed sequences. Following ideas in [10,74], we design a protein by sampling from sequence distribution whose log-likelihood is proportional to the prediction of the regression model:…”

Section: Protocol For Design Experiments (mentioning)

confidence: 99%
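
To make the sampling idea in the last statement concrete, here is a minimal Python sketch that samples from a distribution whose log-probability equals a scaled model prediction, up to a normalizing constant. The toy alphabet, the stand-in predictor, and the scale parameter `lam` are all assumptions for illustration. The sketch enumerates every sequence, which is only feasible for tiny sequence spaces and is not claimed to be the procedure used in the cited work.

```python
import itertools
import math
import random

ALPHABET = "ACDE"  # toy alphabet (assumption); real proteins use 20 amino acids


def toy_predictor(seq):
    """Hypothetical stand-in for a trained regression model: counts 'A's."""
    return float(seq.count("A"))


def design_distribution(predict, length, lam):
    """Return (sequences, probs) with log p(x) = lam * predict(x) - log Z."""
    seqs = ["".join(s) for s in itertools.product(ALPHABET, repeat=length)]
    scores = [lam * predict(s) for s in seqs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(sc - top) for sc in scores]
    z = sum(weights)
    return seqs, [w / z for w in weights]


def sample_designs(predict, length, lam, n, seed=0):
    """Draw n designed sequences from the distribution above."""
    seqs, probs = design_distribution(predict, length, lam)
    rng = random.Random(seed)
    return rng.choices(seqs, weights=probs, k=n)


designs = sample_designs(toy_predictor, length=3, lam=1.0, n=5)
```

Larger values of `lam` concentrate the samples on sequences with high predicted fitness, while `lam = 0` recovers uniform sampling over the sequence space.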

Conformal prediction for the design problem

Fannjiang, Bates, Angelopoulos et al. 2022. Preprint. Self Cite.

In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next. For example, in the protein design problem, we have a regression model that predicts some real-valued property of a protein sequence, which we use to propose new sequences believed to exhibit higher property values than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to know how much we can trust the model's predictions. In such settings, however, there is a distinct type of distribution shift between the training and test data: one where the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data, that is, the designed sequences, has some non-trivial relationship with its error on the training data. Herein, we introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins using several real data sets.

Materiomically Designed Polymeric Vehicles for Nucleic Acids: Quo Vadis?

Kumar 2022. ACS Appl. Bio Mater.

Despite rapid advances in molecular biology, particularly in site-specific genome editing technologies, such as CRISPR/Cas9 and base editing, financial and logistical challenges hinder a broad population from accessing and benefiting from gene therapy. To improve the affordability and scalability of gene therapy, we need to deploy chemically defined, economical, and scalable materials, such as synthetic polymers. For polymers to deliver nucleic acids efficaciously to targeted cells, they must optimally combine design attributes, such as architecture, length, composition, spatial distribution of monomers, basicity, hydrophilic−hydrophobic phase balance, or protonation degree. Designing polymeric vectors for specific nucleic acid payloads is a multivariate optimization problem wherein even minuscule deviations from the optimum are poorly tolerated. To explore the multivariate polymer design space rapidly, efficiently, and fruitfully, we must integrate parallelized polymer synthesis, high-throughput biological screening, and statistical modeling. Although materiomics approaches promise to streamline polymeric vector development, several methodological ambiguities must be resolved. For instance, establishing a flexible polymer ontology that accommodates recent synthetic advances, enforcing uniform polymer characterization and data reporting standards, and implementing multiplexed in vitro and in vivo screening studies require considerable planning, coordination, and effort. This contribution will acquaint readers with the challenges associated with materiomics approaches to polymeric gene delivery and offer guidelines for overcoming these challenges.
Here, we summarize recent developments in combinatorial polymer synthesis, high-throughput screening of polymeric vectors, omics-based approaches to polymer design, barcoding schemes for pooled in vitro and in vivo screening, and identify materiomics-inspired research directions that will realize the long-unfulfilled clinical potential of polymeric carriers in gene therapy.

DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering

Yang, Ducharme, Johnston et al. 2023. ACS Synth. Biol.

With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to design libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with the potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software, DeCOIL can be readily implemented to generate desired informed libraries.
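
For readers unfamiliar with degenerate codons, the following Python sketch shows what a single degenerate codon encodes: it expands IUPAC ambiguity codes into concrete codons and tallies the amino acids they translate to under the standard genetic code. The function names are hypothetical, and this is only an illustration of the library-encoding idea, not the DeCOIL optimization itself.

```python
import itertools
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_codon(degenerate_codon):
    """List every concrete codon matched by a 3-letter degenerate codon."""
    choices = (IUPAC[base] for base in degenerate_codon.upper())
    return ["".join(c) for c in itertools.product(*choices)]


def amino_acid_counts(degenerate_codon):
    """Count how many concrete codons encode each amino acid (or stop)."""
    return Counter(CODON_TABLE[c] for c in expand_codon(degenerate_codon))


nnk = amino_acid_counts("NNK")
```

For example, `NNK` expands to 32 codons covering all 20 amino acids plus one stop codon, with uneven multiplicities (leucine is encoded three times, methionine once). That unevenness is one reason the choice of degenerate codons shapes which variants a library samples.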
