There are more amino acid permutations within a 40-residue sequence than atoms on Earth. This vast chemical search space hinders the use of human learning to design functional polymers. Here we show how machine learning enables de novo design of abiotic nuclear-targeting miniproteins to traffic antisense oligomers to the nucleus of cells. We combined high-throughput experimentation with a directed evolution-inspired deep learning approach in which the molecular structures of natural and unnatural residues are represented as topological fingerprints. The model is able to predict activities beyond the training dataset, and simultaneously deciphers and visualizes sequence-activity predictions. The predicted miniproteins, termed "Mach", reach 10 kDa average mass, are more effective than any previously known variant in cells, and can also deliver proteins into the cytosol. The Mach miniproteins are nontoxic and efficiently deliver antisense cargo in mice. These results demonstrate that deep learning can decipher design principles to generate highly active biomolecules that are unlikely to be discovered by empirical approaches.
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymer topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed a chemistry-informed graph representation of macromolecules that enables quantification of structural similarity and interpretable supervised learning for macromolecules. Our work enables quantitative chemistry-informed decision-making and iterative design in the macromolecular chemical space.
The chemical synthesis of polypeptides involves stepwise formation of amide bonds on an immobilized solid support. The high yields required for efficient incorporation of each individual amino acid in the growing chain are often impacted by sequence-dependent events such as aggregation. Here, we apply deep learning over ultraviolet–visible (UV–vis) analytical data collected from 35 427 individual fluorenylmethyloxycarbonyl (Fmoc) deprotection reactions performed with an automated fast-flow peptide synthesizer. The integral, height, and width of these time-resolved UV–vis deprotection traces indirectly allow for analysis of the iterative amide coupling cycles on resin. The computational model maps structural representations of amino acids and peptide sequences to experimental synthesis parameters and predicts the outcome of deprotection reactions with less than 6% error. Our deep-learning approach enables experimentally aware computational design for prediction of Fmoc deprotection efficiency and minimization of aggregation events, building the foundation for real-time optimization of peptide synthesis in flow.
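As a rough illustration of the three trace features named in the abstract above (integral, height, and width), the sketch below computes them for a single peak. The abstract does not say how the features were extracted, so everything here is an assumption: the function name `trace_features` is hypothetical, the integral uses the trapezoidal rule, and the width is taken as the full width at half maximum of one baseline-corrected peak.

```python
from typing import List, Sequence, Tuple


def trace_features(
    times: Sequence[float], absorbance: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (integral, height, width) of one deprotection peak.

    Assumes a single, baseline-corrected peak sampled at increasing times.
    This is an illustrative choice, not the method used in the paper.
    """
    t: List[float] = list(times)
    a: List[float] = list(absorbance)
    if len(t) != len(a) or len(t) < 2:
        raise ValueError("need two equally long series with at least 2 points")

    # Integral: trapezoidal rule over the sampled trace.
    integral = sum(
        0.5 * (a[i] + a[i + 1]) * (t[i + 1] - t[i]) for i in range(len(t) - 1)
    )

    # Height: maximum absorbance of the peak.
    height = max(a)
    if height <= 0:
        return integral, height, 0.0

    # Width: full width at half maximum, interpolating linearly
    # between the samples that bracket each half-height crossing.
    half = height / 2.0
    peak = a.index(height)

    i = peak
    while i > 0 and a[i - 1] >= half:
        i -= 1
    if i == 0:
        left = t[0]  # trace never drops below half height on the left
    else:
        left = t[i - 1] + (half - a[i - 1]) * (t[i] - t[i - 1]) / (a[i] - a[i - 1])

    j = peak
    while j < len(t) - 1 and a[j + 1] >= half:
        j += 1
    if j == len(t) - 1:
        right = t[-1]  # trace never drops below half height on the right
    else:
        right = t[j] + (half - a[j]) * (t[j + 1] - t[j]) / (a[j + 1] - a[j])

    return integral, height, right - left


# A symmetric triangular peak: area 4, height 2, half-height width 2.
print(trace_features([0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 1.0, 0.0]))  # (4.0, 2.0, 2.0)
```

A broader, flatter peak yields a larger width and smaller height for a comparable integral. Features of this kind could in principle flag changes in peak shape from one synthesis cycle to the next.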
CONSPECTUS: Designing new materials is vital for addressing pressing societal challenges in health, energy, and sustainability. The combination of physicochemical laws and empirical trial and error has long guided material design, but this approach is limited by the cost of experiments and the difficulty of deriving complex guiding principles. The space of hypothetical materials to be considered is incredibly large, and only a small fraction of possible compounds can ever be tested experimentally. The computational techniques of atomistic simulation and machine learning (ML) offer an avenue to rapidly invent new materials and navigate this enormous space. Together, they can be used to infer complex design principles and identify high-quality candidates more rapidly than trial-and-error experimentation. In this Account, we review our group's recent contributions to simulation and ML for materials design. We begin by discussing the numerical representation of materials for use in ML. Representations can be produced through deterministic algorithms, learnable encodings, or physics-based methods and lead to vector, graph, and matrix outputs. We describe how these different approaches offer distinct material- and application-specific advantages. We provide demonstrations from our own work on small-molecule drugs, macromolecules, dyes, electrolytes, and zeolites. In several cases, we show how the appropriate representation led to guiding principles that facilitated experimental materials design. Next, we highlight the development of ML methods for enhancing atomistic simulation. These advances help to improve simulation accuracy and expand the time and length scales that can be explored. They include differentiable atomistic simulations in which ensemble-averaged quantities are differentiated with respect to system parameters, and novel autoregressive methods for enhanced sampling of challenging physical distributions.
Other developments include learnable coarse-grained models, which can accelerate molecular dynamics while minimizing the loss of all-atom information, and ML interatomic potentials, which can be trained on maximally informative quantum chemistry data through active learning and adversarial uncertainty attacks. Next, we show how these combined computational advances have enabled high-throughput virtual screening. This has led to the discovery of low-cost organic structure-directing agents for zeolite synthesis, polymer electrolytes, and efficient photoswitches for targeted medicine. We conclude by discussing the limitations of ML and simulation. These include the large data requirements and limited chemical transferability of the former and the speed−accuracy trade-offs of the latter. We predict that advancements in quantum chemistry will further accelerate simulations, while the incorporation of physical principles will improve the reliability of ML.
Therapeutic macromolecules such as proteins and oligonucleotides can be highly efficacious but are often limited to extracellular targets due to the cell’s impermeable membrane. Cell-penetrating peptides (CPPs) are able to deliver such macromolecules into cells, but limited structure–activity relationships and inconsistent literature reports make it difficult to design effective CPPs for a given cargo. For example, polyarginine motifs are common in CPPs, promoting cell uptake at the expense of systemic toxicity. Machine learning may be able to address this challenge by bridging gaps between experimental data in order to discern sequence–activity relationships that evade our intuition. Our earlier data set and deep learning model led to the design of miniproteins (>40 amino acids) for antisense delivery. Here, we leveraged and expanded our model with data augmentation in the short CPP sequence space of the data set to extrapolate and discover short, low-arginine-content CPPs that would be easier to synthesize and amenable to rapid conjugation to desired cargo, and with minimal in vivo toxicity. The lead predicted peptide, termed P6, is as active as a polyarginine CPP for the delivery of an antisense oligomer, while having only one arginine side chain and 18 total residues. We determined the pentalysine motif and the C-terminal cysteine of P6 to be the main drivers of activity. The antisense conjugate was able to enhance corrective splicing in an animal model to produce functional eGFP in heart tissue in vivo while remaining nontoxic up to a dose of 60 mg/kg. In addition, P6 was able to deliver an enzyme to the cytosol of cells. Our findings suggest that, given a data set of long CPPs, we can discover by extrapolation short, active sequences that deliver antisense oligomers.
Chemical synthesis of polypeptides involves stepwise formation of amide bonds on an immobilized solid support. The high yields required for efficient incorporation of each individual amino acid in the growing chain are often impacted by sequence-dependent events such as aggregation. Here we apply deep learning over ultraviolet-visible (UV-Vis) analytical data collected from 35,485 individual fluorenylmethyloxycarbonyl (Fmoc) deprotection reactions performed with an automated fast-flow peptide synthesizer. The integral, height and width of these time-resolved UV-Vis deprotection traces indirectly allow for analysis of the iterative amide coupling cycles on resin. The computational model maps structural representations of amino acids and peptide sequences to experimental synthesis parameters and predicts the outcome of deprotection reactions with less than 4% error. Our deep learning approach enables experimentally-aware computational design for prediction of Fmoc deprotection efficiency and minimization of aggregation events, building the foundation for real-time optimization of peptide synthesis in flow.
Carbohydrate-binding proteins (lectins) play vital roles in cell recognition and signaling, including pathogen binding and innate immunity. Thus, targeting lectins, especially those on the surface of immune cells, could advance immunology and drug discovery. Lectins are typically oligomeric; therefore, many of the most potent ligands are multivalent. An effective strategy for lectin targeting is to display multiple copies of a single glycan epitope on a polymer backbone; however, a drawback to such multivalent ligands is that they cannot distinguish between lectins that share monosaccharide binding selectivity (e.g., mannose-binding lectins) as they often lack molecular precision. Here, we describe the development of an iterative exponential growth (IEG) synthetic strategy that enables facile access to synthetic glycomacromolecules with precisely defined and tunable sizes up to 22.5 kDa, compositions, topologies, and absolute configurations. Twelve discrete mannosylated “glyco-IEGmers” are synthesized and screened for binding to a panel of mannoside-binding immune lectins (DC-SIGN, DC-SIGNR, MBL, SP-D, langerin, dectin-2, mincle, and DEC-205). In many cases, the glyco-IEGmers had distinct length-, stereochemistry-, and topology-dependent lectin-binding preferences. To understand these differences, we used molecular dynamics and density functional theory simulations of octameric glyco-IEGmers, which revealed dramatic effects of glyco-IEGmer stereochemistry and topology on solution structure as well as an interplay between conformational diversity and chiral recognition in selective lectin binding. Ligand function also could be controlled by chemical substitution: by tuning the side chains of glyco-IEGmers that bind DC-SIGN, we could alter their cellular trafficking through alteration of their aggregation state.
These results highlight the power of precision synthetic oligomer/polymer synthesis for selective biological targeting, motivating the development of next-generation glycomacromolecules tailored for specific immunological or other therapeutic applications.