Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, more rare geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence-a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure.tertiary motif | structural degeneracy | protein structural universe | sequence-structure relationships | structural modularity I n this work, we aim to decompose the protein structure space into its basic elements as a way of understanding its design principles and describing its limits. Reductionist representations of protein structure have been of long-standing interest (1), with many studies having shown degeneracy at various structural levels (2-5). Features ranging from backbone or side-chain dihedral angles (6, 7) to domains and folds (8, 9) have been classified, and structural basins of attraction have been found in select motifs (10-17), offering a glimpse of a modular space with frequently repeating elements. The reason behind this modularity appears to be a combination of evolutionary history and the fundamental physics of structure. In particular, degeneracy at the level of domains, folds, and functional modules is likely strongly influenced by evolution, whereby such elements recur in different proteins often because of a common ancestor (18,19). On the other hand, statistics of more detailed structural features are better explained from the thermodynamic perspective. For example, observed Ramachandran backbone dihedral angle preferences are largely determined...
The Protein Data Bank (PDB) has been an integral resource for shaping our fundamental understanding of protein structure and for the advancement of such applications as protein design and structure prediction. Over the years, information from the PDB has been used to generate models ranging from specific structural mechanisms to general statistical potentials. With accumulating structural data, it has become possible to mine for more complete and complex structural observations, deducing more accurate generalizations. Motif libraries, which capture recurring structural features along with their sequence preferences, have exposed modularity in the structural universe and found successful application in various problems of structural biology. Here we summarize recent achievements in this arena, focusing on sub-domain level structural patterns and their applications to protein design and structure prediction, and suggest promising future directions as the structural database continues to grow.
Engineered proteins generally must possess a stable structure in order to achieve their designed function. Stable designs, however, are astronomically rare within the space of all possible amino acid sequences. As a consequence, many designs must be tested computationally and experimentally in order to find stable ones, which is expensive in terms of time and resources. Here we use a high-throughput, low-fidelity assay to experimentally evaluate the stability of approximately 200,000 novel proteins. These include a wide range of sequence perturbations, providing a baseline for future work in the field. We build a neural network model that predicts protein stability given only sequences of amino acids, and compare its performance to the assayed values. We also report another network model that is able to generate the amino acid sequences of novel stable proteins given requested secondary sequences. Finally, we show that the predictive model—despite weaknesses including a noisy data set—can be used to substantially increase the stability of both expert-designed and model-generated proteins.
Engineered proteins generally must possess a stable structure in order to achieve their designed function. Stable designs, however, are astronomically rare within the space of all possible amino acid sequences. As a consequence, many designs must be tested computationally and experimentally in order to find stable ones, which is expensive in terms of time and resources. Here we report a neural network model that predicts protein stability based only on sequences of amino acids, and demonstrate its performance by evaluating the stability of almost 200,000 novel proteins. These include a wide range of sequence perturbations, providing a baseline for future work in the field. We also report a second neural network model that is able to generate novel stable proteins. Finally, we show that the predictive model can be used to substantially increase the stability of both expert-designed and model-generated proteins.
MicroRNAs coordinate networks of mRNAs, but predicting specific sites of interactions is complicated by the very few bases of complementarity needed for regulation. Although efforts to characterize the specific requirements for microRNA (miR) regulation have made some advances, no general model of target recognition has been widely accepted. In this work, we describe an entirely novel approach to miR target identification. The genomic events responsible for the creation of individual miR loci have now been described with many miRs now known to have been initially formed from transposable element (TE) sequences. In light of this, we propose that limiting miR target searches to transcripts containing a miR’s progenitor TE can facilitate accurate target identification. In this report we outline the methodology behind OrbId (Origin-based identification of microRNA targets). In stark contrast to the principal miR target algorithms (which rely heavily on target site conservation across species and are therefore most effective at predicting targets for older miRs), we find OrbId is particularly efficacious at predicting the mRNA targets of miRs formed more recently in evolutionary time. After defining the TE origins of > 200 human miRs, OrbId successfully generated likely target sets for 191 predominately primate-specific human miR loci. While only a handful of the loci examined were well enough conserved to have been previously evaluated by existing algorithms, we find ~80% of the targets for the oldest miR (miR-28) in our analysis contained within the principal Diana and TargetScan prediction sets. More importantly, four of the 15 OrbId miR-28 putative targets have been previously verified experimentally. In light of OrbId proving best-suited for predicting targets for more recently formed miRs, we suggest OrbId makes a logical complement to existing, conservation based, miR target algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.