The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
statistical sequence analysis | residue-residue covariation | contact map prediction | maximum-entropy modeling
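The correlations between alignment columns mentioned in this abstract can be illustrated with a small sketch. It computes the mutual information between two columns of a multiple sequence alignment, which is the plain (local) correlation measure, not DCA itself: it cannot separate direct from indirect couplings. The function name and the toy alignments are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `msa` is a list of equal-length strings, one aligned sequence per entry.
    This is a raw correlation measure; it does not disentangle direct from
    indirect correlations the way DCA is described as doing.
    """
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)             # single-column counts
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)  # pair counts
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Toy alignments (made up): perfectly covarying vs. independent columns.
covarying = ["AC", "AC", "GT", "GT"]
independent = ["AC", "AT", "GC", "GT"]
print(mutual_information(covarying, 0, 1))
print(mutual_information(independent, 0, 1))
```

On real alignments, frequency estimates of this kind are usually combined with sequence reweighting and pseudocounts, which this sketch omits.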
Understanding the molecular determinants of specificity in protein-protein interaction is an outstanding challenge of postgenome biology. The availability of large protein databases generated from sequences of hundreds of bacterial genomes enables various statistical approaches to this problem. In this context covariance-based methods have been used to identify correlation between amino acid positions in interacting proteins. However, these methods have an important shortcoming, in that they cannot distinguish between directly and indirectly correlated residues. We developed a method that combines covariance analysis with global inference analysis, adopted from use in statistical physics. Applied to a set of >2,500 representatives of the bacterial two-component signal transduction system, the combination of covariance with global inference successfully and robustly identified residue pairs that are proximal in space without resorting to ad hoc tuning parameters, both for heterointeractions between sensor kinase (SK) and response regulator (RR) proteins and for homointeractions between RR proteins. The spectacular success of this approach illustrates the effectiveness of the global inference approach in identifying direct interaction based on sequence information alone. We expect this method to be applicable soon to interaction surfaces between proteins present in only 1 copy per genome as the number of sequenced genomes continues to expand. Use of this method could significantly increase the potential targets for therapeutic intervention, shed light on the mechanism of protein-protein interaction, and establish the foundation for the accurate prediction of interacting protein partners.
The large majority of cellular functions are executed and controlled by interacting proteins.
With up to several thousand types of proteins expressed in a typical bacterial cell at a given time, their concerted specific interactions regulate the interplay of biochemical processes that are the essence of life. Many protein interactions are transient, allowing proteins to mate with several partners or travel in cellular space to perform their functions. Understanding these transient interactions is one of the outstanding challenges of systems biology (reviewed in ref. 1). The characterization of the molecular details of the interface formed between known interacting proteins is a requirement for understanding the molecular determinants of protein-protein interaction, the knowledge of which may be important for a variety of applications including synthetic biology, e.g., designing new specific interaction between proteins (reviewed in ref. 2), and pharmaceutics, e.g., protein interaction surfaces as drug targets (reviewed in ref. 3). Experimental approaches to identify surfaces of interaction between proteins such as surface-scanning mutagenesis and cocrystal structure generation are arduous and/or serendipitous. Cocrystal structures provide the best molecular resolution but are particularly challenging to obtain for transient interactio...
Spatially proximate amino acids in a protein tend to coevolve. A protein's three-dimensional (3D) structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.
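As a rough illustration of the pseudolikelihood idea named in this abstract, the sketch below evaluates the negative log-pseudolikelihood of an alignment under a q-state Potts model with fields `h` and couplings `J`. It assumes residues are already encoded as integers in 0..q-1 and omits regularization, sequence reweighting, the optimization step, and the modified coupling score mentioned above; all function and variable names are illustrative and are not taken from the authors' code.

```python
import math


def site_log_conditional(seq, r, h, J, q):
    """log P(seq[r] | all other positions of seq) under a Potts model.

    h[r][a] is the field for state a at position r; J[r][j][a][b] is the
    coupling between state a at position r and state b at position j.
    """
    scores = []
    for a in range(q):
        s = h[r][a]
        for j, b in enumerate(seq):
            if j != r:
                s += J[r][j][a][b]
        scores.append(s)
    # log-sum-exp over the q candidate states, shifted for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[seq[r]] - log_z


def neg_log_pseudolikelihood(msa, h, J, q):
    """Average over sequences of the summed per-site negative log conditionals."""
    total = 0.0
    for seq in msa:
        for r in range(len(seq)):
            total -= site_log_conditional(seq, r, h, J, q)
    return total / len(msa)


# Toy usage with all parameters set to zero (made-up data, 21 states).
q, length = 21, 3
h = [[0.0] * q for _ in range(length)]
J = [[[[0.0] * q for _ in range(q)] for _ in range(length)] for _ in range(length)]
msa = [[0, 1, 2], [3, 4, 5]]
print(neg_log_pseudolikelihood(msa, h, J, q))  # equals length * log(q) here
```

The appeal of this objective is that each site's normalization runs over only q states, so the intractable partition function of the full model never has to be computed.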
We study the small-world networks recently introduced by Watts and Strogatz [Nature 393, 440 (1998)], using analytical as well as numerical tools. We characterize the geometrical properties resulting from the coexistence of a local structure and random long-range connections, and we examine their evolution with size and disorder strength. We show that any finite value of the disorder is able to trigger a "small-world" behaviour as soon as the initial lattice is big enough, and study the crossover between a regular lattice and a "small-world" one. These results are corroborated by the investigation of an Ising model defined on the network, showing for every finite disorder fraction a crossover from a high-temperature region dominated by the underlying one-dimensional structure to a mean-field like low-temperature region. In particular there exists a finite-temperature ferromagnetic phase transition as soon as the disorder strength is finite.
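The interplay of local structure and random long-range links described here can be explored numerically with a short sketch. It builds a ring lattice, rewires each lattice edge with probability `p` (one common variant of the Watts-Strogatz construction, not necessarily the exact one analysed in the paper), and measures the average shortest-path length by breadth-first search. Function names and parameter values are assumptions made for illustration.

```python
import random
from collections import deque


def watts_strogatz(n, k, p, seed=0):
    """Ring of n nodes, each linked to its k nearest neighbours on each side;
    every lattice edge (i, i+d) is rewired to a random new endpoint with
    probability p, avoiding self-loops and duplicate edges."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            if rng.random() < p and j in adj[i]:
                candidates = [t for t in range(n) if t != i and t not in adj[i]]
                if candidates:
                    t = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(t)
                    adj[t].add(i)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs that are connected."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


for p in (0.0, 0.01, 0.1):
    graph = watts_strogatz(n=200, k=2, p=p, seed=1)
    print(p, average_path_length(graph))
```

Repeating the loop for increasing `n` at fixed `p` is one way to look at the size-dependent crossover discussed in the abstract.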
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e., evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.
In this Letter we study the NP-complete vertex cover problem on finite connectivity random graphs. When the allowed size of the cover set is decreased, a discontinuous transition in solvability and typical-case complexity occurs. This transition is characterized by means of exact numerical simulations as well as by analytical replica calculations. The replica symmetric phase diagram is in excellent agreement with numerical findings up to average connectivity e, where replica symmetry becomes locally unstable.
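To make the decision problem in this abstract concrete, the sketch below draws a random graph with a chosen average connectivity `c` and finds a minimum vertex cover by exhaustive search. This is only a brute-force illustration for very small graphs; it is not the exact algorithm or the replica calculation used in the Letter, and the function names are hypothetical.

```python
import random
from itertools import combinations


def random_graph(n, c, seed=0):
    """Edge list of a random graph in which each of the n*(n-1)/2 possible
    edges is present independently with probability c/(n-1), so that the
    expected degree (average connectivity) is c."""
    rng = random.Random(seed)
    p = c / (n - 1)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def min_vertex_cover(n, edges):
    """Smallest vertex cover by exhaustive search (exponential time)."""
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_vertex_cover(edges, subset):
                return set(subset)


edges = random_graph(n=14, c=2.0, seed=1)
cover = min_vertex_cover(14, edges)
print(len(edges), len(cover), is_vertex_cover(edges, cover))
```

Asking whether a cover of at most a given size exists, as the allowed size is lowered, is the solvability question whose transition the abstract describes; the search above answers it only for graphs small enough to enumerate.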
We introduce a theoretical framework that exploits the ever-increasing genomic sequence information for protein structure prediction. Structure-based models are modified to incorporate constraints by a large number of non-local contacts estimated from direct coupling analysis (DCA) of co-evolving genomic sequences. A simple hybrid method, called DCA-fold, integrating DCA contacts with an accurate knowledge of local information (e.g., the local secondary structure) is sufficient to fold proteins in the range of 1-3 Å resolution.
protein folding | residue contact prediction | contact map estimation | residue-residue coevolution | statistical potentials
Proteins are heteropolymers of amino acids that adopt specific 3D structures to perform designated biological tasks. Enormous experimental efforts have been invested to determine a large number of protein structures. Currently, computational structure prediction methods are reasonably successful in describing interactions among residues close (local) in sequence. Given the limited information for residues that are distant in sequence, success in large-scale structure prediction has depended crucially on known structural motifs available in protein databases. In cases where similarity to proteins of known structures exists, methods like fold recognition and homology modeling (1-3) have been shown as successful and effective, according to the Critical Assessment of Techniques for Protein Structure Prediction (4). Nevertheless, the accuracy of these methods is still in many cases far from the resolution needed to explore protein functions.
Here we introduce a new computational approach that exploits information from the rapidly growing genomic sequences to complement the currently limited structural databases. Over the years, a variety of methods has been used to study co-evolution in protein sequences and estimation of residue contacts with mixed success (5-11).
Recently, methods based on direct coupling analysis (DCA) (12) were shown to predict 50-300 non-local contacts to 70-80% accuracy for a variety of protein domains (13). DCA is based purely on protein sequence information. It uses covariance in homologous protein sequences as an input and deduces a direct interaction between residues (12). Those with strong direct interactions are shown to be related to structurally conserved residue-residue contacts in the protein fold (12, 13). As the contacts predicted by DCA recapitulate major features of the native contact maps, we developed a simple hybrid method integrating DCA contacts and detailed local information, to fold proteins of up to about 200 amino acids to within 3 Å of the native structures.
Our methodology is guided by the energy landscape theory (14), which asserts that in a minimally frustrated, funnel-like energy landscape, native contacts are on average favorable and dominant over non-favorable, non-native ones. This drives proteins smoothly toward their native states. Folding simulations, using native contacts in structure-based models (SBM), have been...