Abstract:Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essent… Show more
“…This is because our aim here is to quantitatively understand the bases of prediction of interaction partners thanks to contacts. Correlations arising in protein sequences due to their common evolutionary history [34][35][36][37][38][39][40] can further contribute to the success of DCA-based approaches at predicting protein-protein interactions from natural protein sequences [41,42], while they obscure the identification of contacts [6,10,36]. Besides, generalizing the present model to Potts spins and including local fields to tune residue conservation would make it more similar to the models inferred by DCA from natural protein sequences.…”
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have allowed to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and the quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins, and inter-block couplings protein-protein interactions; controlled synthetic sequence data are generated by Monte-Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm (IPA) allows to make predictions even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
“…This is because our aim here is to quantitatively understand the bases of prediction of interaction partners thanks to contacts. Correlations arising in protein sequences due to their common evolutionary history [34][35][36][37][38][39][40] can further contribute to the success of DCA-based approaches at predicting protein-protein interactions from natural protein sequences [41,42], while they obscure the identification of contacts [6,10,36]. Besides, generalizing the present model to Potts spins and including local fields to tune residue conservation would make it more similar to the models inferred by DCA from natural protein sequences.…”
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have allowed to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and the quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins, and inter-block couplings protein-protein interactions; controlled synthetic sequence data are generated by Monte-Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm (IPA) allows to make predictions even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
“…However, distributed patterns may also be related to phylogenetic correlations, which are present in the data, cf. [46]. As has been shown recently in a heuristic way [47], the decomposition of sequence-data covariance matrices or couplings matrices into a sum of a sparse and a low-rank matrix can substantially improve contact prediction, if only the sparse matrix is used.…”
Statistical models for families of evolutionary related proteins have recently gained interest: in particular pairwise Potts models, as those inferred by the Direct-Coupling Analysis, have been able to extract information about the three-dimensional structure of folded proteins, and about the effect of amino-acid substitutions in proteins. These models are typically requested to reproduce the oneand two-point statistics of the amino-acid usage in a protein family, i.e. to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models achieving this. While being successful, these models depend on huge numbers of ad hoc introduced parameters, which have to be estimated from finite amount of data and whose biophysical interpretation remains unclear. Here we propose an approach to parameter reduction, which is based on selecting collective sequence motifs. It naturally leads to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to clusterize amino-acid sequences into functional sub-families. However, the distributed collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models in predicting contact maps, showing the necessity of developing models going beyond the Hopfield-Potts models discussed here.
“…However, 23 additional correlations arise in protein sequences due to their common evolutionary 24 history, i.e. phylogeny [23][24][25], even in the absence of structural or functional constraints. 25 These historical correlations could also be exploited to predict interacting partners, since 26 a pair of interacting proteins can have a more strongly shared phylogenetic history than 27 non-interacting proteins [26].…”
mentioning
confidence: 99%
“…phylogeny [23][24][25], even in the absence of structural or functional constraints. 25 These historical correlations could also be exploited to predict interacting partners, since 26 a pair of interacting proteins can have a more strongly shared phylogenetic history than 27 non-interacting proteins [26]. For instance, interacting proteins tend to have similar 28 evolutionary rates [27,28], which helps to identify them [26].…”
mentioning
confidence: 99%
“…In both panels, data was generated using a tree of 10 generations, with exactly 5 mutations per branch, out of 200 bits in each chain AB, thus yielding a total of 1024 chains AB. Random species with 4 chains AB each were constructed 25. of them were randomly selected to form a training set of 100 pairs AB employed to build the DCA model, and the rest constitutes the testing set, for which we calculated DCA pairing scores (Eq.…”
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have allowed to identify interaction partners among the paralogs of ubiquitous prokaryotic proteins families, starting from sequence data alone. Since DCA allows to infer the three-dimensional structure of protein complexes, its success in predicting protein-protein interactions could be mainly based on contacting residues coevolving to remain physicochemically complementary. However, interacting proteins often possess similar evolutionary histories, which also gives rise to correlations among their sequences. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involves phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that only share evolutionary history. It performs as well as methods explicitly based on sequence similarity, and even slightly better with large and accurate training sets. We further demonstrate the ability of these various methods to correctly predict pairings among actual paralogous proteins with genome proximity but no known direct physical interaction, which illustrates the importance of phylogenetic correlations in real data. However, for actually interacting and strongly coevolving proteins, DCA and mutual information outperform sequence similarity.
Author summaryMany biologically important protein-protein interactions are conserved over evolutionary time scales. This leads to two different signals that can be used to computationally predict interactions between protein families and to identify specific interaction partners. First, the shared evolutionary history leads to highly similar phylogenetic relationships between interacting proteins of the two families. Second, the need to keep the interaction surfaces of partner proteins biophysically compatible causes a correlated amino-acid usage of interface residues. Employing simulated data, we show that the shared history alone can be used to detect partner proteins. Similar accuracies are achieved by algorithms comparing phylogenetic relationships and by coevolutionary methods based on Direct Coupling Analysis, which are a priori designed to detect the second type of signal. Using real sequence data, we show that in cases with shared evolutionary but without known 1/28 physical interactions, both methods work with similar accuracy, while for physically interacting systems, methods based on correlated amino-acid usage outperform purely phylogenetic ones. 119 4/28 524 17/28
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.