Motivation: Understanding the mechanisms and structural mappings between molecules and pathway classes is critical for the design of reaction predictors for synthesizing new molecules. This article studies the problem of predicting the classes of metabolic pathways (series of chemical reactions occurring within a cell) in which a given biochemical compound participates. We apply a hybrid machine learning approach in which graph convolutional networks extract molecular shape features that serve as input to a random forest classifier. In contrast to previously applied machine learning methods for this problem, our framework automatically extracts relevant shape features directly from input SMILES representations, which are atom-bond specifications of the chemical structures composing the molecules. Results: Our method correctly predicts the metabolic pathway class of 95.16% of tested compounds, whereas competing methods achieve an accuracy of 84.92% or less. Furthermore, our framework extends to the classification of compounds having mixed membership in multiple pathway classes. Our prediction accuracy for this multi-label task is 97.61%. We analyze the relative importance of various global physicochemical features to the pathway class prediction problem and show that simple linear/logistic regression models can predict the values of these global features from the shape features extracted using our framework. Availability and implementation: https://github.com/baranwa2/MetabolicPathwayPrediction. Supplementary information: Supplementary data are available at Bioinformatics online.
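To make the feature-extraction idea concrete, here is a minimal sketch of a single graph convolution followed by sum pooling, which turns a molecular graph into a fixed-length vector of the kind a downstream classifier such as a random forest could consume. It is not the authors' network: the propagation rule is the common normalized-adjacency form, the heavy-atom graph of ethanol (SMILES "CCO") is written out by hand rather than parsed, and the weights are arbitrary, untrained values chosen only for illustration.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a 0/1 adjacency matrix A."""
    n = len(adj)
    # Add self-loops so each atom keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(norm_adj, features, weights):
    """One graph convolution: ReLU(norm_adj @ features @ weights)."""
    hidden = matmul(matmul(norm_adj, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


def sum_pool(node_features):
    """Collapse per-atom features into one fixed-length molecule vector."""
    return [sum(column) for column in zip(*node_features)]


# Hand-written heavy-atom graph of ethanol: C0 - C1 - O2 (an assumption,
# standing in for a real SMILES parser).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
atom_features = [[1.0, 0.0],   # one-hot: [is carbon, is oxygen]
                 [1.0, 0.0],
                 [0.0, 1.0]]
weights = [[1.0, -1.0, 0.5],   # arbitrary, untrained illustration values
           [0.5, 1.0, -1.0]]

embedding = sum_pool(gcn_layer(normalized_adjacency(adjacency), atom_features, weights))
print(embedding)  # a length-3 vector summarizing the whole molecule
```

In a real pipeline the weights would be learned, several layers would be stacked, and the pooled vectors for many compounds would form the training matrix for the classifier.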
Graph symmetries intervene in diverse applications, from enumeration, to graph structure compression, to the discovery of graph dynamics (e.g., node arrival order inference). Whereas Erdős-Rényi graphs are typically asymmetric, real networks are highly symmetric. So a natural question is whether preferential attachment graphs, where in each step a new node with m edges is added, exhibit any symmetry. In recent work it was proved that preferential attachment graphs are symmetric for m = 1, and there is some nonnegligible probability of symmetry for m = 2. It was conjectured that these graphs are asymmetric when m ≥ 3. We settle this conjecture in the affirmative, then use it to estimate the structural entropy of the model. To do this, we also give bounds on the number of ways that the given graph structure could have arisen by preferential attachment. These results have further implications for information theoretic problems of interest on preferential attachment graphs.
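The dependence on m can be explored with a small simulation. The sketch below grows a preferential attachment multigraph and looks for a pair of "twin" nodes, that is, two nodes with identical connections to every other node; swapping such a pair is a nontrivial automorphism, so finding one certifies symmetry. Several variants of the model exist, and the choices here (a seed of two nodes joined by m parallel edges, endpoints sampled with replacement, multi-edges kept) are assumptions. The twin test is only a sufficient condition: a graph without twins may still be symmetric.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=None):
    """Grow a multigraph on n nodes; each new node sends m edges to older
    nodes, each endpoint chosen with probability proportional to the degree
    the older node had at the start of the step."""
    if n < 2 or m < 1:
        raise ValueError("need n >= 2 and m >= 1")
    rng = random.Random(seed)
    # Assumed seed graph: nodes 0 and 1 joined by m parallel edges.
    adj = {0: Counter({1: m}), 1: Counter({0: m})}
    endpoints = [0] * m + [1] * m  # node v appears deg(v) times
    for new in range(2, n):
        targets = [rng.choice(endpoints) for _ in range(m)]
        adj[new] = Counter()
        for t in targets:
            adj[new][t] += 1
            adj[t][new] += 1
            endpoints += [new, t]
    return adj


def find_twins(adj):
    """Return a pair (u, v) whose edge multiplicities to all other nodes
    agree, or None. Quadratic in the number of nodes; fine for small graphs."""
    nodes = list(adj)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            nu = {w: c for w, c in adj[u].items() if w != v}
            nv = {w: c for w, c in adj[v].items() if w != u}
            if nu == nv:
                return u, v
    return None


for m in (1, 2, 3):
    hits = sum(find_twins(preferential_attachment(60, m, seed=s)) is not None
               for s in range(50))
    print(f"m={m}: twin pair found in {hits}/50 sampled graphs")
```

For m = 1 the twins are typically two leaves attached to the same node, which is the simplest source of symmetry in a tree.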
We consider PATRICIA tries on n random binary strings generated by a memoryless source with parameter p ≥ 1/2. For both the symmetric (p = 1/2) and asymmetric cases, we analyze asymptotics of the expected value of the external profile at level k = k(n), defined to be the number of leaves at level k. We study three natural ranges of k with respect to n. For k bounded, the mean profile decays exponentially with respect to n. For k growing logarithmically with n, the parameter exhibits polynomial growth in n, with some periodic fluctuations. Finally, for k = Θ(n), we see super-exponential decay, again with periodic fluctuations. Our derivations rely on analytic techniques, including Mellin transforms, analytic depoissonization, and the saddle point method. To cover wider ranges of k and n and provide more intuitive insights, we also use methods of applied mathematics, including asymptotic matching and linearization.
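The external profile defined above can be sampled directly, without storing any strings. The sketch below assumes each bit equals 1 independently with probability p (by symmetry the profile does not depend on which symbol is favored) and uses the fact that a PATRICIA trie only creates a level where the strings in a subtree actually split. The function name and the Monte Carlo averaging loop are illustrative choices, not part of the analysis.

```python
import random
from collections import Counter


def patricia_external_profile(n, p, rng):
    """Sample one PATRICIA trie on n strings from a memoryless source and
    return a Counter mapping each level k to the number of leaves at level k."""
    if n < 1 or not 0.0 < p < 1.0:
        raise ValueError("need n >= 1 and 0 < p < 1")
    profile = Counter()
    stack = [(n, 0)]  # (number of strings in the subtree, level of its root)
    while stack:
        size, level = stack.pop()
        if size == 1:
            profile[level] += 1  # a single string is a leaf
            continue
        # Count how many of the strings have next bit equal to 1.
        ones = sum(rng.random() < p for _ in range(size))
        if ones == 0 or ones == size:
            # All strings agree on this bit: the path is compressed,
            # so no new level is created.
            stack.append((size, level))
        else:
            stack.append((ones, level + 1))
            stack.append((size - ones, level + 1))
    return profile


# Monte Carlo estimate of the mean profile for n = 64 and p = 0.75.
rng = random.Random(0)
trials = 200
mean_profile = Counter()
for _ in range(trials):
    for level, count in patricia_external_profile(64, 0.75, rng).items():
        mean_profile[level] += count / trials
for level in sorted(mean_profile):
    print(level, round(mean_profile[level], 3))
```

Because every internal node of a PATRICIA trie has two children, the leaf depths of any sample satisfy the Kraft equality, which gives a convenient sanity check on the sampler.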
A PATRICIA trie is a trie in which non-branching paths are compressed. The external profile B_{n,k}, defined to be the number of leaves at level k of a PATRICIA trie on n nodes, is an important "summarizing" parameter, in terms of which several other parameters of interest can be formulated. Here we derive precise asymptotics for the expected value and variance of B_{n,k}, as well as a central limit theorem with error bound on the characteristic function, for PATRICIA tries on n infinite binary strings generated by a memoryless source with bias p > 1/2 for k ∼ α log n with α ∈ (1/log(1/q) + ε, 1/log(1/p) − ε) for any fixed ε > 0. In this range, E[B_{n,k}] = Θ(Var[B_{n,k}]), and both are of the form Θ(n^{β(α)}/√(log n)), where the Θ hides bounded, periodic functions in log n whose Fourier series we explicitly determine. The compression property leads to extra terms in the Poisson functional equations for the profile which are not seen in tries or digital search trees, resulting in Mellin transforms which are only implicitly given in terms of the moments of B_{m,j} for various m and j. Thus, the proofs require information about the profile outside the main range of interest. Our derivations rely on analytic techniques, including Mellin transforms, analytic de-Poissonization, the saddle point method, and careful bounding of complex functions.
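As a quick worked example of the range of levels covered here, the following sketch computes the endpoints of the interval for k given n and p. It assumes q = 1 − p and natural logarithms; with the margin set to zero the endpoints are simply the logarithms of n to the bases 1/q and 1/p, so the choice of base only matters for the margin term. The function name and its `eps` parameter are illustrative.

```python
import math


def central_range(n, p, eps=0.0):
    """Return (k_low, k_high) with k_low = (1/log(1/q) + eps) * log n and
    k_high = (1/log(1/p) - eps) * log n, where q = 1 - p (assumed)."""
    if n < 2 or not 0.5 < p < 1.0:
        raise ValueError("need n >= 2 and 1/2 < p < 1")
    q = 1.0 - p
    log_n = math.log(n)
    k_low = (1.0 / math.log(1.0 / q) + eps) * log_n
    k_high = (1.0 / math.log(1.0 / p) - eps) * log_n
    return k_low, k_high


low, high = central_range(n=4 ** 5, p=0.75)
print(f"levels between {low:.2f} and {high:.2f}")
```

Since p > 1/2 implies q < p, the lower endpoint is always below the upper one when the margin is zero, and the interval widens as p moves away from 1/2.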
Background: Development of new methods for analysis of protein–protein interactions (PPIs) at molecular and nanometer scales gives insights into intracellular signaling pathways and will improve understanding of protein functions, as well as other nanoscale structures of biological and abiological origins. Recent advances in computational tools, particularly the ones involving modern deep learning algorithms, have been shown to complement experimental approaches for describing and rationalizing PPIs. However, most of the existing works on PPI predictions use protein-sequence information, and thus have difficulties in accounting for the three-dimensional organization of the protein chains. Results: In this study, we address this problem and describe a PPI analysis based on a graph attention network, named Struct2Graph, for identifying PPIs directly from the structural data of folded protein globules. Our method is capable of predicting the PPI with an accuracy of 98.89% on the balanced set consisting of an equal number of positive and negative pairs. On the unbalanced set with the ratio of 1:10 between positive and negative pairs, Struct2Graph achieves a fivefold cross validation average accuracy of 99.42%. Moreover, Struct2Graph can potentially identify residues that likely contribute to the formation of the protein–protein complex. The identification of important residues is tested for two different interaction types: (a) proteins with multiple ligands competing for the same binding area, and (b) dynamic protein–protein adhesion interaction. Struct2Graph identifies interacting residues with 30% sensitivity, 89% specificity, and 87% accuracy. Conclusions: In this manuscript, we address the problem of prediction of PPIs using a first-of-its-kind 3D-structure-based graph attention network (code available at https://github.com/baranwa2/Struct2Graph). Furthermore, the novel mutual attention mechanism provides insights into likely interaction sites through its unsupervised knowledge selection process. This study demonstrates that a relatively low-dimensional feature embedding learned from graph structures of individual proteins outperforms other modern machine learning classifiers based on global protein features. In addition, through the analysis of single amino acid variations, the attention mechanism shows preference for disease-causing residue variations over benign polymorphisms, demonstrating that it is not limited to interface residues.
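For readers comparing the sensitivity, specificity, and accuracy figures quoted above, the sketch below shows how the three quantities are computed from a confusion matrix. The counts are hypothetical toy values, not data from the study; they are chosen only to show that accuracy can stay high while sensitivity is low when negatives greatly outnumber positives.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true positives recovered
        "specificity": tn / (tn + fp),   # fraction of true negatives recovered
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 10 interacting residues, 100 non-interacting residues.
metrics = binary_metrics(tp=3, fp=11, tn=89, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```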