Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the difficulty of the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with scarcity of label data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
Abstract. We propose here a multi-label semi-supervised learning algorithm, PfunBG, to predict protein functions, employing a bi-relational graph (BG) of proteins and function annotations. Different from most, if not all, existing methods that only consider the partially labeled proteinprotein interaction (PPI) network, the BG comprises three components, a PPI network, a function class graph induced from function annotations of such proteins, and a bipartite graph induced from function assignments. By referring to proteins and function classes equally as vertices, we exploit network propagation to measure how closely a specific function class is related to a protein of interest. The experiments on a yeast PPI network illustrate its effectiveness and efficiency.
BackgroundProteins that interact in vivo tend to reside within the same or "adjacent" subcellular compartments. This observation provides opportunities to reveal protein subcellular localization in the context of the protein-protein interaction (PPI) network. However, so far, only a few efforts based on heuristic rules have been made in this regard.ResultsWe systematically and quantitatively validate the hypothesis that proteins physically interacting with each other probably share at least one common subcellular localization. With the result, for the first time, four graph-based semi-supervised learning algorithms, Majority, χ2-score, GenMultiCut and FunFlow originally proposed for protein function prediction, are introduced to assign "multiplex localization" to proteins. We analyze these approaches by performing a large-scale cross validation on a Saccharomyces cerevisiae proteome compiled from BioGRID and comparing their predictions for 22 protein subcellular localizations. Furthermore, we build an ensemble classifier to associate 529 unlabeled and 137 ambiguously-annotated proteins with subcellular localizations, most of which have been verified in the previous experimental studies.ConclusionsPhysical interaction of proteins has actually provided an essential clue for their co-localization. Compared to the local approaches, the global algorithms consistently achieve a superior performance.
We show here that the problem of maximizing a family of quantitative functions, encompassing both the modularity (Q-measure) and modularity density (D-measure), for community detection can be uniformly understood as a combinatoric optimization involving the trace of a matrix called modularity Laplacian. Instead of using traditional spectral relaxation, we apply additional nonnegative constraint into this graph clustering problem and design efficient algorithms to optimize the new objective. With the explicit nonnegative constraint, our solutions are very close to the ideal community indicator matrix and can directly assign nodes into communities. The near-orthogonal columns of the solution can be reformulated as the posterior probability of corresponding node belonging to each community. Therefore, the proposed method can be exploited to identify the fuzzy or overlapping communities and thus facilitates the understanding of the intrinsic structure of networks. Experimental results show that our new algorithm consistently, sometimes significantly, outperforms the traditional spectral relaxation approaches.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.