In directed graphs, relationships are asymmetric and these asymmetries contain essential structural information about the graph. Directed relationships lead to a new type of clustering that is not feasible in undirected graphs. We propose a spectral co-clustering algorithm called DI-SIM for asymmetry discovery and directional clustering. A Stochastic co-Blockmodel is introduced to show favorable properties of DI-SIM. To account for the sparse and highly heterogeneous nature of directed networks, DI-SIM uses the regularized graph Laplacian and projects the rows of the eigenvector matrix onto the sphere. A nodewise asymmetry score and DI-SIM are used to analyze the clustering asymmetries in the networks of Enron emails, political blogs, and the Caenorhabditis elegans chemical connectome. In each example, a subset of nodes have clustering asymmetries; these nodes send edges to one cluster, but receive edges from another cluster. Such nodes yield insightful information (e.g., communication bottlenecks) about directed networks, but are missed if the analysis ignores edge direction.

Clustering is widely used to study the structure of social, biological, and technological networks because it provides an aggregated and simplified representation of the complex interactions. The difficulty of the clustering problem has inspired an extensive literature devoted to the statistical and computational issues. Spectral approximation algorithms have become popular due to their computational speed and empirical performance across domain areas.

In the clustering literature, the vast majority of models and algorithms presume that the interactions are symmetric or undirected. In some settings, the relationships can be well approximated as symmetric. However, asymmetric or directed relationships more fully represent the vast majority of interactions. For example, in a gene regulatory network, one gene drives the transcription of another gene.
In the power grid network, electricity flows from one node to the other. In a communication network, one node initiates the conversation. In other examples, it might be easier to observe the relationship without direction, but the direction remains of fundamental importance. For example, in a social network, a business searching for "trend leaders" wants to know the direction of influence in relationships, which is not directly observable. In a regulatory network, knockout experiments seek to estimate the direction of gene regulation. For many questions of interest, making the edges undirected does not provide an appropriate approximation. In all of these examples, the direction of the edges is essential to the function of the network. Directionality gives asymmetry to a relationship and the standard notion of clustering is insufficient to explore and appropriately aggregate asymmetric relationships in our data examples.

To extend clustering to directed networks, we use Hartigan's notion of co-clustering, which he proposed as a way to simultaneously cluster both the rows and the columns o...
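The abstract above names three ingredients of DI-SIM: a regularized graph Laplacian, a projection of the rows of the singular-vector matrix onto the sphere, and separate sending and receiving clusters. The following is a minimal sketch of that pipeline, not the authors' implementation. It assumes the regularized Laplacian has the form L = O^{-1/2} A P^{-1/2}, with out-degrees and in-degrees each inflated by a constant `tau`, assumes `tau` defaults to the average degree, and assumes the left and right singular vectors are clustered separately with a plain k-means. The function names `kmeans` and `di_sim_sketch` and the toy graph are illustrative choices that do not appear in the source.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def di_sim_sketch(A, k, tau=None):
    """Co-cluster a directed graph; A[i, j] = 1 means an edge i -> j."""
    A = np.asarray(A, dtype=float)
    out_deg = A.sum(axis=1)
    in_deg = A.sum(axis=0)
    if tau is None:
        tau = out_deg.mean()  # assumption: regularize by the average degree
    # Assumed form of the regularized Laplacian: O^{-1/2} A P^{-1/2}.
    L = A / np.sqrt(out_deg + tau)[:, None] / np.sqrt(in_deg + tau)[None, :]
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k].T

    def to_sphere(M):
        # Project each row onto the unit sphere (guarding against zero rows).
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.maximum(norms, 1e-12)

    send_labels = kmeans(to_sphere(U_k), k)  # clusters by sending pattern
    recv_labels = kmeans(to_sphere(V_k), k)  # clusters by receiving pattern
    return send_labels, recv_labels


# Toy graph: sending groups and receiving groups deliberately disagree.
n = 40
send_true = (np.arange(n) < 15).astype(int)
recv_true = np.arange(n) % 2
A_demo = (send_true[:, None] == recv_true[None, :]).astype(float)
send_hat, recv_hat = di_sim_sketch(A_demo, k=2)
```

In the toy graph, a node's sending group depends on its index range while its receiving group depends on index parity, so the two partitions differ. That is the kind of clustering asymmetry the abstract says is lost when edge direction is ignored.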
In the high dimensional Stochastic Blockmodel for a random network, the number of clusters (or blocks) K grows with the number of nodes N. Two previous studies have examined the statistical estimation performance of spectral clustering and the maximum likelihood estimator under the high dimensional model; neither of these results allows K to grow faster than N^{1/2}. We study a model where, ignoring log terms, K can grow proportionally to N. Since the number of clusters must be smaller than the number of nodes, no reasonable model allows K to grow faster; thus, our asymptotic results are the "highest" dimensional. To push the asymptotic setting to this extreme, we make additional assumptions that are motivated by empirical observations in physical anthropology (Dunbar, 1992), and an in-depth study of massive empirical networks (Leskovec, Lang, Dasgupta, and Mahoney, 2008). Furthermore, we develop a regularized maximum likelihood estimator that leverages these insights and we prove that, under certain conditions, the proportion of nodes that the regularized estimator misclusters converges to zero. This is the first paper to explicitly introduce and demonstrate the advantages of statistical regularization in a parametric form for network analysis.
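To make the regime "K grows proportionally to N" concrete, the sketch below samples an undirected Stochastic Blockmodel in which every block has the same fixed size, so doubling N doubles K. It only illustrates the asymptotic setting and is not the paper's model specification or its regularized estimator. The function name `sample_sbm_fixed_block_size` and the parameters `block_size`, `p_in`, and `p_out` are assumptions introduced here.

```python
import numpy as np


def sample_sbm_fixed_block_size(n, block_size, p_in, p_out, seed=0):
    """Sample an undirected SBM whose blocks all have (about) block_size nodes.

    Returns the adjacency matrix, the block label of each node, and K.
    """
    rng = np.random.default_rng(seed)
    z = np.arange(n) // block_size           # block label of each node
    k = int(z.max()) + 1                      # K = ceil(n / block_size)
    same_block = z[:, None] == z[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Draw each unordered pair once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return A, z, k


# With the block size held fixed, K scales linearly with N.
A_small, z_small, k_small = sample_sbm_fixed_block_size(60, 10, 0.8, 0.02)
A_large, z_large, k_large = sample_sbm_fixed_block_size(120, 10, 0.8, 0.02)
```

Holding the block size fixed is one simple way to mirror the Dunbar-motivated idea that communities stay small as the network grows. The number of blocks to estimate then scales with N while each block contributes only a bounded number of nodes.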