Community detection in networks is one of the most popular topics of modern network science. Communities, or clusters, are usually groups of vertices having higher probability of being connected to each other than to members of other groups, though other patterns are possible. Identifying communities is an ill-defined problem. There are no universal protocols on the fundamental ingredients, like the definition of community itself, nor on other crucial issues, like the validation of algorithms and the comparison of their performances. This has generated a number of confusions and misconceptions, which undermine the progress in the field. We offer a guided tour through the main aspects of the problem. We also point out strengths and weaknesses of popular methods, and give directions to their use.
Algorithms to find communities in networks rely just on structural information and search for cohesive subsets of nodes. On the other hand, most scholars implicitly or explicitly assume that structural communities represent groups of nodes with similar (non-topological) properties or functions. This hypothesis could not be verified, so far, because of the lack of network datasets with information on the classification of the nodes. We show that traditional community detection methods fail to find the metadata groups in many large networks. Our results show that there is a marked separation between structural communities and metadata groups, in line with recent findings. That means that either our current modeling of community structure has to be substantially modified, or that metadata groups may not be recoverable from topology alone.
The empirical validation of community detection methods is often based on available annotations on the nodes that serve as putative indicators of the large-scale network structure. Most often, the suitability of the annotations as topological descriptors itself is not assessed, and without this it is not possible to ultimately distinguish between actual shortcomings of the community detection algorithms on one hand, and the incompleteness, inaccuracy or structured nature of the data annotations themselves on the other. In this work we present a principled method to access both aspects simultaneously. We construct a joint generative model for the data and metadata, and a nonparametric Bayesian framework to infer its parameters from annotated datasets. We assess the quality of the metadata not according to its direct alignment with the network communities, but rather in its capacity to predict the placement of edges in the network. We also show how this feature can be used to predict the connections to missing nodes when only the metadata is available, as well as missing metadata. By investigating a wide range of datasets, we show that while there are seldom exact agreements between metadata tokens and the inferred data groups, the metadata is often informative of the network structure nevertheless, and can improve the prediction of missing nodes. This shows that the method uncovers meaningful patterns in both the data and metadata, without requiring or expecting a perfect agreement between the two.
In this study we map out the large-scale structure of citation networks of science journals and follow their evolution in time by using stochastic block models (SBMs). The SBM fitting procedures are principled methods that can be used to find hierarchical grouping of journals into blocks that show similar incoming and outgoing citations patterns. These methods work directly on the citation network without the need to construct auxiliary networks based on similarity of nodes. We fit the SBMs to the networks of journals we have constructed from the data set of around 630 million citations and find a variety of different types of blocks, such as clusters, bridges, sources, and sinks. In addition we use a recent generalization of SBMs to determine how much a manually curated classification of journals into subfields of science is related to the block structure of the journal network and how this relationship changes in time. The SBM method tries to find a network of blocks that is the best high-level representation of the network of journals, and we illustrate how these block networks (at various levels of resolution) can be used as maps of sci-
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.