Sampling is a commonly used technique for studying structural properties of online social networks (OSNs). Due to privacy, business, and performance concerns, OSN service providers impose limitations on data access for third parties. The implication of this practice is that one needs to come up with an applicable sampling scheme that can function under these limitations to efficiently estimate structural properties of interest. In this paper, we study how accurately some important properties of graphs can be estimated under a limited data access model. More specifically, we consider the random neighbor access (RNA) model as a rather limited data access model in OSNs. In the RNA model, the only query available to get data from the studied graph is the random neighbor query, which returns the id of a random neighbor for a given vertex id. We propose various sampling schemes and estimators for average degree and network size under the RNA model. We conduct extensive experiments on both real-world OSN graphs and synthetic graphs (1) to measure the performance of the proposed estimators and (2) to identify the factors affecting the accuracy of our estimators. We find that while the average degree estimators can make accurate estimations with reasonable sample sizes despite the extreme data access limitations of the RNA model, network size estimators require quite large sample sizes for accurate estimations.

Figure 2: Proposed Estimators: Illustration of which estimator is used based on the sampling scheme and probing type. Probing type is applicable only to the ERSRW sampling scheme. We propose an average degree estimator only under ERSRW sampling, so the step of choosing the sampling scheme is not applicable.

1 We use the term estimation performance as a combined measure of precision and

precision of the estimation; while in the estimation of the network size, it increases the accuracy, especially when the sampling fraction f > 1.

3. The dynamic nature of the underlying graph adds one more layer to the complexity of the estimation problem. The accuracy of the estimation is limited by how fast the samples can be collected and how fast the property of interest changes. As opposed to the static graph case, larger sample sizes do not provide better estimation results, especially when the property of interest increases or decreases over time, as the old data becomes unrepresentative of the current data.

The outline of the paper is as follows: Section 2 presents the background and the related work. Section 3 presents the RNA model and sampling designs. Section 4 presents the estimators for the RNA model. Section 5 presents our experimental evaluations. Section 6 discusses the practical issues. Section 7 concludes the paper.
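To make the random neighbor query concrete, the following Python sketch models a graph that can be accessed only through that query, together with a plain walk driven by it. It is an illustration under stated assumptions, not the RSRW or ERSRW schemes studied in this paper: the names RNAOracle, random_neighbor, and walk, as well as the toy graph, are hypothetical, and the neighbor is assumed to be chosen uniformly at random.

```python
import random


class RNAOracle:
    """Hypothetical stand-in for an OSN that answers only random neighbor queries."""

    def __init__(self, adjacency, seed=None):
        # The adjacency lists are hidden from the sampler; only the query is public.
        self._adj = adjacency
        self._rng = random.Random(seed)
        self.queries = 0  # number of queries issued so far

    def random_neighbor(self, vertex_id):
        """Return the id of one neighbor of vertex_id (assumed uniform at random)."""
        self.queries += 1
        return self._rng.choice(self._adj[vertex_id])


def walk(oracle, start, steps):
    """Walk for `steps` moves using only random neighbor queries.

    The sampler records vertex ids only; it never sees degrees or neighbor lists.
    """
    visited = [start]
    current = start
    for _ in range(steps):
        current = oracle.random_neighbor(current)
        visited.append(current)
    return visited


if __name__ == "__main__":
    # Toy undirected graph, made up for illustration.
    toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    oracle = RNAOracle(toy, seed=7)
    print(walk(oracle, start=0, steps=10))
```

Each step of the walk costs exactly one query, so the query counter gives a simple way to relate a sampling budget to the number of collected vertex ids.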
Background and Related Work

The RNA model enables us to perform a walk, but not a SRW 2 , on the underlying graph. Nevertheless, the estimation techniques proposed under SRW sampling form the basis for those under the RNA model as the sampling schemes that we use under the RNA model, namely RSRW and ERSRW, are the modificatio...