Hit songs, books, and movies are many times more successful than average, suggesting that "the best" alternatives are qualitatively different from "the rest"; yet experts routinely fail to predict which products will succeed. We investigated this paradox experimentally, by creating an artificial "music market" in which 14,341 participants downloaded previously unknown songs either with or without knowledge of previous participants' choices. Increasing the strength of social influence increased both inequality and unpredictability of success. Success was also only partly determined by quality: The best songs rarely did poorly, and the worst rarely did well, but any other result was possible.
Standard statistical methods often provide no way to make accurate estimates about the characteristics of hidden populations such as injection drug users, the homeless, and artists. In this paper, we further develop a sampling and estimation technique called respondent-driven sampling, which allows researchers to make asymptotically unbiased estimates about these hidden populations. The sample is selected with a snowball-type design that can be done more cheaply, quickly, and easily than other methods currently in use. Further, we can show that under certain specified (and quite general) conditions, our estimates for the percentage of the population with a specific trait are asymptotically unbiased. We further show that these estimates are asymptotically unbiased no matter how the seeds are selected. We conclude with a comparison of respondent-driven samples of jazz musicians in New York and San Francisco, with corresponding institutional samples of jazz musicians from these cities. The results show that some standard methods for studying hidden populations can produce misleading results.
Respondent-driven sampling (RDS) is a network-based technique for estimating traits in hard-to-reach populations, for example, the prevalence of HIV among drug injectors. In recent years RDS has been used in more than 120 studies in more than 20 countries and by leading public health organizations, including the Centers for Disease Control and Prevention in the United States. Despite the widespread use and growing popularity of RDS, there has been little empirical validation of the methodology. Here we investigate the performance of RDS by simulating sampling from 85 known, network populations. Across a variety of traits we find that RDS is substantially less accurate than generally acknowledged and that reported RDS confidence intervals are misleadingly narrow. Moreover, because we model a best-case scenario in which the theoretical RDS sampling assumptions hold exactly, it is unlikely that RDS performs any better in practice than in our simulations. Notably, the poor performance of RDS is driven not by the bias but by the high variance of estimates, a possibility that had been largely overlooked in the RDS literature. Given the consistency of our results across networks and our generous sampling conditions, we conclude that RDS as currently practiced may not be suitable for key aspects of public health surveillance where it is now extensively applied.
disease surveillance | snowball sampling | social networks
The development and evaluation of public health policies often require detailed information about so-called hard-to-reach or hidden populations. For example, HIV researchers are especially interested in monitoring risk behavior and disease prevalence among injection drug users, men who have sex with men, and commercial sex workers, the groups at highest risk for HIV in most countries.
Unfortunately, however, these high-risk groups are not easily studied with standard sampling methods, including institutional sampling, targeted sampling, and time-location sampling (1). Respondent-driven sampling (RDS) (2-4) facilitates examination of such hidden populations via a chain-referral procedure in which participants recruit one another, akin to snowball sampling. RDS is now widely used in the public health community and has been recently applied in more than 120 studies in more than 20 countries, involving a total of more than 32,000 participants (5). In particular, in helping to track the HIV epidemic, RDS is used by the Centers for Disease Control and Prevention (CDC) (6, 7) and by the United States President's Emergency Plan for AIDS Relief. RDS is a method both for data collection and for statistical inference. To generate an RDS sample, one begins by selecting a small number of initial participants ("seeds") from the target population who are asked (and typically given a financial incentive) to recruit their contacts in the population (2). The sampling proceeds with current sample members recruiting the next wave of sample members, continuing until the desired sample size is reached. Participants are usually all...
Hidden populations, such as injection drug users and sex workers, are central to a number of public health problems. However, because of the nature of these groups, it is difficult to collect accurate information about them, and this difficulty complicates disease prevention efforts. A recently developed statistical approach called respondent-driven sampling improves our ability to study hidden populations by allowing researchers to make unbiased estimates of the prevalence of certain traits in these populations. Yet, not enough is known about the sample-to-sample variability of these prevalence estimates. In this paper, we present a bootstrap method for constructing confidence intervals around respondent-driven sampling estimates and demonstrate in simulations that it outperforms the naive method currently in use. We also use simulations and real data to estimate the design effects for respondent-driven sampling in a number of situations. We conclude with practical advice about the power calculations that are needed to determine the appropriate sample size for a study using respondent-driven sampling. In general, we recommend a sample size twice as large as would be needed under simple random sampling.
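As a rough illustration of the sample-size advice above, the following sketch applies a design effect of 2 to the textbook normal-approximation formula for estimating a proportion. The formula, the function names, and the example values (30% prevalence, a 5-point margin of error) are assumptions made for illustration and are not taken from the paper; only the factor of two reflects the recommendation in the abstract.

```python
import math


def srs_sample_size(prevalence, margin, z=1.96):
    """Sample size to estimate a proportion within +/- margin under
    simple random sampling (normal approximation, z=1.96 for ~95%)."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(z ** 2 * prevalence * (1 - prevalence) / margin ** 2)


def rds_sample_size(prevalence, margin, design_effect=2.0, z=1.96):
    """Inflate the simple-random-sampling variance by an assumed design
    effect (default 2, per the abstract's rule of thumb)."""
    if design_effect <= 0:
        raise ValueError("design_effect must be positive")
    raw = design_effect * z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    return math.ceil(raw)


if __name__ == "__main__":
    # Hypothetical planning values, chosen only for illustration.
    print(srs_sample_size(0.30, 0.05))
    print(rds_sample_size(0.30, 0.05))
```

A study-specific design effect, where one can be estimated, would replace the default of 2.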
Respondent-driven sampling (RDS) is a widely used method for sampling from hard-to-reach human populations, especially populations at higher risk for HIV. Data are collected through peer-referral over social networks. RDS has proven practical for data collection in many difficult settings and is widely used. Inference from RDS data requires many strong assumptions because the sampling design is partially beyond the control of the researcher and partially unobserved. We introduce diagnostic tools for most of these assumptions and apply them in 12 high risk populations. These diagnostics empower researchers to better understand their data and encourage future statistical research on RDS.
In this paper we develop a method to estimate both individual social network size (i.e., degree) and the distribution of network sizes in a population by asking respondents how many people they know in specific subpopulations (e.g., people named Michael). Building on the scale-up method of Killworth et al. (1998b) and other previous attempts to estimate individual network size, we propose a latent non-random mixing model which resolves three known problems with previous approaches. As a byproduct, our method also provides estimates of the rate of social mixing between population groups. We demonstrate the model using a sample of 1,370 adults originally collected by McCarty et al. (2001). Based on insights developed during the statistical modeling, we conclude by offering practical guidelines for the design of future surveys to estimate social network size. Most importantly, we show that if the first names to be asked about are chosen properly, the simple scale-up degree estimates can enjoy the same bias-reduction as that from our more complex latent non-random mixing model.
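The simple scale-up degree estimate mentioned above can be sketched as follows, assuming the basic known-population form: a respondent's degree is the total population size multiplied by the share of the listed subpopulations that the respondent reports knowing. The function name and all numbers are hypothetical, and this is not the latent non-random mixing model the paper proposes.

```python
def scale_up_degree(known_counts, subpop_sizes, total_population):
    """Basic scale-up degree estimate for one respondent.

    known_counts[k]  -- how many people the respondent knows in subpopulation k
    subpop_sizes[k]  -- true size of subpopulation k (e.g., people with a given name)
    """
    if len(known_counts) != len(subpop_sizes):
        raise ValueError("one count is needed per subpopulation")
    total_known_size = sum(subpop_sizes)
    if total_known_size <= 0:
        raise ValueError("subpopulation sizes must sum to a positive number")
    return total_population * sum(known_counts) / total_known_size


if __name__ == "__main__":
    # Hypothetical respondent and hypothetical subpopulation sizes.
    counts = [2, 1, 0]
    sizes = [4_000_000, 2_000_000, 1_000_000]
    print(scale_up_degree(counts, sizes, 280_000_000))
```

Which subpopulations are asked about affects how well this simple ratio performs, which is the survey-design point the abstract makes about choosing first names.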
Respondent-driven sampling (RDS) is a recently introduced, and now widely used, technique for estimating disease prevalence in hidden populations. RDS data are collected through a snowball mechanism, in which current sample members recruit future sample members. In this paper we present respondent-driven sampling as Markov chain Monte Carlo (MCMC) importance sampling, and we examine the effects of community structure and the recruitment procedure on the variance of RDS estimates. Past work has assumed that the variance of RDS estimates is primarily affected by segregation between healthy and infected individuals. We examine an illustrative model to show that this is not necessarily the case, and that bottlenecks anywhere in the networks can substantially affect estimates. We also show that variance is inflated by a common design feature in which sample members are encouraged to recruit multiple future sample members. The paper concludes with suggestions for implementing and evaluating respondent-driven sampling studies.
Estimating sizes of hidden or hard-to-reach populations is an important problem in public health. For example, estimates of the sizes of populations at highest risk for HIV and AIDS are needed for designing, evaluating and allocating funding for treatment and prevention programmes. A promising approach to size estimation, relatively new to public health, is the network scale-up method (NSUM), involving two steps: estimating the personal network size of the members of a random sample of a total population and, with this information, estimating the number of members of a hidden subpopulation of the total population. We describe the method, including two approaches to estimating personal network sizes (summation and known population). We discuss the strengths and weaknesses of each approach and provide examples of international applications of the NSUM in public health. We conclude with recommendations for future research and evaluation.
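The second step of the method described above can be sketched with the basic ratio form of the scale-up estimator: the hidden population size is the total population size multiplied by the fraction of all respondents' network ties that point into the hidden group. This sketch assumes the personal network sizes have already been estimated in the first step; the function name and the data are hypothetical.

```python
def nsum_hidden_size(hidden_counts, degrees, total_population):
    """Basic network scale-up estimate of a hidden population's size.

    hidden_counts[i] -- number of hidden-population members respondent i knows
    degrees[i]       -- estimated personal network size of respondent i
    """
    if len(hidden_counts) != len(degrees):
        raise ValueError("one degree estimate is needed per respondent")
    total_degree = sum(degrees)
    if total_degree <= 0:
        raise ValueError("degrees must sum to a positive number")
    return total_population * sum(hidden_counts) / total_degree


if __name__ == "__main__":
    # Three hypothetical respondents from a total population of 1,000,000.
    print(nsum_hidden_size([1, 0, 2], [300, 200, 500], 1_000_000))
```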