Online publishing, social networks, and web search have dramatically lowered the costs to produce, distribute, and discover news articles. Some scholars argue that such technological changes increase exposure to diverse perspectives, while others worry they increase ideological segregation. We address the issue by examining web browsing histories for 50,000 U.S.-located users who regularly read online news. We find that social networks and search engines increase the mean ideological distance between individuals. However, somewhat counterintuitively, we also find these same channels increase an individual's exposure to material from his or her less preferred side of the political spectrum. Finally, we show that the vast majority of online news consumption is accounted for by individuals simply visiting the home pages of their favorite, typically mainstream, news outlets, tempering the consequences, both positive and negative, of recent technological changes. We thus uncover evidence for both sides of the debate, while also finding that the magnitude of the effects is relatively modest.
Algorithms are now regularly used to decide whether defendants awaiting trial are too dangerous to be released back into the community. In some cases, black defendants are substantially more likely than white defendants to be incorrectly classified as high risk. To mitigate such disparities, several techniques have recently been proposed to achieve algorithmic fairness. Here we reformulate algorithmic fairness as constrained optimization: the objective is to maximize public safety while satisfying formal fairness constraints designed to reduce racial disparities. We show that for several past definitions of fairness, the optimal algorithms that result require detaining defendants above race-specific risk thresholds. We further show that the optimal unconstrained algorithm requires applying a single, uniform threshold to all defendants. The unconstrained algorithm thus maximizes public safety while also satisfying one important understanding of equality: that all individuals are held to the same standard, irrespective of race. Because the optimal constrained and unconstrained algorithms generally differ, there is tension between improving public safety and satisfying prevailing notions of algorithmic fairness. By examining data from Broward County, Florida, we show that this trade-off can be large in practice. We focus on algorithms for pretrial release decisions, but the principles we discuss apply to other domains, and also to human decision makers carrying out structured decision rules. We consider racial disparities because they have been at the center of many recent debates in criminal justice, but the same logic applies across a range of possible attributes, including gender. (arXiv:1701.08230v4 [cs.CY])
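The contrast between the two decision rules discussed above can be made concrete with a small sketch. The risk scores, group labels, and threshold values below are invented for illustration; they are not drawn from the paper or from the Broward County data.

```python
# Hypothetical sketch of uniform vs. group-specific risk thresholds.
# All numbers here are invented for illustration only.

def detain_uniform(risk, threshold=0.5):
    """Single threshold applied to every defendant, regardless of group."""
    return risk >= threshold

def detain_group_specific(risk, group, thresholds):
    """Group-specific thresholds, as some fairness constraints require."""
    return risk >= thresholds[group]

defendants = [
    {"risk": 0.62, "group": "a"},
    {"risk": 0.62, "group": "b"},
    {"risk": 0.41, "group": "b"},
]

# Under a uniform threshold, equal risk implies equal treatment.
uniform = [detain_uniform(d["risk"]) for d in defendants]

# Under group-specific thresholds, two defendants with identical risk
# scores can receive different decisions.
by_group = [
    detain_group_specific(d["risk"], d["group"], {"a": 0.5, "b": 0.7})
    for d in defendants
]

print(uniform)   # [True, True, False]
print(by_group)  # [True, False, False]
```

The first two defendants have the same risk score but, under the group-specific rule, receive different decisions; this is the sense in which the two notions of equality pull apart.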
Viral products and ideas are intuitively understood to grow through a person-to-person diffusion process analogous to the spread of an infectious disease; however, until recently it has been prohibitively difficult to directly observe purportedly viral events, and thus to rigorously quantify or characterize their structural properties. Here we propose a formal measure of what we label "structural virality" that interpolates between two conceptual extremes: content that gains its popularity through a single, large broadcast and that which grows through multiple generations with any one individual directly responsible for only a fraction of the total adoption. We use this notion of structural virality to analyze a unique data set of a billion diffusion events on Twitter, including the propagation of news stories, videos, images, and petitions. We find that across all domains and all sizes of events, online diffusion is characterized by surprising structural diversity; that is, popular events regularly grow via both broadcast and viral mechanisms, as well as essentially all conceivable combinations of the two. Nevertheless, we find that structural virality is typically low, and remains so independent of size, suggesting that popularity is largely driven by the size of the largest broadcast. Finally, we attempt to replicate these findings with a model of contagion characterized by a low infection rate spreading on a scale-free network. We find that although several of our empirical findings are consistent with such a model, it fails to replicate the observed diversity of structural virality, thereby suggesting new directions for future modeling efforts.
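A minimal sketch of how such a measure behaves, assuming structural virality is defined as the mean shortest-path distance between all pairs of nodes in a diffusion tree (the normalized Wiener index); the two toy trees below are invented, not taken from the Twitter data.

```python
# Sketch: compare a broadcast-shaped diffusion tree (a star) with a
# chain-shaped one (a path) under a Wiener-index-style virality measure.
# Assumption: structural virality = mean pairwise distance in the tree.
from collections import deque

def structural_virality(edges):
    """Mean shortest-path distance over all node pairs of a tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    nodes = list(adj)

    def bfs_distances(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        return dist

    total = pairs = 0
    for u in nodes:
        dist = bfs_distances(u)
        for v in nodes:
            if u < v:
                total += dist[v]
                pairs += 1
    return total / pairs

# Pure broadcast: one seed directly reaches six adopters (a star).
star = [(0, i) for i in range(1, 7)]
# Pure viral chain: each adopter recruits exactly one more (a path).
chain = [(i, i + 1) for i in range(6)]

print(structural_virality(star))   # low: most pairs are 2 hops apart
print(structural_virality(chain))  # higher: pairs spread along the chain
```

For seven nodes the star scores 36/21 ≈ 1.71 while the chain scores 56/21 ≈ 2.67, illustrating how the measure interpolates between the broadcast and viral extremes.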
Respondent-driven sampling (RDS) is a network-based technique for estimating traits in hard-to-reach populations, for example, the prevalence of HIV among drug injectors. In recent years RDS has been used in more than 120 studies in more than 20 countries and by leading public health organizations, including the Centers for Disease Control and Prevention in the United States. Despite the widespread use and growing popularity of RDS, there has been little empirical validation of the methodology. Here we investigate the performance of RDS by simulating sampling from 85 known, network populations. Across a variety of traits we find that RDS is substantially less accurate than generally acknowledged and that reported RDS confidence intervals are misleadingly narrow. Moreover, because we model a best-case scenario in which the theoretical RDS sampling assumptions hold exactly, it is unlikely that RDS performs any better in practice than in our simulations. Notably, the poor performance of RDS is driven not by the bias but by the high variance of estimates, a possibility that had been largely overlooked in the RDS literature. Given the consistency of our results across networks and our generous sampling conditions, we conclude that RDS as currently practiced may not be suitable for key aspects of public health surveillance where it is now extensively applied.

disease surveillance | snowball sampling | social networks

The development and evaluation of public health policies often require detailed information about so-called hard-to-reach or hidden populations. For example, HIV researchers are especially interested in monitoring risk behavior and disease prevalence among injection drug users, men who have sex with men, and commercial sex workers, the groups at highest risk for HIV in most countries.
Unfortunately, however, these high-risk groups are not easily studied with standard sampling methods, including institutional sampling, targeted sampling, and time-location sampling (1). Respondent-driven sampling (RDS) (2-4) facilitates examination of such hidden populations via a chain-referral procedure in which participants recruit one another, akin to snowball sampling. RDS is now widely used in the public health community and has been recently applied in more than 120 studies in more than 20 countries, involving a total of more than 32,000 participants (5). In particular, in helping to track the HIV epidemic, RDS is used by the Centers for Disease Control and Prevention (CDC) (6, 7) and by the United States President's Emergency Plan for AIDS Relief. RDS is a method both for data collection and for statistical inference. To generate an RDS sample, one begins by selecting a small number of initial participants ("seeds") from the target population who are asked (and typically given a financial incentive) to recruit their contacts in the population (2). The sampling proceeds with current sample members recruiting the next wave of sample members, continuing until the desired sample size is reached. Participants are usually all...
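The inference side of RDS can be sketched with the RDS-II (inverse-degree weighted) estimator, one common way to correct for the fact that well-connected people are more likely to be recruited into a chain-referral sample. The respondents and their reported network degrees below are invented for illustration.

```python
# Sketch of an RDS-II-style prevalence estimate: each respondent is
# weighted by 1/degree to offset unequal inclusion probabilities.
# The sample data are invented for illustration only.

def rds_ii_estimate(sample):
    """Inverse-degree weighted mean of a binary trait."""
    weights = [1 / r["degree"] for r in sample]
    weighted = [w * r["trait"] for w, r in zip(weights, sample)]
    return sum(weighted) / sum(weights)

sample = [
    {"degree": 10, "trait": 1},  # well-connected respondent with the trait
    {"degree": 2,  "trait": 0},
    {"degree": 5,  "trait": 1},
    {"degree": 2,  "trait": 0},
]

# Naive sample mean ignores who was likely to be recruited.
naive = sum(r["trait"] for r in sample) / len(sample)
est = rds_ii_estimate(sample)

print(naive)           # 0.5
print(round(est, 3))   # 0.231: down-weights the high-degree respondents
```

Because the trait here is concentrated among high-degree respondents, the weighted estimate falls well below the naive mean; the variance of such weighted estimates across referral chains is exactly the issue the simulations above examine.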
Recent work has demonstrated that Web search volume can "predict the present," meaning that it can be used to accurately track outcomes such as unemployment levels, auto and home sales, and disease prevalence in near real time. Here we show that what consumers are searching for online can also predict their collective future behavior days or even weeks in advance. Specifically we use search query volume to forecast the opening weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100 chart, finding in all cases that search counts are highly predictive of future outcomes. We also find that search counts generally boost the performance of baseline models fit on other publicly available data, where the boost varies from modest to dramatic, depending on the application in question. Finally, we reexamine previous work on tracking flu trends and show that, perhaps surprisingly, the utility of search data relative to a simple autoregressive model is modest. We conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries provide a useful guide to the near future.

culture | predictions

As people increasingly turn to the Internet for news, information, and research purposes, it is tempting to view online activity at any moment in time as a snapshot of the collective consciousness, reflecting the instantaneous interests, concerns, and intentions of the global population (1, 2). From this perspective, it is a short step to conclude that what people are searching for today is predictive of what they will do in the near future. Consumers contemplating buying a new camera may search to compare models; moviegoers may search to determine the opening date of a new film, or to locate cinemas showing it; and individuals planning a vacation may search for places of interest, to find airline tickets, or to price hotel rooms.
If so, it follows that by appropriately aggregating counts of search queries related to retail activity, moviegoing, or travel, one might be able to predict collective behavior of economic, cultural, or political interest. Determining the nature of behavior that can be predicted using search, the accuracy of such predictions, and the time scale over which predictions can be usefully made are therefore all questions of interest. Although previous work has considered the relation between search volume and offline outcomes, researchers have focused on the observation that search "predicts the present" (3, 4), meaning that search volume correlates with contemporaneous events. One study (8) showed that search volume for handpicked influenza-related queries was correlated with subsequently reported caseloads over the period 2004-2008, and Hulth et al. (9) found similar results in a study of search queries submitted on a Swedish medical Web site. An automated procedure for identifying informative queries is described in Ginsberg et al. (10), and based on that methodology, Google Flu Trend...
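The comparison the abstract describes, an autoregressive baseline versus the same model augmented with search volume, can be sketched on synthetic data. Everything below (the data-generating process, coefficients, and noise level) is invented; it only illustrates the model-comparison logic, not the paper's actual estimates.

```python
# Sketch: fit an AR(1) baseline and an AR(1)-plus-search-volume model by
# ordinary least squares on synthetic data, then compare in-sample R^2.
# The simulated series and its parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
search = rng.normal(size=n)  # standardized search-query volume
outcome = np.empty(n)
outcome[0] = 0.0
for t in range(1, n):
    # outcome depends on its own past and on search activity (by construction)
    outcome[t] = 0.6 * outcome[t - 1] + 0.8 * search[t] + rng.normal(scale=0.5)

y = outcome[1:]
X_ar = np.column_stack([np.ones(n - 1), outcome[:-1]])  # baseline: lag only
X_full = np.column_stack([X_ar, search[1:]])            # baseline + search

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(f"AR baseline R^2:  {r_squared(X_ar, y):.2f}")
print(f"AR + search R^2:  {r_squared(X_full, y):.2f}")  # higher, by construction
```

The size of the gap between the two fits plays the role of the "boost" the abstract refers to; in the paper's flu-tracking reexamination, that gap turns out to be modest.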
Models of networked diffusion that are motivated by analogy with the spread of infectious disease have been applied to a wide range of social and economic adoption processes, including those related to new products, ideas, norms and behaviors. However, it is unknown how accurately these models account for the empirical structure of diffusion over networks. Here we describe the diffusion patterns arising from seven online domains, ranging from communications platforms to networked games to microblogging services, each involving distinct types of content and modes of sharing. We find strikingly similar patterns across all domains. In particular, the vast majority of cascades are small, and are described by a handful of simple tree structures that terminate within one degree of an initial adopting "seed." In addition we find that structures other than these account for only a tiny fraction of total adoptions; that is, adoptions resulting from chains of referrals are extremely rare. Finally, even for the largest cascades that we observe, we find that the bulk of adoptions often takes place within one degree of a few dominant individuals. Together, these observations suggest new directions for modeling of online adoption processes.
Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked whom they intend to vote for. While representative polling has historically proven to be quite effective, it comes at considerable cost in time and money. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that, with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and that this can often be achieved faster and at less expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates in line with the forecasts of leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.
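The poststratification half of the adjustment can be illustrated with a toy example. Real MRP partitions respondents into many demographic cells and fits a multilevel model to estimate each cell's mean; the sketch below skips the modeling step and uses invented cell means and population shares for a single demographic (age group).

```python
# Sketch of poststratification: reweight cell-level estimates from a
# skewed sample by each cell's share of the target population.
# All cell means and shares are invented for illustration only.

cell_means = {"18-29": 0.60, "30-64": 0.48, "65+": 0.40}  # support rate per cell

# Composition of the (skewed) sample vs. the electorate:
sample_shares = {"18-29": 0.70, "30-64": 0.25, "65+": 0.05}
population_shares = {"18-29": 0.20, "30-64": 0.55, "65+": 0.25}

raw = sum(cell_means[c] * sample_shares[c] for c in cell_means)
post = sum(cell_means[c] * population_shares[c] for c in cell_means)

print(round(raw, 3))   # 0.56: pulled up by the over-represented young cell
print(round(post, 3))  # 0.484: reweighted to the electorate's composition
```

The gap between the raw and poststratified figures is exactly the bias a highly non-representative sample like the Xbox panel introduces, and the reweighting is what lets such a sample still yield a sensible population-level estimate.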