In this chapter we elucidate four main themes. The first is that modern data analyses, including "Big Data" analyses, often rely on data from different sources, which can present challenges in constructing statistical models that can make effective use of all of the data. The second theme is that although data analysis is usually centralized, frequently the final outcome is to provide information or allow decision-making for individuals. Third, data analyses often have multiple uses by design: the outcomes of the analysis are intended to be used by more than one person or group, for more than one purpose. Finally, issues of privacy and confidentiality can cause problems in more subtle ways than are usually considered; we will illustrate this point by discussing a case in which there is substantial and effective political opposition to simply acknowledging the geographic distribution of a health hazard.A researcher analyzes some data and learns something important. What happens next? What does it take for the results to make a difference in people's lives? In this chapter we tell a story -a true story -about a statistical analysis that should have changed government policy, but didn't. The project was a research success that did not make its way into policy, and we think it provides some useful insights into the interplay between locally-collected data, statistical analysis, and individual decision making.
A Dataset Compiled from Many Local SourcesBefore getting to our story we set the stage with a brief discussion of general issues regarding data availability. Some data analysis problems, even large or complicated ones, involve data from a single source or collected through a single mechanism. For example, the U.S. census generates data on hundreds of millions of people using just a few different survey instruments. More typically, * We thank Mike Alvarez for helpful comments and the National Science Foundation for partial support of this work. no raw data may be published. This may be acceptable inasmuch as it still allows publishing of summary statistics and derived quantities, but it might prevent publishing even exemplary plots or tables of raw data, and might make it hard for others to evaluate the validity of the work. Imagine the problems of verifying global temperature changes if the raw data could not be shared.
2Although the researcher or data analyst would always prefer access to all data that can be had, and the ability to publish all data and related analyses, owners or controllers of data often have good reasons not to share information, or, if it is shared, to insist that the data be available only to a restricted group of researchers. Someone who is selling her house may not wish it known that the basement sometimes floods, and a political candidate might be reluctant to answer the question, "Have you ever had an affair?"Data privacy issues can lead to a sort of prisoner's dilemma in which a group of people would benefit if they were all to share their data, but no single person's expected benefi...