, and the Federal Reserve Board for helpful comments and to Mark Fontana for excellent research assistance. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research. NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.
Data analysts often need to transform an existing dataset, for example by filtering, into a new dataset for downstream analysis. Even trivial mistakes in this phase can introduce bias and lead to invalid conclusions. For example, consider a researcher identifying subjects for trials of a new statin drug. She might identify patients with a high dietary cholesterol intake as a population likely to benefit from the drug; however, selecting these individuals could bias the test population toward those with a generally unhealthy lifestyle, thereby compromising the analysis. Reducing the potential for bias in the dataset transformation process can minimize the need to later engage in the tedious, time-consuming process of trying to eliminate bias while preserving the target dataset. We propose a novel interaction model for explain-and-repair data transformation systems, in which users interactively define constraints for transformation code and the resultant data. The system satisfies these constraints as far as possible and provides an explanation for any problems encountered. We present an algorithm that yields filter-based transformation code satisfying user constraints. We implemented and evaluated a prototype of this architecture, Emeril, using both synthetic and real-world datasets. Our approach finds solutions 34% more often and 77% more quickly than the previous state-of-the-art solution.
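The statin example can be made concrete with a minimal sketch. This is not Emeril's algorithm; it only illustrates the idea of checking a user constraint on filtered data and reporting an explanation when the constraint fails. The function names, the column names, the sample records, and the "mean drift within a tolerance" constraint are all assumptions made for illustration.

```python
from statistics import mean

def apply_filter(rows, predicate):
    """Keep only the rows for which the filter predicate holds."""
    return [r for r in rows if predicate(r)]

def check_mean_constraint(original, filtered, column, tolerance):
    """Hypothetical constraint: the mean of `column` in the filtered data
    must stay within `tolerance` of its mean in the original data.
    Returns (ok, explanation)."""
    if not filtered:
        return False, "filter removed every row"
    base = mean(r[column] for r in original)
    new = mean(r[column] for r in filtered)
    drift = abs(new - base)
    if drift <= tolerance:
        return True, f"mean {column} drift {drift:.2f} within tolerance {tolerance}"
    return False, f"mean {column} drifted by {drift:.2f} (limit {tolerance})"

# Made-up patient records, for illustration only.
patients = [
    {"cholesterol_intake": 320, "exercise_hours": 1},
    {"cholesterol_intake": 310, "exercise_hours": 2},
    {"cholesterol_intake": 150, "exercise_hours": 6},
    {"cholesterol_intake": 140, "exercise_hours": 7},
]

# Filter on the target attribute, then test a constraint on a lifestyle covariate.
selected = apply_filter(patients, lambda r: r["cholesterol_intake"] > 300)
ok, explanation = check_mean_constraint(patients, selected, "exercise_hours", 1.0)
print(ok, explanation)
```

In this toy data the filter on cholesterol intake also shifts the exercise covariate, so the check fails and the explanation names the drifted column, which is the kind of feedback an explain-and-repair loop would surface to the user.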
Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches that are prone to spurious and low-quality results. In this paper, we propose a method for declaratively specifying a nowcasting model; this method involves processing a user query over a very large social media database, which can take hours. Due to the human-in-the-loop nature of constructing nowcasting models, slow runtimes place an extreme burden on the user. Thus we also propose a novel set of query optimization techniques, which allow users to quickly construct nowcasting models over very large datasets. Further, we propose a novel query quality alarm that helps users estimate phenomena even when historical ground truth data is not available. These contributions allow us to build a declarative nowcasting data management system, RaccoonDB, which yields high-quality results in interactive time. We evaluate RaccoonDB using 40 billion tweets collected over five years. We show that our automated system saves work over traditional manual approaches while improving result quality (57% more accurate in our user study), and that its query optimizations yield a 424x speedup, allowing it to process queries 123x faster than a 300-core Spark cluster while using only 10% of the computational resources.
Social media nowcasting, the use of online user activity to describe real-world phenomena, is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among non-computer scientists, yet nowcasting has largely remained a research topic. We believe a major obstacle to widespread adoption is the nowcasting feature selection problem. Typical nowcasting systems require the user to choose a handful of social media objects from a pool of billions of potential candidates, which can be a time-consuming and error-prone process. We have built RINGTAIL, a nowcasting system that helps the user by automatically suggesting high-quality signals. We demonstrate that RINGTAIL can make nowcasting easier by suggesting relevant features for a range of topics. The user provides just a short topic query (e.g., unemployment) and a small conventional dataset in order for RINGTAIL to quickly return a usable predictive nowcasting model.
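To make the feature selection problem concrete, the sketch below ranks candidate social media signals by their Pearson correlation with a small conventional dataset. This is a deliberately simple stand-in, not RINGTAIL's method, and correlation alone is the kind of purely statistical approach that can produce spurious matches. The phrases, the counts, the ground-truth series, and the function names are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def suggest_signals(candidates, ground_truth, k):
    """Return the k candidate signals most correlated with the ground truth."""
    scored = [(name, pearson(series, ground_truth))
              for name, series in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical weekly values: a small conventional dataset and phrase counts.
ground_truth = [5.0, 5.5, 6.1, 6.0, 5.2]
candidates = {
    "lost my job": [10, 12, 15, 14, 11],
    "new puppy": [7, 7, 6, 8, 7],
    "got hired": [9, 8, 5, 6, 9],
}

top = suggest_signals(candidates, ground_truth, 1)
print(top[0][0])
```

With billions of candidates, an exhaustive scan like this would be both slow and likely to surface chance correlations, which is why the choice of signals is treated as a problem in its own right.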