No abstract
We describe inferactive data analysis, so‐named to denote an interactive approach to data analysis with an emphasis on inference after data analysis. Our approach is a compromise between Tukey's exploratory and confirmatory data analysis allowing also for Bayesian data analysis. We see this as a useful step in concrete providing tools (with statistical guarantees) for current data scientists. The basis of inference we use is (a conditional approach to) selective inference, in particular its randomized form. The relevant reference distributions are constructed from what we call a DAG‐DAG—a Data Analysis Generative DAG, and a selective change of variables formula is crucial to any practical implementation of inferactive data analysis via sampling these distributions. We discuss a canonical example of an incomplete cross‐validation test statistic to discriminate between black box models, and a real HIV dataset example to illustrate inference after making multiple queries on data.
A common practice in IV studies is to check for instrument strength, i.e. its association to the treatment, with an F-test from regression. If the F-statistic is above some threshold, usually 10, the instrument is deemed to satisfy one of the three core IV assumptions and used to test for the treatment effect. However, in many cases, the inference on the treatment effect does not take into account the strength test done a priori. In this paper, we show that not accounting for this pretest can severely distort the distribution of the test statistic and propose a method to correct this distortion, producing valid inference. A key insight in our method is to frame the F-test as a randomized convex optimization problem and to leverage recent methods in selective inference. We prove that our method provides conditional and marginal Type I error control. We also extend our method to weak instrument settings. We conclude with a reanalysis of studies concerning the effect of education on earning where we show that not accounting for pre-testing can dramatically alter the original conclusion about education's effects. | INTRODUCTION | MotivationInstrumental variables (IV) is a commonly used approach in economics, epidemiology, genetics, and health policy to estimate the effect of an exposure, treatment, or policy on an outcome of interest; see Angrist and Krueger (2001), Robins (2006), andBaiocchi et al. (2014) for overviews. IV methods require finding variables, known as 1
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.