2019
DOI: 10.1080/10618600.2019.1660180

Valid Inference Corrected for Outlier Removal

Abstract: Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed …
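The undercoverage caused by detect-and-forget is easy to see in a small Monte Carlo experiment. The following is a minimal illustrative sketch (not code from the paper), assuming normal errors, no true outliers, and a crude rule that drops the points with the largest absolute OLS residuals before refitting and forming the usual normal-approximation intervals:

import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma, n_rep = 50, 1.0, 1.0, 2000
covered = 0
for _ in range(n_rep):
    x = rng.normal(size=n)
    y = beta * x + sigma * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    keep = np.abs(resid) < 1.5 * resid.std()          # crude "outlier" rule
    Xk, yk = X[keep], y[keep]
    coef_k, *_ = np.linalg.lstsq(Xk, yk, rcond=None)  # refit on cleaned data
    rk = yk - Xk @ coef_k
    s2 = rk @ rk / (len(yk) - 2)                      # naive residual variance
    se = np.sqrt(s2 * np.linalg.inv(Xk.T @ Xk)[1, 1])
    covered += (coef_k[1] - 1.96 * se <= beta <= coef_k[1] + 1.96 * se)
print("realized coverage:", covered / n_rep)          # typically below the nominal 0.95

Because the screening step discards exactly the points that inflate the residual variance, the post-removal standard error is biased downward and the realized coverage of the nominal 95% intervals typically falls below 95% even though the data contain no true outliers.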

Cited by 30 publications (28 citation statements)
References: 35 publications
“…where F(·; 0, ‖ν‖₂²σ², S) denotes the cumulative distribution function of the N(0, ‖ν‖₂²σ²) distribution truncated to the set S. In Section 4, we provide an efficient approach for analytically characterizing the truncation set S_λ^sib(ν_sib). To avoid numerical issues associated with the truncated normal, we compute (11) using methods described in Chen and Bien [2020].…”
Section: Inference on a Pair of Sibling Regions
Citation type: mentioning; confidence: 99%
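The quantity in the statement above is the CDF of a mean-zero normal truncated to a set S, which in these methods is typically a union of intervals. Below is a minimal numerical sketch (not the numerically robust method of Chen and Bien [2020] referenced in the statement; the function name is hypothetical):

import numpy as np
from scipy.stats import norm

def truncated_normal_cdf(x, tau, intervals):
    # CDF at x of a N(0, tau^2) variable truncated to a union of
    # disjoint intervals [(a1, b1), (a2, b2), ...]; illustrative helper.
    def mass(lo, hi):
        # P(lo < Z <= hi); using the survival function keeps the two terms
        # comparable in size when the interval lies in the right tail
        return norm.sf(lo, scale=tau) - norm.sf(hi, scale=tau)
    total = sum(mass(a, b) for a, b in intervals)
    below = sum(mass(a, min(b, x)) for a, b in intervals if a < x)
    return below / total

# e.g. the conditional probability that Z <= 2.5 given Z lies in S
# print(truncated_normal_cdf(2.5, 1.0, [(-1.0, 0.5), (2.0, 4.0)]))

When S sits very far in either tail, even these direct differences underflow or cancel, which is exactly the numerical issue that motivates the more careful computations the citing authors refer to.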
“…We prove Theorem 1 in Appendix S1.1. Related results have been used to develop selective inference frameworks for regression (Loftus & Taylor 2015, Yang et al. 2016) and outlier detection (Chen & Bien 2020). It follows from Theorem 1 that to compute the p-value defined in (7), it suffices to characterize the one-dimensional set…”
Section: A Test of No Difference in Means Between Two Clusters
Citation type: mentioning; confidence: 99%
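In selective inference frameworks of this kind, the selection event is often polyhedral, {Ay ≤ b}, and the one-dimensional set referred to above is the range of the test statistic ν'y that keeps the event satisfied once the component of y orthogonal to ν is held fixed. A minimal sketch of that standard algebra, in the spirit of the polyhedral lemma commonly used in this literature (the function name and the isotropic-covariance assumption are mine, not taken from the cited papers):

import numpy as np

def truncation_interval(y, A, b, nu):
    # Interval of values of nu'y compatible with the selection event
    # {A y <= b}, holding the component of y orthogonal to nu fixed.
    # Assumes y ~ N(mu, sigma^2 * I); hypothetical helper for illustration.
    nu = np.asarray(nu, dtype=float)
    c = nu / (nu @ nu)               # direction along which nu'y varies
    stat = nu @ y
    z = y - c * stat                 # part of y independent of nu'y
    alpha = A @ c
    resid = b - A @ z
    lower, upper = -np.inf, np.inf
    if np.any(alpha < 0):
        lower = np.max(resid[alpha < 0] / alpha[alpha < 0])
    if np.any(alpha > 0):
        upper = np.min(resid[alpha > 0] / alpha[alpha > 0])
    return lower, upper              # the one-dimensional truncation set

The selective p-value is then a tail probability of nu'y under a normal distribution truncated to [lower, upper], or to a union of such intervals when the selection event is a union of polyhedra.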
“…When standard methods are applied to the cleaned data, the resulting standard errors do not include the uncertainty from the data-cleaning step, such that the standard errors of the two-step approach are underestimated. For instance, Chen & Bien (2017) show that OLS regression after outlier removal results in confidence intervals that are much too small as they do not possess the nominal coverage. Consequently, the p-values from significance tests are too small and could incorrectly suggest significant results.…”
Section: Robust Statistics
Citation type: mentioning; confidence: 99%