2021
DOI: 10.48550/arxiv.2102.09080
Preprint
Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff assisted variable selection

Abstract: This paper revisits the knockoff-based multiple testing setup considered in Barber & Candès (2015) for variable selection applied to a linear regression model with n ≥ 2d, where n is the sample size and d is the number of explanatory variables. The BH method based on ordinary least squares estimates of the regression coefficients is adjusted to this setup, making it a valid p-value based FDR controlling method that does not rely on any specific correlation structure of the explanatory variables. Simulations a…
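As background, the classical Benjamini-Hochberg step-up procedure that the paper adjusts can be sketched as follows. This is a minimal NumPy implementation of the textbook BH rule, not the paper's adjusted variant:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # BH step-up: find the largest k with p_(k) <= k*q/m, then
    # reject the k hypotheses with the smallest p-values.
    below = sorted_p <= np.arange(1, m + 1) * q / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # 0-based index of largest such k
        reject[order[: k + 1]] = True
    return reject

# Example: at q = 0.1 this rejects the three smallest p-values.
benjamini_hochberg([0.01, 0.02, 0.03, 0.5], q=0.1)
```

The paper's contribution is to make a BH-type rule of this kind valid in the knockoff setting with OLS-based p-values; the sketch above shows only the generic step-up mechanism.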

Cited by 3 publications (4 citation statements)
References 7 publications
“…How can practitioners recognize which is better for their context? Can hybrid methods such as those of Sarkar and Tang (2021), or methods yet to be developed, balance the tradeoffs between the two approaches, preserving the strengths of the knockoffs framework without suffering its drawbacks? By identifying pitfalls for the knockoffs framework our results represent strides toward a more complete understanding of multiple testing in the linear model.…”
Section: Discussion
confidence: 99%
“…To begin to explain how knockoffs can go wrong, Section 1.2 formally recasts the knockoff filter as a conditional post-selection inference method built around a randomized estimator β̃ = β̂ + ω, where ω is user-generated Gaussian noise in the style of Tian and Taylor (2018). Our interpretation builds on a conditioning argument in Barber and Candès (2019) and an observation in Sarkar and Tang (2021) that knockoffs constructs two independent estimators for β.…”
Section: Introduction
confidence: 99%
“…More recently, randomization was used in Li and Fithian (2021) to recast the knockoff procedure of Barber and Candès (2015) as a selective inference procedure for the linear Gaussian model that adds noise to the OLS estimates (β̂) to create a "whitened" version of β̂ to use for hypothesis selection. The work of Sarkar and Tang (2021) explores similar ways of using knockoffs to split β̂ into independent …”
Section: Related Work On Data Splitting and Carving
confidence: 99%
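The splitting idea these citations attribute to the paper — obtaining two independent estimators from a single Gaussian estimate by adding and subtracting independent noise — can be illustrated with a toy sketch. This is the generic noise-splitting device, not the paper's actual knockoff-based construction: if β̂ ~ N(β, σ²I) and ω ~ N(0, σ²I) is independent user-generated noise, then β̂ + ω and β̂ − ω are independent Gaussians, since their cross-covariance is σ²I − σ²I = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_rep = 5, 1.0, 200_000

# Toy setting: beta_hat ~ N(beta, sigma^2 I), with beta = 0 here.
beta_hat = sigma * rng.standard_normal((n_rep, d))
# Independent user-generated noise with the same covariance.
omega = sigma * rng.standard_normal((n_rep, d))

u = beta_hat + omega  # one copy, e.g. to rank/select hypotheses
v = beta_hat - omega  # independent copy, held out for inference

# The empirical cross-covariance of u and v is ~0 coordinate-wise,
# consistent with their exact independence.
cross = (u * v).mean(axis=0) - u.mean(axis=0) * v.mean(axis=0)
```

One copy can then be used to select hypotheses and the other, untouched by selection, to test them — the tradeoff the surrounding citations discuss.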
“…Second, after the test statistic is determined for each hypothesis, how to combine these test statistics to derive a method that controls the false discovery rate is challenging. Many existing procedures, such as Ji and Zhao (2014); Barber and Candès (2015, 2019); Xing et al. (2021); Sarkar and Tang (2021), work on the (generalized) linear regression models. In Candès et al. (2018), the authors considered an arbitrary joint distribution of y and x and proposed the model-X knockoff to control FDR.…”
Section: Introduction
confidence: 99%