Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent, and that well-known formulas quantifying the variability of these estimates are available for statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and for the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through estimation of this measure.
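The bias in point (i) is easy to reproduce in simulation. The sketch below is a minimal illustration under assumed choices (NumPy for data generation, statsmodels for the unpenalized fit; neither is taken from the paper): with p/n = 0.2 and a block of equal nonzero coefficients, the average MLE of a signal coordinate lands noticeably above its true value.

```python
# Minimal sketch (assumptions: Gaussian covariates, statsmodels' Logit for plain
# maximum likelihood). Illustrates the multiplicative bias of the high-dimensional MLE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 800, 160                      # kappa = p/n = 0.2
beta = np.zeros(p)
beta[:25] = 0.4                      # signal strength: Var(x'beta) = 25 * 0.4**2 = 4

reps, avg = 30, np.zeros(p)
for _ in range(reps):
    X = rng.standard_normal((n, p))                  # independent Gaussian features
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    mle = sm.Logit(y, X).fit(disp=0, maxiter=200)    # unpenalized logistic MLE
    avg += mle.params / reps

# Classically avg[:25] should hover around 0.4; in this regime it concentrates
# around alpha * 0.4 for some inflation factor alpha > 1.
print("true coefficient:       0.4")
print("mean MLE over signals: ", avg[:25].mean())
```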
Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number p of variables, twice the log-likelihood ratio (LLR), 2Λ, is distributed as a χ²_k variable in the limit of large sample size n; here, χ²_k is a chi-square with k degrees of freedom, k being the number of variables being tested. In this paper, we prove that when p is not negligible compared with n, Wilks' theorem does not hold and the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that n and p grow large in such a way that p/n → κ for some constant κ < 1/2. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, 2Λ → α(κ)·χ²_k in distribution, where the scaling factor α(κ) is greater than one as soon as the dimensionality ratio κ is positive. Hence, the LLR is larger than classically assumed. For instance, when κ = 0.3, α(κ) ≈ 1.5. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, from non-asymptotic random matrix theory, and from convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to other regression models, such as the probit model.
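Because E[χ²_k] = k, the inflation is visible without knowing α(κ) in advance: under the null, the empirical mean of 2Λ for a single tested coordinate estimates α(κ) directly. The Monte Carlo sketch below makes this concrete under assumed simplifications (a global null, Gaussian features, statsmodels for the fits; none of these choices come from the paper).

```python
# Minimal sketch: the mean of the LLR for testing one coefficient drifts above 1,
# the chi-square_1 mean, once p/n is non-negligible (here kappa = 0.3).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 500, 150                      # kappa = p/n = 0.3 < 1/2
llr = []
for _ in range(100):
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 0.5, size=n)                 # global null: y independent of X
    full = sm.Logit(y, X).fit(disp=0, maxiter=200)
    restricted = sm.Logit(y, X[:, 1:]).fit(disp=0, maxiter=200)
    llr.append(2 * (full.llf - restricted.llf))      # 2*Lambda for H0: beta_1 = 0

# Wilks would predict a mean of E[chi2_1] = 1; the empirical mean instead
# estimates alpha(kappa), reported as roughly 1.5 at kappa = 0.3.
print("empirical mean of 2*Lambda:", np.mean(llr))
```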
This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp 'phase transition'. We introduce an explicit boundary curve h_MLE, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following property: in the limit of large sample size n and number of features p with p/n → κ, if the problem is sufficiently high-dimensional in the sense that κ > h_MLE, then the MLE does not exist with probability one; conversely, if κ < h_MLE, the MLE asymptotically exists with probability one.
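The existence question can also be probed numerically: the logistic MLE is finite exactly when no direction linearly separates the two classes, and separability is a linear-programming feasibility problem. The sketch below is my own construction (assuming SciPy's linprog), run at zero signal strength, where the boundary reduces to the classical κ = 1/2 discussed next.

```python
# Minimal sketch: test MLE existence via an LP separability check. A separating
# direction w satisfies s_i * x_i'w > 0 for all i; after rescaling this is the
# feasibility of s_i * x_i'w >= 1. (Quasi-complete separation, where some of the
# constraints hold with equality, is ignored; it has probability zero for
# continuous covariates.)
import numpy as np
from scipy.optimize import linprog

def mle_exists(X, y):
    """True if the data are not linearly separable, so the logistic MLE is finite."""
    s = 2 * y - 1                                    # labels in {-1, +1}
    res = linprog(c=np.zeros(X.shape[1]),
                  A_ub=-s[:, None] * X, b_ub=-np.ones(len(y)),
                  bounds=(None, None), method="highs")
    return res.status != 0                           # infeasible LP => MLE exists

rng = np.random.default_rng(2)
n = 400
for kappa in (0.1, 0.3, 0.45, 0.6):                  # no signal: boundary at 1/2
    X = rng.standard_normal((n, int(kappa * n)))
    y = rng.binomial(1, 0.5, size=n)
    print(f"kappa = {kappa}: MLE exists -> {mle_exists(X, y)}")
```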
Cover's result

One notable exception against this background dates back to the seminal work of Cover [5, 6] concerning the separating capacities of decision surfaces. When applied to logistic regression, Cover's result implies that, in the absence of signal, the data points are linearly separable with probability approaching one as soon as p/n exceeds 1/2, in which case the MLE does not exist.