Given samples from an unknown multivariate distribution p, is it possible to distinguish whether p is the product of its marginals versus p being far from every product distribution? Similarly, is it possible to distinguish whether p equals a given distribution q versus p and q being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity.Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample and time efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model.
The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates.The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables.
A popular methodology for building binary decision-making classifiers in the presence of imperfect information is to first construct a calibrated non-binary "scoring" classifier, and then to post-process this score to obtain a binary decision. We study various group fairness properties of this methodology, when the non-binary scores are calibrated over all protected groups, and with a variety of post-processing algorithms. Specifically, we show:• There does not exist a general way to post-process a calibrated classifier to equalize protected groups' positive or negative predictive value (PPV or NPV). For certain "nice" calibrated classifiers, either PPV or NPV can be equalized when the post-processor uses different thresholds across protected groups. Still, when the post-processing consists of a single global threshold across all groups, natural fairness properties, such as equalizing PPV in a nontrivial way, do not hold even for "nice" classifiers.• When the post-processing stage is allowed to defer on some decisions (that is, to avoid making a decision by handing off some examples to a separate process), then for the nondeferred decisions, the resulting classifier can be made to equalize PPV, NPV, false positive rate (FPR) and false negative rate (FNR) across the protected groups. This suggests a way to partially evade the impossibility results of Chouldechova and Kleinberg et al., which preclude equalizing all of these measures simultaneously. We also present different deferring strategies and show how they affect the fairness properties of the overall system.We then evaluate our post-processing techniques using the COMPAS data set from 2016. * The order of authors is alphabetical and does not convey any information on the authors' relative contributions.
Given samples from an unknown multivariate distribution p, is it possible to distinguish whether p is the product of its marginals versus p being far from every product distribution? Similarly, is it possible to distinguish whether p equals a given distribution q versus p and q being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters [BFF + 01, Pan08, VV17, ADK15, DK16]. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity.Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample and time efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.