Sublinear algorithms for outlier detection and generalized closeness testing

Acharya, Jayadev; Jafarpour, Ashkan; Orlitsky, Alon; Suresh, Ananda Theertha

doi:10.1109/isit.2014.6875425

Cited by 16 publications

(24 citation statements)

References 17 publications


“…Finally, we remark that the binary classification problem is closely related with the so-called two sample homogeneity testing problem [9, Sec. II-C] and the closeness testing problem [10], [11], [12] where given two i.i.d. generated sequences $X^N$ and $Y^n$, one aims to determine whether the two sequences are generated according to the same distribution or not.…”

Section: B Main Results (mentioning)

confidence: 99%

Second-order asymptotically optimal statistical classification

Zhou

Tan

Motani

2019

Information and Inference: A Journal of the IMA


Motivated by real-world machine learning applications, we analyze approximations to the non-asymptotic fundamental limits of statistical classification. In the binary version of this problem, given two training sequences generated according to two unknown distributions P1 and P2, one is tasked to classify a test sequence which is known to be generated according to either P1 or P2. This problem can be thought of as an analogue of the binary hypothesis testing problem but in the present setting, the generating distributions are unknown. Due to finite sample considerations, we consider the second-order asymptotics (or dispersion-type) tradeoff between type-I and type-II error probabilities for tests which ensure that (i) the type-I error probability for all pairs of distributions decays exponentially fast and (ii) the type-II error probability for a particular pair of distributions is non-vanishing. We generalize our results to classification of multiple hypotheses with the rejection option.



“…Importantly, our tester straightforwardly extends to unequal-sized samples, giving the first optimal tester in this setting. Closeness testing with unequal sized samples was considered in [AJOS14] that gives sample upper and lower bounds with a polynomial gap between them. Our tester uses $m_1 = \Omega(\max(n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^{2}))$ samples from one distribution and $m_2 = O(\max(n m_1^{-1/2}/\epsilon^{2}, \sqrt{n}/\epsilon^{2}))$ from the other.…”

Section: Our Contributions (mentioning)

confidence: 99%
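
The scaling in the statement above can be made concrete with a small calculation. The following is a minimal sketch, not code from either paper: the function name `unequal_sample_sizes` and the constants `c1` and `c2` are assumptions introduced here, since the quoted bounds hide unspecified constant factors inside the Ω and O notation.

```python
import math
from typing import Optional, Tuple


def unequal_sample_sizes(
    n: int,
    eps: float,
    m1: Optional[float] = None,
    c1: float = 1.0,  # assumed constant; the quoted bound only fixes the scaling
    c2: float = 1.0,  # assumed constant; the quoted bound only fixes the scaling
) -> Tuple[float, float]:
    """Evaluate the quoted sample-size expressions for domain size n and distance eps."""
    if n <= 0 or not 0 < eps <= 1:
        raise ValueError("need n > 0 and 0 < eps <= 1")

    # Smallest m1 allowed by m1 = Omega(max(n^{2/3}/eps^{4/3}, n^{1/2}/eps^2)).
    m1_min = c1 * max(n ** (2 / 3) / eps ** (4 / 3), math.sqrt(n) / eps ** 2)
    if m1 is None:
        m1 = m1_min
    elif m1 < m1_min:
        raise ValueError("m1 is below the quoted lower threshold")

    # m2 = O(max(n * m1^{-1/2} / eps^2, sqrt(n) / eps^2)).
    m2 = c2 * max(n / (math.sqrt(m1) * eps ** 2), math.sqrt(n) / eps ** 2)
    return m1, m2


if __name__ == "__main__":
    for m1 in (None, 1e5, 1e6):
        print(unequal_sample_sizes(10**6, 0.5, m1=m1))
```

With the constants set to 1, taking the smallest admissible `m1` gives `m2` equal to `m1`, and increasing `m1` toward `n` drives `m2` down to the `sqrt(n)/eps^2` floor, which is the trade-off the statement describes.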

A New Approach for Testing Properties of Discrete Distributions

Diakonikolas

Kane

2016

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)


We study problems in distribution property testing: Given sample access to one or more unknown discrete distributions, we want to determine whether they have some global property or are $\epsilon$-far from having the property in $\ell_1$ distance (equivalently, total variation distance, or "statistical distance"). In this work, we give a novel general approach for distribution testing. We describe two techniques: our first technique gives sample-optimal testers, while our second technique gives matching sample lower bounds. As a consequence, we resolve the sample complexity of a wide variety of testing problems.

Our upper bounds are obtained via a modular reduction-based approach. Our approach yields optimal testers for numerous problems by using a standard $\ell_2$-identity tester as a blackbox. Using this recipe, we obtain simple estimators for a wide range of problems, encompassing most problems previously studied in the TCS literature, namely: (1) identity testing to a fixed distribution, (2) closeness testing between two unknown distributions (with equal/unequal sample sizes), (3) independence testing (in any number of dimensions), (4) closeness testing for collections of distributions, and (5) testing histograms. For all of these problems, our testers are sample-optimal, up to constant factors. With the exception of (1), ours are the first sample-optimal testers for the corresponding problems. Moreover, our estimators are significantly simpler to state and analyze compared to previous results.

As an important application of our reduction-based technique, we obtain the first nearly instance-optimal algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions, as opposed to merely their domain size, and is much better compared to the worst-case optimal $\ell_1$-tester in most natural instances. Moreover, our technique naturally generalizes to other metrics beyond the $\ell_1$-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance.

Our lower bounds are obtained via a direct information-theoretic approach: Given a candidate hard instance, our proof proceeds by bounding the mutual information between appropriate random variables. While this is a classical method in information theory, prior to our work, it had not been used in distribution property testing. Previous lower bounds relied either on the birthday paradox, or on moment-matching and were thus restricted to symmetric properties. Our lower bound approach does not suffer from any such restrictions and gives tight sample lower bounds for the aforementioned problems.


“…The probability estimates are accurate up to a standard deviation of 0.003. The results of Figure 2 indicate the accuracy of the approximation predicted by (42) assuming that µ is a uniform distribution on an alphabet of size 8. Clearly, the approximation is quite accurate in this regime.…”

Section: B Outlier Hypothesis Testing (mentioning)

confidence: 77%

“…The objective is to determine whether or not both strings are drawn from identical distributions in P(Z). Homogeneity testing is also closely related to the problem of closeness testing [40]-[42]. As before we work in the regime where m and n are linearly related as m = λn where λ is a known constant.…”

Section: Homogeneity Testing (mentioning)

confidence: 99%
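
To illustrate the homogeneity testing setup described in the statement above, here is a minimal sketch of one standard statistic for the problem: the generalized likelihood ratio, which compares each string's empirical distribution with the pooled empirical distribution. This is offered as an assumption for illustration, not necessarily the statistic analyzed in the citing paper; the function name `homogeneity_statistic` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def homogeneity_statistic(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Return n*D(Px || M) + m*D(Py || M), where Px and Py are the empirical
    distributions of x and y, M is the pooled empirical distribution, and
    D is the KL divergence in nats."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    cx, cy = Counter(x), Counter(y)
    stat = 0.0
    for symbol in set(cx) | set(cy):
        pooled = (cx[symbol] + cy[symbol]) / (n + m)
        for count, size in ((cx[symbol], n), (cy[symbol], m)):
            if count > 0:  # terms with zero count contribute nothing
                stat += count * math.log((count / size) / pooled)
    return stat


if __name__ == "__main__":
    # Small values suggest a common source; the decision threshold is left to the user.
    print(homogeneity_statistic("abababab", "abababab"))
    print(homogeneity_statistic("aaaaaaaa", "bbbb"))
```

The statistic is zero when the two empirical distributions coincide and grows as they separate. In the regime of the statement, the second string would have length m = λn for a known constant λ, and a test would reject homogeneity when the statistic exceeds a chosen threshold.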

Weak Convergence Analysis of Asymptotically Optimal Hypothesis Tests

Unnikrishnan

Huang

2016

IEEE Trans. Inform. Theory


In recent years solutions to various hypothesis testing problems in the asymptotic setting have been proposed using results from large deviations theory. Such tests are optimal in terms of appropriately defined error-exponents. For the practitioner, however, error probabilities in the finite sample size setting are more important. In this paper we show how results on weak convergence of the test statistic can be used to obtain better approximations for the error probabilities in the finite sample size setting. While this technique is popular among statisticians for common tests, we demonstrate its applicability for several recently proposed asymptotically optimal tests, including tests for robust goodness of fit, homogeneity tests, outlier hypothesis testing, and graphical model estimation.

