What advantage do sequential procedures provide over batch algorithms for testing properties of unknown distributions? Focusing on the problem of testing whether two distributions D1 and D2 on {1, . . . , n} are equal or ε-far, we give several answers to this question. We show that for a small alphabet size n, there is a sequential algorithm that outperforms any batch algorithm by a factor of at least 4 in terms of sample complexity. For a general alphabet size n, we give a sequential algorithm that uses no more samples than its batch counterpart, and possibly fewer if the actual distance TV(D1, D2) between D1 and D2 is larger than ε. As a corollary, letting ε go to 0, we obtain a sequential algorithm for testing closeness when no a priori bound on TV(D1, D2) is given that has a sample complexity Õ(n^(2/3)/TV(D1, D2)^(4/3)): this improves over the Õ(n/(log(n) · TV(D1, D2)²)) tester of Daskalakis and Kawase (2017) and is optimal up to multiplicative constants. We also establish limitations of sequential algorithms for the problems of testing identity and closeness: they can improve the worst-case number of samples by at most a constant factor.

[…] log log(1/d)/d²) samples. We design the stopping rules according to a time-uniform concentration inequality deduced from McDiarmid's inequality, where we use the ideas of Howard et al. (2018, 2020) in order to obtain powers of log log(1/d) instead of log(1/d). We show that the sample complexity for the closeness testing problem given by Eq. (1) is optimal up to multiplicative constants in the worst-case setting (i.e., when looking for a bound independent of the distributions D1 and D2). To do so, we construct two families of distributions whose cross TV distance is exactly d ≥ ε and which are hard to distinguish unless we have a number of samples given by Eq. (1). This latter lower bound is based on properties of the KL divergence along with Wald's lemma.
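The stopping-rule idea described above can be illustrated by a toy Python sketch. This is not the paper's algorithm: the decision rule (comparing an empirical TV estimate to the midpoint ε/2), the constants in the confidence radius, and the restriction to small alphabets are all assumptions made for illustration. The radius does, however, carry the characteristic log log(t) term of the iterated-logarithm boundaries of Howard et al., which is what lets the tester stop after roughly 1/d² rounds (up to log log factors) when the true distance d exceeds ε, rather than always paying the worst-case 1/ε².

```python
import math
import random
from collections import Counter

def sequential_closeness_test(draw1, draw2, epsilon, delta=0.05, max_samples=200_000):
    """Toy sequential closeness tester (illustrative constants, not the paper's).

    Draws one sample from each distribution per round, maintains the empirical
    TV distance between the two empirical distributions, and stops as soon as a
    time-uniform confidence radius places the estimate clearly on one side of
    epsilon/2.  The log log(t) term in the radius mimics iterated-logarithm
    confidence sequences; the constants are ad hoc and only sensible for small
    alphabets, where the bias of the empirical TV distance is negligible.
    """
    counts1, counts2 = Counter(), Counter()
    for t in range(1, max_samples + 1):
        counts1[draw1()] += 1
        counts2[draw2()] += 1
        support = set(counts1) | set(counts2)
        tv_hat = 0.5 * sum(abs(counts1[x] - counts2[x]) for x in support) / t
        # stylized time-uniform radius ~ sqrt((log log t + log(1/delta)) / t)
        rad = math.sqrt((2.0 * math.log(max(math.log(t + 2.0), math.e))
                         + math.log(4.0 / delta)) / t)
        if tv_hat - rad > epsilon / 2:
            return "far", t       # estimate confidently above the midpoint
        if tv_hat + rad < epsilon / 2:
            return "equal", t     # estimate confidently below the midpoint
    return "undecided", max_samples

random.seed(0)
# D1 = Bernoulli(1/2), D2 = Bernoulli(1/10): true TV distance 0.4 >> epsilon = 0.1,
# so the tester should stop far earlier than a batch tester calibrated for 0.1.
verdict, n_used = sequential_closeness_test(
    lambda: random.randint(0, 1),
    lambda: 0 if random.random() < 0.9 else 1,
    epsilon=0.1)
print(verdict, n_used)
```

Note the adaptivity: the stopping time shrinks with the (unknown) true distance, which is exactly the advantage over a batch algorithm whose sample size must be fixed in advance from ε alone.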
Using similar techniques, we also establish upper and lower bounds for testing identity that match up to multiplicative constants. In addition, we establish a lower bound on the number of queries that matches Eq. (2) up to multiplicative constants. The proof is inspired by Karp and Kleinberg (2007), who proved lower bounds for testing whether the mean of a sequence of i.i.d. Bernoulli variables is smaller or larger than 1/2. We construct well-chosen distributions D_k (for integer k) that are at distance ε_k (with ε_k decreasing to 0) from uniform, and then use properties of the Kullback–Leibler divergence to show that no algorithm can distinguish between D_k and uniform using fewer samples than in Eq. (2). Note that we could have used the closeness testing lower bound described in the previous paragraph and let ε = 0; however, this gives sub-optimal lower bounds.

Discussion of the setting and related work. It is clearly impossible to test D1 = D2 versus D1 ≠ D2 in finite time: this is why the slack parameter ε is introduced in this setting. Other authors, like Daskalakis and Kawase (2017), make a different choice: they fix no ε, but only req...