We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems.
In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0, 2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index $i$ with probability $|f_i|^p / \|f\|_p^p$, and is allowed to fail with some probability $\delta$. So far, for $p > 0$ no algorithm has been shown to solve the problem exactly using $\mathrm{poly}(\log n)$ bits of space. In 2010, Monemizadeh and Woodruff introduced an approximate $L_p$ sampler, which outputs $i$ with probability $(1 \pm \nu)|f_i|^p / \|f\|_p^p$, using space polynomial in $\nu^{-1}$ and $\log(n)$. The space complexity was later reduced by Jowhari, Saglam, and Tardos to roughly $O(\nu^{-p} \log^2 n \log \delta^{-1})$ for $p \in (0, 2)$, which matches the $\Omega(\log^2 n \log \delta^{-1})$ lower bound in terms of $n$ and $\delta$, but is loose in terms of $\nu$. Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of $\nu$; not even a bound of $\Omega(\nu^{-1})$ is known. In this paper, we explain this phenomenon by demonstrating the existence of an $O(\log^2 n \log \delta^{-1})$-bit perfect $L_p$ sampler for $p \in (0, 2)$. This shows that $\nu$ need not factor into the space of an $L_p$ sampler, which closes the complexity of the problem for this range of $p$. For $p = 2$, our bound is $O(\log^3 n \log \delta^{-1})$ bits, which matches the prior best known upper bound of $O(\nu^{-2} \log^3 n \log \delta^{-1})$, but has no dependence on $\nu$. For $p < 2$, our bound holds in the random oracle model, matching the lower bounds in that model. Moreover, we show that our algorithm can be derandomized with only an $O((\log \log n)^2)$ blow-up in the space (and no blow-up for $p = 2$).
Our derandomization technique is quite general, and can be used to derandomize a large class of linear sketches, including the more accurate count-sketch variant of [MP14], resolving an open question in that paper. Finally, we show that a $(1 \pm \epsilon)$ relative error estimate of the frequency $f_i$ of the sampled index $i$ can be obtained using an additional $O(\epsilon^{-p} \log n)$ bits of space for $p < 2$, and $O(\epsilon^{-2} \log^2 n)$ bits for $p = 2$, which was possible before only by running the prior algorithms with $\nu = \epsilon$.
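To make the target distribution of a perfect $L_p$ sampler concrete, the following is a minimal offline sketch in Python. It is not the small-space algorithm of the abstract above: it stores the entire frequency vector (linear space) and only illustrates the output probabilities $|f_i|^p / \|f\|_p^p$ that a perfect sampler must match. The function names and the representation of the stream as `(index, delta)` pairs are assumptions made for illustration.

```python
import random
from collections import defaultdict


def frequency_vector(stream):
    """Accumulate (index, delta) turnstile updates into a frequency vector f."""
    f = defaultdict(float)
    for i, delta in stream:
        f[i] += delta
    return dict(f)


def lp_distribution(f, p):
    """Return Pr[i] = |f_i|^p / ||f||_p^p over the nonzero coordinates of f."""
    weights = {i: abs(v) ** p for i, v in f.items() if v != 0}
    total = sum(weights.values())  # equals ||f||_p^p
    if total == 0:
        raise ValueError("frequency vector is zero; no index can be sampled")
    return {i: w / total for i, w in weights.items()}


def offline_lp_sample(f, p, rng=random):
    """Draw one index from the exact L_p distribution (linear-space reference)."""
    dist = lp_distribution(f, p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


if __name__ == "__main__":
    # Hypothetical stream with insertions and deletions.
    stream = [(0, 3), (1, 1), (0, -1), (2, -2)]
    f = frequency_vector(stream)
    print(lp_distribution(f, p=1))
    print(offline_lp_sample(f, p=1))
```

An approximate sampler in the sense described above would be allowed to deviate from each of these probabilities by a multiplicative factor of $(1 \pm \nu)$; a perfect sampler corresponds to removing that $\nu$-dependent distortion.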
Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a $\Theta(\log(n))$ multiplicative factor more space for turnstile streams than for insertion-only streams. This complexity gap often arises because the underlying frequency vector $f$ is very close to $0$, after accounting for all insertions and deletions to items. Signal detection in such streams is difficult, given the large number of deletions. In this work, we propose an intermediate model which, given a parameter $\alpha \geq 1$, lower bounds the norm $\|f\|_p$ by a $1/\alpha$-fraction of the $L_p$ mass of the stream had all updates been positive. Here, for a vector $f$, $\|f\|_p = \left(\sum_{i=1}^{n} |f_i|^p\right)^{1/p}$, and the value of $p$ we choose depends on the application. This gives a fluid medium between insertion-only streams (with $\alpha = 1$) and turnstile streams (with $\alpha = \mathrm{poly}(n)$), and allows for analysis in terms of $\alpha$. We show that for streams with this $\alpha$-property, for many fundamental streaming problems we can replace an $O(\log(n))$ factor in the space usage for algorithms in the turnstile model with an $O(\log(\alpha))$ factor. This is true for identifying heavy hitters, inner product estimation, $L_0$ estimation, $L_1$ estimation, $L_1$ sampling, and support sampling. For each problem, we give matching or nearly matching lower bounds for $\alpha$-property streams. We note that in practice, many important turnstile data streams are in fact $\alpha$-property streams for small values of $\alpha$. For such applications, our results represent significant improvements in efficiency for all the aforementioned problems.
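As a rough illustration of the $\alpha$-property, the sketch below computes the smallest $\alpha$ for which a given stream satisfies it. It reads "the $L_p$ mass of the stream had all updates been positive" as the norm $\|g\|_p$ of the vector $g$ obtained by replacing every update with its absolute value, so the condition becomes $\|f\|_p \geq \|g\|_p / \alpha$. That reading, the function name `smallest_alpha`, the restriction to $p > 0$, and the `(index, delta)` stream format are assumptions for illustration, and the computation is offline rather than a streaming algorithm.

```python
from collections import defaultdict


def smallest_alpha(stream, p=1.0):
    """Smallest alpha with ||f||_p >= ||g||_p / alpha, for p > 0.

    f is the actual frequency vector; g is the frequency vector the same
    stream would produce had every update been positive.
    """
    if p <= 0:
        raise ValueError("this sketch only handles p > 0")
    f = defaultdict(float)
    g = defaultdict(float)
    for i, delta in stream:
        f[i] += delta
        g[i] += abs(delta)
    norm_f = sum(abs(v) ** p for v in f.values()) ** (1.0 / p)
    norm_g = sum(v ** p for v in g.values()) ** (1.0 / p)
    if norm_f == 0:
        # Everything cancelled (or the stream is empty): no finite alpha works.
        return float("inf")
    return norm_g / norm_f


if __name__ == "__main__":
    insertion_only = [(0, 1), (1, 2)]
    with_deletions = [(0, 3), (0, -1), (1, 2)]
    print(smallest_alpha(insertion_only))   # insertion-only streams give alpha = 1
    print(smallest_alpha(with_deletions))
```

Under this reading, an insertion-only stream always yields $\alpha = 1$, and $\alpha$ grows as more of the inserted mass is cancelled by deletions, which is the regime where the abstract's $O(\log(\alpha))$ bounds degrade toward the turnstile $O(\log(n))$ bounds.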