An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

Bhattacharyya, Arnab; Dey, Palash; Woodruff, David P.

doi:10.1145/2902251.2902284

Cited by 17 publications

(24 citation statements)

References 66 publications


“…It is known that there are significant differences between these models. For instance, identifying an index i ∈ [n] for which $|x_i| > \frac{1}{10} \sum_{j=1}^{n} |x_j|$ can be accomplished with only O(log(n)) bits of space in the insertion-only model [10], but requires $\Omega(\log^2(n))$ bits in the turnstile model [38]. This log(n) gap between the complexity in the two models occurs in many other important streaming problems.…”

Section: Introduction (mentioning)

confidence: 99%

Data Streams with Bounded Deletions

Jayaram

Woodruff

2018

Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Self Cite


Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. This complexity gap often arises because the underlying frequency vector f is very close to 0, after accounting for all insertions and deletions to items. Signal detection in such streams is difficult, given the large number of deletions. In this work, we propose an intermediate model which, given a parameter α ≥ 1, lower bounds the norm $\|f\|_p$ by a 1/α-fraction of the $L_p$ mass of the stream had all updates been positive. Here, for a vector f, $\|f\|_p = \left(\sum_{i=1}^{n} |f_i|^p\right)^{1/p}$, and the value of p we choose depends on the application. This gives a fluid medium between insertion only streams (with α = 1), and turnstile streams (with α = poly(n)), and allows for analysis in terms of α. We show that for streams with this α-property, for many fundamental streaming problems we can replace a O(log(n)) factor in the space usage for algorithms in the turnstile model with a O(log(α)) factor. This is true for identifying heavy hitters, inner product estimation, $L_0$ estimation, $L_1$ estimation, $L_1$ sampling, and support sampling. For each problem, we give matching or nearly matching lower bounds for α-property streams. We note that in practice, many important turnstile data streams are in fact α-property streams for small values of α. For such applications, our results represent significant improvements in efficiency for all the aforementioned problems.
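The α-property described in this abstract can be illustrated for the case p = 1 with a short sketch. It is only an illustration under stated assumptions: the function name `l1_alpha` and the `(item, delta)` update format are hypothetical, and the code stores the whole frequency vector, so it checks the definition rather than reproducing any small-space algorithm from the paper.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def l1_alpha(updates: Iterable[Tuple[Hashable, int]]) -> float:
    """Smallest alpha >= 1 with ||f||_1 >= (1/alpha) * (L1 mass of all updates).

    `updates` is a sequence of (item, delta) pairs (hypothetical format);
    delta > 0 is an insertion and delta < 0 is a deletion.
    """
    freq = defaultdict(int)  # final frequency vector f
    total_mass = 0           # L1 mass had every update been positive
    for item, delta in updates:
        freq[item] += delta
        total_mass += abs(delta)

    norm = sum(abs(v) for v in freq.values())  # ||f||_1
    if total_mass == 0:
        return 1.0           # empty stream: trivially insertion-only
    if norm == 0:
        return float("inf")  # everything cancelled: no finite alpha works
    return total_mass / norm


# Insertion-only stream: alpha is exactly 1.
print(l1_alpha([("a", 1), ("b", 1), ("a", 1)]))
# Stream with one deletion: mass 6, ||f||_1 = 4.
print(l1_alpha([("a", 3), ("a", -1), ("b", 2)]))
```

An insertion-only stream gives α = 1, while a stream whose deletions cancel almost all insertions gives a very large α, matching the two extremes named in the abstract.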



“…All the experimental metrics are averaged over 5 independent runs. Moreover, in all experiments, Lazy SpaceSaving$^\pm$ and SpaceSaving$^\pm$ use the same amount of space, while the universe size is $U = 2^{16}$, and we set $\delta = U^{-1}$ to align the experiments with the theoretical literature [7,30].…”

Section: Methods (mentioning)

confidence: 99%

SpaceSaving$^\pm$: An Optimal Algorithm for Frequency Estimation and Frequent items in the Bounded Deletion Model

Zhao,

Agrawal,

Abbadi

et al. 2021

Preprint


In this paper, we propose the first deterministic algorithms to solve the frequency estimation and frequent item problems in the bounded deletion model. We establish the space lower bound for solving the deterministic frequent items problem in the bounded deletion model, and propose the Lazy SpaceSaving$^\pm$ and SpaceSaving$^\pm$ algorithms with optimal space bound. We develop an efficient implementation of the SpaceSaving$^\pm$ algorithm that minimizes the latency of update operations using novel data structures. The experimental evaluations testify that SpaceSaving$^\pm$ has accurate frequency estimations and achieves very high recall and precision across different data distributions while using minimal space. Our analysis and experiments clearly demonstrate that SpaceSaving$^\pm$ provides more accurate estimations using the same space as the state of the art protocols for applications with up to $\frac{\log U - 1}{\log U}$ of items deleted, where $U$ is the input universe size. Moreover, motivated by prior work, we propose Dyadic SpaceSaving$^\pm$, the first deterministic quantile approximation sketch in the bounded deletion model.
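As a rough worked instance of the deletion fraction (log U − 1)/log U named in this abstract, the following evaluates it for the universe size U = 2^16 reported in the citing statement. The base-2 logarithm is an assumption; the abstract does not state the base.

```latex
\frac{\log_2 U - 1}{\log_2 U}
  = \frac{16 - 1}{16}
  = \frac{15}{16}
  = 0.9375
  \qquad \text{for } U = 2^{16}
```

Under that assumption, the claim would cover workloads in which up to about 94% of the inserted items are later deleted.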


“…We approximate instantaneous throughput by calculating throughput (using system timestamps) every κ observations. In our evaluation, we fix $\kappa = 2^{17}$.…”

Section: Methods (mentioning)

confidence: 99%

Timely Reporting of Heavy Hitters using External Memory

Pandey

Singh

Bender

et al. 2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data


Given an input stream S of size N, a ϕ-heavy hitter is an item that occurs at least ϕN times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T-th occurrence, where T = ϕN (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.
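The reporting rule in this abstract (report an item once its count reaches T = ϕN) can be sketched with an exact in-memory baseline. This is a minimal illustration, not the paper's external-memory data structure: it keeps one counter per distinct item, which is the large-space approach the abstract says exact reporting requires. The function name `timely_heavy_hitters` and its integer `threshold` parameter (standing in for T) are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Iterator, Tuple


def timely_heavy_hitters(
    stream: Iterable[Hashable], threshold: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) at the moment an item reaches `threshold` occurrences.

    `threshold` plays the role of T = phi * N and is passed as an integer
    (an assumption made here to avoid floating-point rounding).
    Positions are 1-indexed. Each item is reported exactly once, with zero delay.
    """
    if threshold < 1:
        raise ValueError("threshold must be a positive integer")
    counts = Counter()  # exact count for every distinct item seen so far
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        if counts[item] == threshold:
            yield position, item


# 'a' occurs at positions 1, 3, 5, 7, so it is reported at position 5 for T = 3.
print(list(timely_heavy_hitters("abacabad", threshold=3)))
```

Because every distinct item gets its own counter, this baseline has no false positives or false negatives, but its memory grows with the number of distinct items in the stream.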

