An optimal algorithm for the distinct elements problem

Kane, Daniel M.; Nelson, Jelani; Woodruff, David P.

doi:10.1145/1807085.1807094

Cited by 269 publications

(283 citation statements)

References 41 publications

Supporting

Mentioning

282

Contrasting

“…Then γ_1 is the number of distinct elements in the sequence W(I). The results of Kane, Nelson and Woodruff [12] imply that using O(ε^{-2} + log n) space we can compute a value γ̂_1 such that…”

Section: Theorem 12 (mentioning)

confidence: 99%

Interval selection in the streaming model

Cabello

Pérez-Lantero

2017

Theoretical Computer Science

A set of intervals is independent when the intervals are pairwise disjoint. In the interval selection problem we are given a set I of intervals and we want to find an independent subset of intervals of largest cardinality. Let α(I) denote the cardinality of an optimal solution. We discuss the estimation of α(I) in the streaming model, where we only have one-time, sequential access to the input intervals, the endpoints of the intervals lie in {1, ..., n}, and the amount of memory is constrained. For intervals of different sizes, we provide an algorithm in the data stream model that computes an estimate α̂ of α(I) that, with probability at least 2/3, satisfies […]. For same-length intervals, we provide another algorithm in the data stream model that computes an estimate α̂ of α(I) that, with probability at least 2/3, satisfies (2/3)(1 − ε)α(I) ≤ α̂ ≤ α(I). The space used by our algorithms is bounded by a polynomial in ε^{-1} and log n. We also show that no better estimations can be achieved using o(n) bits of storage. We also develop new, approximate solutions to the interval selection problem, where we want to report a feasible solution, that use O(α(I)) space. Our algorithms for the interval selection problem match the optimal results by Emek, Halldórsson and Rosén [Space-Constrained Interval Selection, ICALP 2012], but are much simpler.
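To make the quantity α(I) from the abstract above concrete, the following is a minimal offline sketch that computes a largest independent subset with the classical earliest-right-endpoint greedy rule. It is not the streaming algorithm of the cited paper: it sorts the whole input and therefore needs space linear in the number of intervals. The function name `max_independent_intervals` and the choice of closed intervals with integer endpoints are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Assumption: an interval is a closed range [left, right] with integer endpoints.
Interval = Tuple[int, int]


def max_independent_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Return a largest set of pairwise disjoint intervals (offline greedy).

    Intervals are scanned in order of right endpoint; an interval is kept
    whenever it starts strictly after the last kept interval ends.
    """
    chosen: List[Interval] = []
    last_right = None
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if last_right is None or left > last_right:
            chosen.append((left, right))
            last_right = right
    return chosen


if __name__ == "__main__":
    example = [(1, 4), (3, 5), (5, 8), (6, 7), (9, 10)]
    picked = max_independent_intervals(example)
    # alpha(I) is the size of the returned set.
    print(len(picked), picked)  # prints: 3 [(1, 4), (6, 7), (9, 10)]
```

The length of the returned list is α(I) for the given input, which gives an exact baseline against which a streaming estimate α̂ could be compared on small instances.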

“…We choose data structures that satisfy our constraints (see §2.4), yet note that there are further candidates. For example, there are extensions available for the HyperLogLog algorithm that we use [10]: Kane et al. [14] propose an algorithm with an even lower memory overhead; it however remains complex and seems impractical to implement [13]. Heule et al. likewise propose a series of improvements to HyperLogLog [13].…”

Section: Communication Overhead (mentioning)

confidence: 99%

Count Me In: Viable Distributed Summary Statistics for Securing High-Speed Networks

Amann

Hall

Sommer

2014

Research in Attacks, Intrusions and Defenses

Abstract. Summary statistics represent a key primitive for profiling and protecting operational networks. Many network operators routinely measure properties such as throughput, traffic mix, and heavy hitters. Likewise, security monitoring often deploys statistical anomaly detectors that trigger, e.g., when a source scans the local IP address range, or exceeds a threshold of failed login attempts. Traditionally, a diverse set of tools is used for such computations, each typically hard-coding either the features it operates on or the specific calculations it performs, or both. In this work we present a novel framework for calculating a wide array of summary statistics in real-time, independent of the underlying data, and potentially aggregated from independent monitoring points. We focus on providing a transparent, extensible, easy-to-use interface and implement our design on top of an open-source network monitoring system. We demonstrate a set of example applications for profiling and statistical anomaly detection that would traditionally require significant effort and different tools to compute. We have released our implementation under BSD license and report experiences from real-world deployments in large-scale network environments.
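The scan-detection use case in this abstract comes down to counting distinct items (for example, distinct destination addresses per source) in bounded memory. As a rough illustration of that primitive, here is a small k-minimum-values (KMV) estimator. This is a swapped-in, simpler technique: it is neither the HyperLogLog variant used by the citing paper nor the algorithm of Kane, Nelson and Woodruff. The class name `KMVSketch`, the default `k`, and the use of SHA-256 as the hash function are all assumptions made for this sketch.

```python
import hashlib
import heapq


class KMVSketch:
    """k-minimum-values estimator for the number of distinct items.

    Keeps only the k smallest hash values seen so far, so memory is O(k)
    regardless of stream length.
    """

    def __init__(self, k: int = 256) -> None:
        self.k = k
        self._heap = []         # negated hash values: a max-heap of the k smallest
        self._members = set()   # hash values currently stored in the heap

    @staticmethod
    def _hash01(item: str) -> float:
        # Assumption: SHA-256 behaves like a uniform hash into [0, 1).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item: str) -> None:
        h = self._hash01(item)
        if h in self._members:
            return  # duplicate of a value already tracked
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while fewer than k distinct hashes
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(20000):
        sketch.add(f"10.0.{i // 256}.{i % 256}")  # hypothetical destination addresses
    print(round(sketch.estimate()))
```

Larger `k` lowers the relative error (roughly on the order of 1/sqrt(k)) at the cost of more memory, which is the same space/accuracy trade-off that the ε parameter expresses in the distinct elements literature.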

“…A traditional streaming aggregation task, which we henceforth call "whole stream aggregation", asks for aggregates such as frequency moments, quantiles, or heavy hitters over the entire stream, and a number of works have dealt with estimating these using limited memory (see, e.g., [1,18,21,24,23]). The major difference between whole stream aggregation and correlated aggregation is that in the case of correlated aggregation, the scope of aggregation is restricted to only those items which satisfy the selection predicate.…”

Section: Introduction (mentioning)

confidence: 99%

A General Method for Estimating Correlated Aggregates Over a Data Stream

Tirthapura

Woodruff

2014

Algorithmica

Self Cite

On a stream S of two dimensional data items (x,y) where x is an item identifier and y is a numerical attribute, a correlated aggregate query C(σ,AGG,S) asks to first apply a selection predicate σ along the y dimension, followed by an aggregation AGG along the x…
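As a small illustration of the query C(σ,AGG,S) described above, the sketch below evaluates it exactly for one assumed choice of aggregate: AGG is taken to be the number of distinct identifiers x, and σ is any predicate on y. This is a brute-force baseline that stores every selected identifier, not the limited-memory method of the cited paper. The function name `correlated_distinct_count` and the sample data are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple


def correlated_distinct_count(
    stream: Iterable[Tuple[Hashable, float]],
    predicate: Callable[[float], bool],
) -> int:
    """Exact C(sigma, AGG, S) with AGG = number of distinct x (an assumption).

    First applies the selection predicate sigma to y, then aggregates over
    the x values of the items that passed the selection.
    """
    selected: Set[Hashable] = set()
    for x, y in stream:
        if predicate(y):
            selected.add(x)
    return len(selected)


if __name__ == "__main__":
    # Hypothetical stream of (identifier, numerical attribute) pairs.
    s = [("a", 5.0), ("b", 12.0), ("a", 7.0), ("c", 3.0), ("b", 1.0)]
    whole_stream = correlated_distinct_count(s, lambda y: True)
    correlated = correlated_distinct_count(s, lambda y: y >= 10.0)
    print(whole_stream, correlated)  # prints: 3 1
```

Passing a predicate that accepts every item recovers whole stream aggregation, while a restrictive predicate limits the scope of the aggregate to the selected items, which is the distinction the citing statement draws.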
