Estimating simple functions on the union of data streams

Gibbons, Phillip B.; Tirthapura, Srikanta

doi:10.1145/378580.378687

Cited by 169 publications

(179 citation statements)

References 28 publications

Supporting

Mentioning

177

Contrasting

“…Alon, Matias and Szegedy [1] gave a constant factor approximation in small space. Gibbons and Tirthapura [10] showed a $(1 \pm \epsilon)$ factor approximation in space $\tilde{O}(1/\epsilon^2)$; subsequent work has improved the (hidden) logarithmic factors [2].…”

Section: Review (mentioning)

confidence: 99%

On Estimating Frequency Moments of Data Streams

Ganguly

Cormode

2007

Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques

Abstract. Space-economical estimation of the pth frequency moment, defined as $F_p = \sum_{i=1}^{n} |f_i|^p$ for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computations in general. The concept of p-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied in Li [13]. In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(1/\epsilon^2)$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14,13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde{O}(1/\epsilon^{2+p})$ to estimate $F_p$ to within $1 \pm \epsilon$ factors and require expected time $O(\log F_1 \log \frac{1}{\delta})$ to process each update. Thus, our technique trades an $O(1/\epsilon^p)$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.
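
As a point of reference for the definition in this abstract, the following minimal Python sketch computes $F_p$ exactly by storing every item's frequency. It is not the sketch-based algorithm the abstract describes: it uses space linear in the number of distinct items, which is precisely the cost the streaming algorithms avoid. The function name `frequency_moment` and the `(item, delta)` update format are assumptions made for illustration.

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact F_p = sum_i |f_i|^p for a stream of (item, delta) updates.

    Baseline only: keeps one counter per distinct item (linear space).
    The update format and function name are illustrative assumptions.
    """
    freq = Counter()
    for item, delta in stream:
        freq[item] += delta
    # Items whose net frequency is zero contribute nothing for p > 0.
    return sum(abs(f) ** p for f in freq.values() if f != 0)

# Net frequencies after these updates: a = 2, b = 1, c = 2
updates = [("a", 3), ("b", 1), ("a", -1), ("c", 2)]
f1 = frequency_moment(updates, 1)  # 2 + 1 + 2
f2 = frequency_moment(updates, 2)  # 4 + 1 + 4
```

An approximate streaming estimator can be checked against this exact baseline on small inputs.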

“…However, it is more restricted in that it requires that a data item can never again be retrieved in main memory after its first pass (if it is a one-pass algorithm). A distributed stream model is also proposed in [53] which combines features of both streaming models and communication complexity models.…”

Section: The Data Stream Computation Model (mentioning)

confidence: 99%

On Approximation Algorithms for Data Mining Applications

Afrati

2006

Lecture Notes in Computer Science

“…There has been a lot of work in computing over data streams for purposes such as set resemblance, data mining, creating histograms, and so on [11], [17], [29]. Particularly relevant is some recent work [23], [25] which studies the problem of finding the size of the union of two streams. Here, the streams define multisets of elements, and it is the size of the union of the supporting sets that is of interest.…”

Section: Work On Data Streams and Sketches (mentioning)

confidence: 99%

Comparing Data Streams Using Hamming Norms (How to Zero In)

Cormode

2002

VLDB '02: Proceedings of the 28th International Conference on Very Large Databases

Abstract. Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "$l_0$ sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
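
The two uses of the Hamming norm described in this abstract can be illustrated with a minimal exact computation in Python. This is a sketch under simple assumptions (streams are plain lists of items, and all counts are kept in memory); it is not the "$l_0$ sketch" approximation technique the abstract refers to. The function names `hamming_norm` and `hamming_distance` are hypothetical.

```python
from collections import Counter

def hamming_norm(stream):
    """Number of items with a nonzero count, i.e. distinct items in one stream."""
    counts = Counter(stream)
    return sum(1 for c in counts.values() if c != 0)

def hamming_distance(stream_a, stream_b):
    """Number of items whose counts differ between the two streams."""
    a, b = Counter(stream_a), Counter(stream_b)
    # Counter returns 0 for missing keys, so items absent from one stream
    # are compared against a count of zero.
    return sum(1 for x in a.keys() | b.keys() if a[x] != b[x])

s1 = ["x", "y", "x"]        # counts: x = 2, y = 1
s2 = ["x", "y", "z"]        # counts: x = 1, y = 1, z = 1
distinct = hamming_norm(s1)          # x and y
differing = hamming_distance(s1, s2)  # x and z have unequal counts
```

Exact counting like this needs memory proportional to the number of distinct items, which is what motivates approximate sketches for massive streams.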
