Finding (Recently) Frequent Items in Distributed Data Streams

Manjhi, Amit; Shkapenyuk, Vladislav; Dhamdhere, Kedar; Olston, Christopher

doi:10.1109/icde.2005.68

Cited by 145 publications

(131 citation statements)

References 19 publications

(47 reference statements)

“…This is studied by Babcock et al [18] in the distributed setting, and extended by Olston et al [19] to support sum and average queries. These approaches aim to keep the local elephants aligned with the global ones and hence face the same issue as the above solution [17]: icebergs that are finely distributed among the local nodes are hard to discover.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our work differs from theirs since we assume fixed measurement periods, which potentially allows us to have more communication-efficient mechanisms. Manjhi et al [17] studied the problem of discovering icebergs in a distributed environment when nodes are arranged in a multi-level communication hierarchy. We study the simpler, practically motivated single-level communication scheme instead.…”

Section: Related Work (mentioning)

confidence: 99%

Uncovering Global Icebergs in Distributed Streams: Results and Implications

et al. 2010

Discovering icebergs in distributed streams of data is an important problem for a number of applications in networking and databases. While previous work has concentrated on measuring these icebergs in the non-distributed streaming case or in the non-streaming distributed case, we present a general framework that allows for distributed processing across multiple streams of data. We compare several of the state-of-the-art streaming algorithms for estimating local elephants in the individual streams. However, since an iceberg may be hidden by being distributed across many different streams, we add a sampling component to handle such cases. We provide a novel taxonomy of current sketches and perform a thorough analysis of the strengths and weaknesses of each scheme under various QoS metrics, using both real and synthetic Internet trace data. We summarize their performance and discuss the implications for the future design of sketches.
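The abstract's point that an iceberg may be hidden by being spread across many streams can be illustrated with a small example. This is a minimal sketch with exact counting, not one of the sketches or the sampling component evaluated in the paper; the function names, thresholds, and toy streams are assumptions chosen for illustration.

```python
from collections import Counter

def local_elephants(stream, local_threshold):
    """Items whose count within a single stream reaches the local threshold."""
    counts = Counter(stream)
    return {item for item, c in counts.items() if c >= local_threshold}

def global_icebergs(streams, global_threshold):
    """Items whose count summed over all streams reaches the global threshold."""
    total = Counter()
    for stream in streams:
        total.update(stream)
    return {item for item, c in total.items() if c >= global_threshold}

# Hypothetical data: "x" is never the heaviest item at any one node,
# but it is the only item that is heavy in aggregate.
streams = [
    ["a"] * 5 + ["x"] * 3,
    ["b"] * 5 + ["x"] * 3,
    ["c"] * 5 + ["x"] * 3,
]

found_locally = set().union(*(local_elephants(s, 5) for s in streams))
found_globally = global_icebergs(streams, 9)

print(sorted(found_locally))   # items reported by per-node detection
print(sorted(found_globally))  # items that are heavy across all nodes
```

Here per-node detection reports only the locally dominant items, while the item that is heavy in aggregate is missed unless counts from several nodes are combined.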

“…For example, in the case of network routers, maintaining a random sample from the union of the streams is valuable for network monitoring tasks involving the detection of global properties [4]. Other problems on distributed stream processing, including the estimation of the number of distinct elements [1], [5] and heavy hitters [6], [7], [8], [9], use random sampling as a primitive (we note, though, that better solutions for the heavy hitters problem in terms of the accuracy parameter may be possible [9] than those provided by random sampling). Distributed random sampling is already used in current day "big data" systems such as BlinkDB [10], which use stored random samples to process queries quickly, in exchange for relaxed accuracy guarantees.…”

Section: Introduction (mentioning)

confidence: 99%

A Simple Message-Optimal Algorithm for Random Sampling from a Distributed Stream

Chung, Tirthapura, Woodruff 2016, IEEE Trans. Knowl. Data Eng.

We present a simple, message-optimal algorithm for maintaining a random sample from a large data stream whose input elements are distributed across multiple sites that communicate via a central coordinator. At any point in time, the set of elements held by the coordinator represent a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system. We present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. We also consider the important case when the distribution of elements across different sites is non-uniform, and show that for such inputs, our algorithm significantly outperforms prior solutions.
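The setting described in this abstract (several sites, one coordinator, a uniform sample maintained at the coordinator) can be sketched as follows. This is a minimal illustration of one well-known approach, random tags with a coordinator-side threshold, and is not claimed to be the paper's exact protocol or to achieve its message bound. The class and method names (`Coordinator`, `Site`, `offer`, `observe`) are hypothetical.

```python
import heapq
import random

class Coordinator:
    """Keeps the sample_size elements with the smallest random tags seen so far."""

    def __init__(self, sample_size):
        self.sample_size = sample_size
        self._heap = []      # max-heap on tag, stored as (-tag, element)
        self.messages = 0    # number of site-to-coordinator messages received

    def threshold(self):
        # Until the sample is full, every tag is accepted.
        if len(self._heap) < self.sample_size:
            return 1.0
        return -self._heap[0][0]  # largest tag currently in the sample

    def offer(self, tag, element):
        self.messages += 1
        if len(self._heap) < self.sample_size:
            heapq.heappush(self._heap, (-tag, element))
        elif tag < -self._heap[0][0]:
            heapq.heapreplace(self._heap, (-tag, element))
        return self.threshold()   # reply lets the site refresh its local view

    def sample(self):
        return [element for _, element in self._heap]

class Site:
    """Forwards an element only if its tag beats the site's (possibly stale) threshold."""

    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng
        self.local_threshold = 1.0

    def observe(self, element):
        tag = self.rng.random()   # uniform in [0, 1)
        if tag < self.local_threshold:
            self.local_threshold = self.coordinator.offer(tag, element)

# Hypothetical run: 4 sites, 10,000 elements dealt round-robin, sample of 10.
rng = random.Random(0)
coord = Coordinator(sample_size=10)
sites = [Site(coord, rng) for _ in range(4)]
for i in range(10_000):
    sites[i % 4].observe(i)

print(len(coord.sample()), coord.messages)
```

Because the coordinator's threshold only decreases, a site's stale threshold is never smaller than the true one, so no element that belongs in the sample is withheld; stale thresholds only cost extra messages.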

“…A number of heuristic solutions have been proposed recently for set unions such as the game above and other set expressions, quantiles, heavy hitters and sketch-maintenance [163,98,49,48].…”

Section: Distributed Continuous Computation (mentioning)

confidence: 99%

Data Streams: Algorithms and Applications

Muthukrishnan 2005, FNT in Theoretical Computer Science

In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
