S. Muthukrishnan scite author profile

Abstract. We introduce a new sublinear space data structure, the Count-Min Sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε² to 1/ε in factor.


Data Streams: Algorithms and Applications

Muthukrishnan

2005

FNT in Theoretical Computer Science

759

541


In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].


Data Streams: Algorithms and Applications

Muthukrishnan

2005

640

462


What's hot and what's not: tracking most frequent items dynamically

Cormode

Muthukrishnan

2005

ACM Trans. Database Syst.

391

360


Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications. We present a new algorithm for dynamically determining the hot items at any time in a relation that is undergoing deletion operations as well as inserts. Our algorithm maintains a small space data structure that monitors the transactions on the relation, and when required, quickly outputs all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. Our algorithm relies on the idea of "group testing", is simple to implement, and has provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithm is remarkably accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
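As a rough illustration of the group-testing idea mentioned in this abstract, the sketch below handles only the simplest case: a single item that makes up more than half of the current relation. It is not the authors' full algorithm. The class name `MajorityTracker`, the restriction to non-negative integer items below `2**bits`, and the assumption that only items currently present are deleted are all choices made here for the example. Finding every item above a general threshold needs additional machinery that is not shown.

```python
class MajorityTracker:
    """Tracks a majority item under inserts and deletes (illustrative only).

    Keeps one total counter plus one counter per bit position. If some item
    accounts for more than half of the current total, its bits can be read
    off by testing which bit counters exceed half the total.
    """

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0
        self.bit_counts = [0] * bits

    def _apply(self, item, delta):
        # Update the total and the counter of every bit set in the item.
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self._apply(item, +1)

    def delete(self, item):
        # Assumes the item was inserted earlier and is still present.
        self._apply(item, -1)

    def candidate(self):
        """Returns the majority item if one exists.

        If no item holds a strict majority, the returned value is
        meaningless and would need to be verified by other means.
        """
        if self.total <= 0:
            return None
        item = 0
        for j in range(self.bits):
            if 2 * self.bit_counts[j] > self.total:
                item |= 1 << j
        return item


tracker = MajorityTracker(bits=8)
for x in [7, 7, 7, 12, 5]:
    tracker.insert(x)
print(tracker.candidate())  # 7 holds 3 of 5 entries
```

The space used is one counter per bit of the item identifier, independent of the number of transactions, and deletions are handled by the same update path as insertions.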


Node Classification in Social Networks

2011


When dealing with large graphs, such as those that arise in the context of online social networks, a subset of nodes may be labeled. These labels can indicate demographic values, interest, beliefs or other characteristics of the nodes (users). A core problem is to use this information to extend the labeling so that all nodes are assigned a label (or labels). In this chapter, we survey classification techniques that have been proposed for this problem. We consider two broad categories: methods based on iterative application of traditional classifiers using graph information as features, and methods which propagate the existing labels via random walks. We adopt a common perspective on these methods to highlight the similarities between different approaches within and across the two categories. We also describe some extensions and related directions to the central problem of node classification.


Influence sets based on reverse nearest neighbor queries

2000


Inherent in the operation of many decision support and continuous referral systems is the notion of the "influence" of a data point on the database. This notion arises in examples such as finding the set of customers affected by the opening of a new store outlet location, notifying the subset of subscribers to a digital library who will find a newly added document most relevant, etc. Standard approaches to determining the influence set of a data point involve range searching and nearest neighbor queries. In this paper, we formalize a novel notion of influence based on reverse neighbor queries and its variants. Since the nearest neighbor relation is not symmetric, the set of points that are closest to a query point (i.e., the nearest neighbors) differs from the set of points that have the query point as their nearest neighbor (called the reverse nearest neighbors). Influence sets based on reverse nearest neighbor (RNN) queries seem to capture the intuitive notion of influence from our motivating examples. We present a general approach for solving RNN queries and an efficient R-tree based method for large data sets, based on this approach. Although the RNN query appears to be natural, it has not been studied previously. RNN queries are of independent interest, and as such should be part of the suite of available queries for processing spatial and multimedia data. In our experiments with real geographical data, the proposed method appears to scale logarithmically, whereas straightforward sequential scan scales linearly. Our experimental study also shows that approaches based on range searching or nearest neighbors are ineffective at finding influence sets of our interest.
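The asymmetry between nearest neighbors and reverse nearest neighbors can be seen in a small brute-force example. This is the sequential-scan style of computation, not the R-tree based method the abstract describes. The function names and the sample points are hypothetical, and Euclidean distance is assumed.

```python
import math


def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding points[i] itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def reverse_nearest_neighbors(points, q):
    """Indices of all points whose nearest neighbor is points[q].

    Brute force: one nearest-neighbor scan per point, so quadratic time.
    """
    return [
        i
        for i in range(len(points))
        if i != q and nearest_neighbor(points, i) == q
    ]


# Four points on a line at x = 0, 1, 3, 7.
points = [(0, 0), (1, 0), (3, 0), (7, 0)]
print(nearest_neighbor(points, 1))           # 0
print(reverse_nearest_neighbors(points, 1))  # [0, 2]
print(reverse_nearest_neighbors(points, 3))  # []
```

Point 1 has a single nearest neighbor (point 0) but is itself the nearest neighbor of two points, while point 3 has a nearest neighbor yet no reverse nearest neighbors at all.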


Heavy-Hitter Detection Entirely in the Data Plane

Narayana,

et al. 2017


Identifying the "heavy hitter" flows or flows with large traffic volumes in the data plane is important for several applications, e.g., flow-size aware routing, DoS detection, and traffic engineering. However, measurement in the data plane is constrained by the need for line-rate processing (at 10-100 Gb/s) and limited memory in switching hardware. We propose HashPipe, a heavy hitter detection algorithm using emerging programmable data planes. HashPipe implements a pipeline of hash tables which retain counters for heavy flows in the tables while evicting lighter flows over time. We prototype HashPipe in P4 and evaluate it with packet traces from an ISP backbone link and a data center. On the ISP trace, we find that HashPipe identifies 95% of the 300 heaviest flows with less than 80 KB of memory on a trace that contains 400,000 flows.
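The following is a software sketch of one plausible reading of "a pipeline of hash tables which retain counters for heavy flows while evicting lighter flows". It is not the P4 prototype. The specific rules used here are assumptions of this example and are not stated in the abstract: the first stage always admits the incoming flow, and each later stage keeps whichever of the resident and carried entries has the larger count. The class name `HashPipeSketch`, the salted use of Python's built-in `hash`, and all parameter names are hypothetical.

```python
import random


class HashPipeSketch:
    """Pipeline of small hash tables for heavy-flow tracking (illustrative)."""

    def __init__(self, stages, slots, seed=0):
        rng = random.Random(seed)
        self.slots = slots
        # One salt per stage so that each stage hashes keys differently.
        self.salts = [rng.getrandbits(32) for _ in range(stages)]
        # Each slot is either None or a [key, count] pair.
        self.tables = [[None] * slots for _ in range(stages)]

    def _slot(self, stage, key):
        return hash((self.salts[stage], key)) % self.slots

    def add(self, key):
        """Processes one packet belonging to flow `key`."""
        # Stage 0 (assumed rule): always admit the incoming key.
        idx = self._slot(0, key)
        entry = self.tables[0][idx]
        if entry is None:
            self.tables[0][idx] = [key, 1]
            return
        if entry[0] == key:
            entry[1] += 1
            return
        carried = entry
        self.tables[0][idx] = [key, 1]

        # Later stages (assumed rule): keep the heavier entry, carry the lighter.
        for s in range(1, len(self.tables)):
            idx = self._slot(s, carried[0])
            entry = self.tables[s][idx]
            if entry is None:
                self.tables[s][idx] = carried
                return
            if entry[0] == carried[0]:
                entry[1] += carried[1]
                return
            if entry[1] < carried[1]:
                self.tables[s][idx], carried = carried, entry
        # Whatever is still carried after the last stage is dropped.

    def heavy_hitters(self, k):
        """Top-k flows by count, summing a flow's entries across stages."""
        totals = {}
        for table in self.tables:
            for entry in table:
                if entry is not None:
                    totals[entry[0]] = totals.get(entry[0], 0) + entry[1]
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


pipe = HashPipeSketch(stages=3, slots=8)
for flow in ["a"] * 50 + ["b"] * 5 + ["c"] * 3 + ["a"] * 50:
    pipe.add(flow)
print(pipe.heavy_hitters(2))
```

Under these rules a reported count can undercount a flow (entries dropped at the end of the pipeline lose their packets) but never overcounts it, and memory is fixed at `stages * slots` entries regardless of the number of distinct flows.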


An improved data stream summary: the count-min sketch and its applications

Cormode,

Muthukrishnan

2005

Journal of Algorithms

1,399

206


We introduce a new sublinear space data structure, the count-min sketch, for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known, typically from 1/ε² to 1/ε in factor.
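A minimal sketch of the data structure, restricted to point queries on a stream of non-negative updates, might look as follows. The class and method names are hypothetical, and the hash functions (a salted linear map applied to Python's built-in `hash`) are a convenience chosen for this example rather than the family analyzed in the paper. The commonly quoted sizing is a width of about e/ε and a depth of about ln(1/δ) for additive error εN with probability at least 1 - δ, where N is the total count; that guarantee depends on suitable hash functions and is not claimed for this toy version.

```python
import random


class CountMinSketch:
    """Count-min sketch supporting point queries (illustrative only)."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One (a, b) pair per row: h(x) = ((a * x + b) mod p) mod width.
        self._params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]
        self._rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        a, b = self._params[row]
        x = hash(item) % self._PRIME
        return ((a * x + b) % self._PRIME) % self.width

    def update(self, item, count=1):
        """Adds `count` (assumed non-negative) to the item's counters."""
        for r in range(self.depth):
            self._rows[r][self._index(r, item)] += count

    def estimate(self, item):
        """Point query: the minimum over rows, which never undercounts."""
        return min(
            self._rows[r][self._index(r, item)] for r in range(self.depth)
        )


cms = CountMinSketch(width=272, depth=5)
for x in [1] * 5 + [2] * 2:
    cms.update(x)
print(cms.estimate(1) >= 5)  # True: collisions can only inflate a counter
```

Each update touches one counter per row, and a query takes the minimum because every counter for an item holds its true count plus whatever collided with it in that row.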


