Optimizing Frequency Queries for Data Mining Applications

Malik, Hassan; Kender, John R.

doi:10.1109/icdm.2007.34

Cited by 10 publications

(13 citation statements)

References 16 publications

(29 reference statements)


“…Additionally, we note that the computational complexity of calculating within-dataset pattern significance using an interestingness measure (line 8 of Algorithm 1) depends on the underlying dataset representation used for frequency counting. Experiments in this paper represented datasets as compressed bitmaps (Malik and Kender 2007), with frequency counting time complexity of O(d), where d is the number of documents in the dataset. However, in practice, frequency counting using compressed bitmaps takes close to constant time on document datasets because of their inherent sparseness.…”

Section: Discussion (mentioning)

confidence: 99%

Hierarchical document clustering using local patterns

Malik

Kender

Fradkin

et al. 2010

Data Min Knowl Disc

Self Cite


The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to "vote" for its representative size-2 patterns in a way that ensures an effective balance between local pattern frequency and pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then directly constructed using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained by following a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy, and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adapts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
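The citation statement for this paper describes frequency counting over bitmap-encoded datasets. A minimal sketch of that idea follows, under stated assumptions: it uses plain (uncompressed) Python integers as bitmaps rather than the compressed bitmaps of Malik and Kender (2007), and the names build_bitmaps and pattern_frequency are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List


def build_bitmaps(docs: List[Iterable[str]]) -> Dict[str, int]:
    """Map each term to an integer whose bit i is set when document i contains it.

    Assumption: uncompressed bitmaps stored as Python ints, for illustration only.
    """
    bitmaps: Dict[str, int] = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps


def pattern_frequency(bitmaps: Dict[str, int], pattern: Iterable[str], n_docs: int) -> int:
    """Count the documents that contain every term of the pattern."""
    result = (1 << n_docs) - 1  # start with all documents selected
    for term in pattern:
        result &= bitmaps.get(term, 0)  # keep only documents that also contain this term
        if result == 0:
            break  # no document can match any more
    return bin(result).count("1")  # population count of the surviving bits


# Toy dataset: four documents over the terms a, b, c.
docs = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
bitmaps = build_bitmaps(docs)
freq_ab = pattern_frequency(bitmaps, ["a", "b"], len(docs))
```

Each AND touches one bit per document, which matches the O(d) bound quoted in the statement. The near-constant time observed on sparse data depends on compression, which this sketch does not model.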


“…This structure referred to as Regular Ordered Trie (ROT) has been investigated in a different setting [21], as discussed in Section 5. It was also used more recently in data mining [4,14]. Figure 1(c) depicts the ROT index for the example subscriptions of Table 1 where the term rank is given by the term subscript (t1 has the highest rank in VS).…”

Section: Regular Ordered Trie (mentioning)

confidence: 99%

“…In this paper, we are interested in efficient implementations of both indexing schemes using Inverted Lists (IL) [23] for CI and a variant for distinct terms of Ordered Tries (OT) [11] for TI and study their behavior for critical parameters of realistic Web syndication workloads. Although these data structures have been employed to evaluate broad match queries in the context of selective information dissemination [21] and sponsored search [12] or for mining frequent itemsets [3,14], their memory and matching time requirements appear to be quite different in our setting. This is due to the peculiarities of Web syndication systems which are characterized [8] (a) by information items of average length (25-36 distinct terms) which are greater than advertisement bids (4-5 terms [12]) and smaller than documents of Web collections (12K terms [21]) (b) by very large vocabularies of terms (up to 1.5M terms). Note also, that due to broad match semantics Information Retrieval techniques for optimizing ILs (e.g., early pruning [23]) are not suited in our setting.…”

Section: Introduction (mentioning)

confidence: 99%

Subscription indexes for web syndication systems

Hmedeh

Kourdounakis

Christophides

et al. 2012

Proceedings of the 15th International Conference on Extending Database Technology


The explosion of published information on the Web leads to the emergence of a Web syndication paradigm, which transforms the passive reader into an active information collector. Information consumers subscribe to RSS/Atom feeds and are notified whenever a piece of news (item) is published. The success of this Web syndication now offered on Web sites, blogs, and social media, however raises scalability issues. There is a vital need for efficient real-time filtering methods across feeds, to allow users to follow effectively personally interesting information. We investigate in this paper three indexing techniques for users' subscriptions based on inverted lists or on an ordered trie. We present analytical models for memory requirements and matching time and we conduct a thorough experimental evaluation to exhibit the impact of critical workload parameters on these structures.
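To make the inverted-list idea in the statements above concrete, here is a minimal sketch of subscription matching with a counting scheme. It assumes that "broad match" means every term of a subscription appears in the incoming item. The class name CountingInvertedIndex and its methods are hypothetical, and this is not the implementation evaluated in the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set


class CountingInvertedIndex:
    """Inverted lists from terms to subscription ids, matched by counting hits."""

    def __init__(self) -> None:
        self.postings: Dict[str, List[int]] = defaultdict(list)
        self.sizes: List[int] = []  # number of distinct terms per subscription

    def add(self, terms: Set[str]) -> int:
        """Register a subscription and return its id."""
        if not terms:
            raise ValueError("a subscription needs at least one term")
        sub_id = len(self.sizes)
        self.sizes.append(len(terms))
        for term in terms:
            self.postings[term].append(sub_id)
        return sub_id

    def match(self, item_terms: Set[str]) -> List[int]:
        """Return ids of subscriptions whose terms all occur in the item."""
        hits: Dict[int, int] = defaultdict(int)
        for term in item_terms:
            for sub_id in self.postings.get(term, ()):
                hits[sub_id] += 1
        # A subscription matches when every one of its terms was hit.
        return sorted(s for s, count in hits.items() if count == self.sizes[s])


index = CountingInvertedIndex()
index.add({"t1", "t2"})   # id 0
index.add({"t2"})         # id 1
index.add({"t3", "t4"})   # id 2
matched = index.match({"t1", "t2", "t5"})
```

Matching cost here grows with the lengths of the posting lists of the item's terms, which is one reason vocabulary size and item length matter for such indexes.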


“…Still in the context of bitmap indexes, Malik and Kender [2007] get good compression results using a variation on the Nearest Neighbor TSP heuristic. Unfortunately, its quadratic time complexity makes the processing of large data sets difficult.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unfortunately, its quadratic time complexity makes the processing of large data sets difficult. To improve scalability, Malik and Kender [2007] also propose a faster heuristic called aHDO which we review in § 3.2. In comparison, our novel Multiple Lists heuristic is also an attempt to get a more scalable Nearest Neighbor heuristic.…”

Section: Related Work (mentioning)

confidence: 99%

Reordering rows for better compression

Lemire

Kaser

Gutarra

2012

ACM Trans. Database Syst.


Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, MULTIPLE LISTS, which is a variant on NEAREST NEIGHBOR that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, VORTEX, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.
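The row-reordering idea can be illustrated with a plain greedy Nearest Neighbor pass over the rows, which has the quadratic cost noted in the statements above. This is a sketch under assumptions: it is neither aHDO nor the MULTIPLE LISTS heuristic, it always starts from the first row, and the function names are hypothetical.

```python
from typing import List, Sequence

Row = Sequence[int]


def hamming(a: Row, b: Row) -> int:
    """Number of columns in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def count_runs(rows: List[Row]) -> int:
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    runs = len(rows[0])  # every column starts with one run
    for prev, cur in zip(rows, rows[1:]):
        runs += hamming(prev, cur)  # each differing column starts a new run
    return runs


def nearest_neighbor_order(rows: List[Row]) -> List[Row]:
    """Greedy reordering: repeatedly append the remaining row closest to the last one."""
    if not rows:
        return []
    ordered = [rows[0]]  # assumption: start from the first row
    remaining = list(rows[1:])
    while remaining:
        last = ordered[-1]
        idx = min(range(len(remaining)), key=lambda i: hamming(last, remaining[i]))
        ordered.append(remaining.pop(idx))
    return ordered


rows = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
reordered = nearest_neighbor_order(rows)
runs_before = count_runs(rows)
runs_after = count_runs(reordered)
```

On this toy table the run count drops from 11 to 7. Each step scans all remaining rows, so the pass is quadratic in the number of rows.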


Optimizing Frequency Queries for Data Mining Applications

Abstract: Data mining algorithms use various Trie and bitmap-based
