Seventh IEEE International Conference on Data Mining (ICDM 2007) 2007
DOI: 10.1109/icdm.2007.34

Optimizing Frequency Queries for Data Mining Applications

Abstract: Data mining algorithms use various Trie and bitmap-based

citations
Cited by 10 publications
(13 citation statements)
references
References 16 publications
(29 reference statements)
“…Additionally, we note that the computational complexity of calculating within-dataset pattern significance using an interestingness measure (line 8 of Algorithm 1) depends on the underlying dataset representation used for frequency counting. Experiments in this paper represented datasets as compressed bitmaps (Malik and Kender 2007), with frequency counting time complexity of O(d), where d is the number of documents in the dataset. However, in practice, frequency counting using compressed bitmaps takes close to constant time on document datasets because of their inherent sparseness.…”
Section: Discussion (mentioning)
confidence: 99%
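
The bitmap-based frequency counting mentioned in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions: each item is stored as an uncompressed bitmap over the d documents (a Python integer), whereas the cited work uses compressed bitmaps, and the names `build_bitmaps` and `support` are hypothetical, not taken from any of the papers.

```python
from typing import Dict, Iterable, List, Set


def build_bitmaps(docs: List[Set[str]]) -> Dict[str, int]:
    """Map each item to an integer whose i-th bit is set if document i contains it."""
    bitmaps: Dict[str, int] = {}
    for i, doc in enumerate(docs):
        for item in doc:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
    return bitmaps


def support(bitmaps: Dict[str, int], itemset: Iterable[str], num_docs: int) -> int:
    """Count documents containing every item in itemset by ANDing the item bitmaps."""
    result = (1 << num_docs) - 1  # start with all documents selected
    for item in itemset:
        result &= bitmaps.get(item, 0)  # an unknown item matches no document
    return bin(result).count("1")  # population count = frequency


docs = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
bitmaps = build_bitmaps(docs)
print(support(bitmaps, ["a", "b"], len(docs)))
```

Each AND touches one bit per document, which matches the O(d) bound quoted above. A compressed representation can skip long runs of zeros, which is one plausible reading of why sparse document datasets behave close to constant time in practice.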
“…This structure referred to as Regular Ordered Trie (ROT) has been investigated in a different setting [21], as discussed in Section 5. It was also used more recently in data mining [4,14]. Figure 1(c) depicts the ROT index for the example subscriptions of Table 1 where the term rank is given by the term subscript (t1 has the highest rank in VS).…”
Section: Regular Ordered Trie (mentioning)
confidence: 99%
“…In this paper, we are interested in efficient implementations of both indexing schemes using Inverted Lists (IL) [23] for CI and a variant for distinct terms of Ordered Tries (OT) [11] for TI and study their behavior for critical parameters of realistic Web syndication workloads. Although these data structures have been employed to evaluate broad match queries in the context of selective information dissemination [21] and sponsored search [12] or for mining frequent item sets [3,14], their memory and matching time requirements appear to be quite different in our setting. This is due to the peculiarities of Web syndication systems which are characterized [8] (a) by information items of average length (25-36 distinct terms) which are greater than advertisement bids (4-5 terms [12]) and smaller than documents of Web collections (12K terms [21]) (b) by very large vocabularies of terms (up to 1.5M terms). Note also, that due to broad match semantics Information Retrieval techniques for optimizing ILs (e.g., early pruning [23]) are not suited in our setting.…”
Section: Introduction (mentioning)
confidence: 99%
“…Still in the context of bitmap indexes, Malik and Kender [2007] get good compression results using a variation on the Nearest Neighbor TSP heuristic. Unfortunately, its quadratic time complexity makes the processing of large data sets difficult.…”
Section: Related Work (mentioning)
confidence: 99%
“…Unfortunately, its quadratic time complexity makes the processing of large data sets difficult. To improve scalability, Malik and Kender [2007] also propose a faster heuristic called aHDO which we review in § 3.2. In comparison, our novel Multiple Lists heuristic is also an attempt to get a more scalable Nearest Neighbor heuristic.…”
Section: Related Work (mentioning)
confidence: 99%
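
To make the quadratic cost noted in the statements above concrete, here is a minimal sketch of a generic greedy nearest-neighbor row ordering for a binary matrix. It is not the specific variation used by Malik and Kender, nor aHDO or the Multiple Lists heuristic; the function names and the use of Hamming distance as the row dissimilarity are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence

Row = Sequence[int]


def hamming(a: Row, b: Row) -> int:
    """Number of positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_order(rows: List[Row]) -> List[int]:
    """Greedy ordering: start at row 0, then repeatedly visit the closest unvisited row."""
    if not rows:
        return []
    order = [0]
    remaining = list(range(1, len(rows)))
    while remaining:
        current = rows[order[-1]]
        # Scanning every unvisited row at each step makes this O(n^2) distance computations.
        nxt = min(remaining, key=lambda i: hamming(current, rows[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def transitions(rows: List[Row], order: Iterable[int]) -> int:
    """Total bit flips between consecutive rows; fewer flips means longer runs per column."""
    idx = list(order)
    return sum(hamming(rows[a], rows[b]) for a, b in zip(idx, idx[1:]))


rows = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 0, 1), (0, 0, 1, 0)]
order = nearest_neighbor_order(rows)
print(order, transitions(rows, order), transitions(rows, range(len(rows))))
```

Placing similar rows next to each other lengthens the runs of identical bits in each column, which is what run-length-style bitmap compression benefits from. The scan over all remaining rows at every step is the source of the quadratic behavior that the faster heuristics aim to avoid.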