Summarizing data succinctly with the most informative itemsets

Mampaey, Michael; Vreeken, Jilles; Tatti, Nikolaj

doi:10.1145/2382577.2382580

Cited by 49 publications

(57 citation statements)

References 44 publications

“…Cues may be taken from recent developments in pattern set mining, where algorithms have been proposed that can mine high-quality results directly from data [Smets and Vreeken 2012; Akoglu et al 2012; Mampaey et al 2012]. …”

Section: Discussion (mentioning)

confidence: 99%

“…All typically return more, and in particular more specific patterns than BMF. Wang and Parthasarathy [2006] and Mampaey et al [2012] propose algorithms for summarizing data with sets of itemsets and frequencies. To this end, they construct a probabilistic model for the rows of the data by the maximum entropy principle, and iteratively mine itemsets that maximize the likelihood of the data under the model, while controlling complexity through BIC or MDL scores.…”

Section: Pattern-based Summarization (mentioning)

confidence: 99%

MDL4BMF

Miettinen

Vreeken

2014

ACM Trans. Knowl. Discov. Data

Matrix factorizations, where a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the "model order selection problem" of determining the proper rank of the factorization, that is, to answer where fine-grained structure stops, and where noise starts. Boolean Matrix Factorization (BMF), where data, factors, and matrix product are Boolean, has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model-based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior. ACM Reference Format: Pauli Miettinen and Jilles Vreeken. 2014. MDL4BMF: Minimum description length for Boolean matrix factorization.
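To make the abstract's notion of a Boolean matrix product concrete, here is a minimal sketch in Python with NumPy. It only illustrates the Boolean product and a cell-wise reconstruction error; it is not the MDL4BMF encoding or algorithm, and the function names `boolean_product` and `reconstruction_error` as well as the toy matrices are assumptions made for illustration.

```python
import numpy as np


def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is 1 if B[i, k] = C[k, j] = 1 for some k."""
    # Ordinary integer product counts the matching k; thresholding makes 1 + 1 = 1.
    return (B.astype(int) @ C.astype(int) > 0).astype(int)


def reconstruction_error(A, B, C):
    """Number of cells where the Boolean product of B and C disagrees with A."""
    return int(np.sum(A != boolean_product(B, C)))


# Toy 3x3 data matrix (hypothetical example data).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# A rank-2 Boolean factorization that reproduces A exactly.
B2 = np.array([[1, 0],
               [1, 1],
               [0, 1]])
C2 = np.array([[1, 1, 0],
               [0, 1, 1]])

# A rank-1 factorization that covers every cell and so over-covers two zeros.
B1 = np.array([[1], [1], [1]])
C1 = np.array([[1, 1, 1]])

error_rank2 = reconstruction_error(A, B2, C2)
error_rank1 = reconstruction_error(A, B1, C1)
```

Comparing `error_rank1` with `error_rank2` shows the trade-off behind model order selection: a higher rank lowers the error but makes the model larger, which is the balance the abstract proposes to settle with MDL.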

“…Note that, following our generalised anomaly score for class 1 anomalies, any method that provides a probability for a transaction can be used. Examples based on pattern sets are those of Wang and Parthasarathy [28] and Mampaey et al [16].…”

Section: Related Work (mentioning)

confidence: 99%

“…KRIMP [27] and SLIM [24] are two deterministic algorithms that heuristically optimise this score. Other pattern set mining techniques, especially those that mine patterns characteristic for the data such as [16,9,28], are also meaningful choices to be used with UPC.…”

Section: Related Work (mentioning)

confidence: 99%

Efficiently Discovering Unexpected Pattern-Co-Occurrences

Bertens

Vreeken

Siebes

2017

Proceedings of the 2017 SIAM International Conference on Data Mining

Our world is filled with both beautiful and brainy people, but how often does a Nobel Prize winner also win a beauty pageant? Let us assume that someone who is both very beautiful and very smart is rarer than we would expect from the combination of the number of beautiful and brainy people. Of course there will still always be some individuals that defy this stereotype; these beautiful brainy people are exactly the class of anomaly we focus on in this paper. They do not possess intrinsically rare qualities; it is the unexpected combination of factors that makes them stand out. In this paper we define the above-described class of anomaly and propose a method to quickly identify such anomalies in transaction data. Further, as we take a pattern set based approach, our method readily explains why a transaction is anomalous. The effectiveness of our method is thoroughly verified with a wide range of experiments on both real-world and synthetic data.

“…One has to realize, however, that enumerating the pattern space can actually in itself already be infeasible. To decrease redundancy in pattern collections even more, pattern set mining algorithms became increasingly important [1,20,17]. The goal of pattern set mining techniques is to find a small collection of patterns that are interesting together, rather than on their own.…”

Section: Introduction (mentioning)

confidence: 99%

Randomly sampling maximal itemsets

Moens

Goethals

2013

Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics

Pattern mining techniques generally enumerate lots of uninteresting and redundant patterns. To obtain less redundant collections, techniques exist that give condensed representations of these collections. However, the proposed techniques often rely on complete enumeration of the pattern space, which can be prohibitive in terms of time and memory. Sampling can be used to filter the output space of patterns without explicit enumeration. We propose a framework for random sampling of maximal itemsets from transactional databases. The presented framework can use any monotonically decreasing measure as its interestingness criterion. Moreover, we use an approximation measure to guide the search for maximal sets to different parts of the output space. We show in our experiments that the method can rapidly generate small collections of patterns with good quality. The sampling framework has been implemented in the interactive visual data mining tool MIME, enabling users to quickly sample a collection of patterns and analyze the results.
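As a rough illustration of sampling a maximal itemset without enumerating the pattern space, the sketch below grows an itemset by random extensions until no item can be added while staying frequent. This is a simple random greedy walk, not the framework of Moens and Goethals: it uses plain support as the monotonically decreasing measure, picks extensions uniformly rather than with a guiding approximation measure, and the names `support` and `sample_maximal_itemset` and the toy transactions are all assumptions.

```python
import random


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def sample_maximal_itemset(transactions, min_support, rng):
    """Grow an itemset by random frequent extensions until it is maximal."""
    items = sorted(set().union(*transactions))
    current = frozenset()
    while True:
        # Items whose addition keeps the itemset frequent.
        candidates = [
            i for i in items
            if i not in current
            and support(current | {i}, transactions) >= min_support
        ]
        if not candidates:
            # No frequent extension exists, so the itemset is maximal.
            return current
        current = current | {rng.choice(candidates)}


# Toy transaction database (hypothetical example data).
transactions = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bc"),
    frozenset("abc"),
]

rng = random.Random(0)
sampled = sample_maximal_itemset(transactions, min_support=3, rng=rng)
```

Because support can only decrease as items are added, the walk never has to backtrack. Which maximal itemset is returned depends on the random choices, and this simple walk does not sample maximal itemsets uniformly.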
