Mining Statistically Significant Sequential Patterns

Low‐Kam, Cécile; Raïssi, Chedy; Kaytoue, Mehdi; Pei, Jian

doi:10.1109/icdm.2013.124

Cited by 35 publications

(34 citation statements)

References 28 publications


“…There are a number of directions for further research. Among these, we find particularly interesting and challenging the extension of our method to other definitions of statistical significance for patterns and to other definitions of patterns such as sequential patterns [46]. Also interesting is the derivation of better lower bounds to the VC-dimension of the range set of a collection of itemsets.…”

Section: Discussion (mentioning)

confidence: 99%

Finding the True Frequent Itemsets

Riondato, Vandin. 2014. Proceedings of the 2014 SIAM International Conference on Data Mining


Frequent Itemsets (FIs) mining is a fundamental primitive in knowledge discovery. It requires to identify all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1 − δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
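The baseline that the abstract contrasts against, a concentration bound on the binomial frequency of each itemset, can be sketched as follows. This is only an illustration and not the VC-dimension method of the cited paper: it uses Hoeffding's inequality (a Chernoff-type bound) with a union bound over a candidate set, and the function name and the `n_candidates` parameter are assumptions introduced here.

```python
import math


def hoeffding_threshold(theta, n_transactions, n_candidates, delta):
    """Raised frequency threshold from Hoeffding's inequality plus a union bound.

    Hypothetical helper, not taken from the cited paper. For an itemset whose
    true frequency is below theta, Hoeffding gives
        P(observed frequency >= theta + eps) <= exp(-2 * n * eps**2).
    Choosing eps so that this equals delta / n_candidates and applying a union
    bound over all candidates keeps the probability of any false positive
    at or below delta.
    """
    if n_transactions <= 0 or n_candidates <= 0 or not 0 < delta < 1:
        raise ValueError("need n_transactions > 0, n_candidates > 0, 0 < delta < 1")
    eps = math.sqrt(math.log(n_candidates / delta) / (2 * n_transactions))
    return theta + eps


# Example: 10,000 transactions, 1,000 candidate itemsets, 5% failure probability.
raised = hoeffding_threshold(0.1, 10_000, 1_000, 0.05)
print(f"mine at frequency {raised:.4f} instead of 0.1")
```

Because the number of candidate itemsets grows exponentially with the number of items, the union-bound term can become large, which is one reason a bound of this kind tends to be conservative.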


“…[15] solves the problem of mining statistically significant substrings in a string generated from a memoryless Bernoulli distribution and uses the chi-square statistic as a quantitative measure of statistical significance. The statistical significance is considered for the sequential pattern mining problem as well in [16]. The approach developed by the authors is able to efficiently mine unexpected patterns in sequence of itemsets without considering overlapping occurrences or conditioning the length of the sequence.…”

Section: Related Work (mentioning)

confidence: 99%

Mining Statistically Significant Attribute Associations in Attributed Graphs

Lee, Park, Prabhakar. 2016. 2016 IEEE 16th International Conference on Data Mining (ICDM)


Recently, graphs have been widely used to represent many different kinds of real world data or observations such as social networks, protein-protein networks, road networks, and so on. In many cases, each node in a graph is associated with a set of its attributes and it is critical to not only consider the link structure of a graph but also use the attribute information to achieve more meaningful results in various graph mining tasks. Most previous works with attributed graphs take into account attribute relationships only between individually connected nodes. However, it should be greatly valuable to find out which sets of attributes are associated with each other and whether they are statistically significant or not. Mining such significant associations, we can uncover novel relationships among the sets of attributes in the graph. We propose an algorithm that can find those attribute associations efficiently and effectively, and show experimental results that confirm the high applicability of the proposed algorithm.


“…Instead of using windows of fixed length, ranking based on minimal window lengths with respect to the independence model was suggested by Tatti [11]. Ranking serial episodes allowing multiple labels using the independence model was suggested by Low-Kam et al [7]. Achar et al [1] also considered a measure that downranks the episode if there is a non-edge (x, y) that occurs rarely, which suggests that we should augment the episode with the edge (y, x).…”

Section: Related Work (mentioning)

confidence: 99%

Ranking episodes using a partition model

Tatti. 2015. Data Min Knowl Disc


One of the biggest setbacks in traditional frequent pattern mining is that overwhelmingly many of the discovered patterns are redundant. A prototypical example of such redundancy is a freerider pattern where the pattern contains a true pattern and some additional noise events. A technique for filtering freerider patterns that has proved to be efficient in ranking itemsets is to use a partition model where a pattern is divided into two subpatterns and the observed support is compared to the expected support under the assumption that these two subpatterns occur independently. In this paper we develop a partition model for episodes, patterns discovered from sequential data. An episode is essentially a set of events, with possible restrictions on the order of events. Unlike with itemset mining, computing the expected support of an episode requires surprisingly sophisticated methods. In order to construct the model, we partition the episode into two subepisodes. We then model how likely the events in each subepisode occur close to each other. If this probability is high (which is often the case if the subepisode has a high support), then we can expect that when one event from a subepisode occurs, then the remaining events occur also close by. This approach increases the expected support of the episode, and if this increase explains the observed support, then we can deem the episode uninteresting. We demonstrate in our experiments that using the partition model can effectively and efficiently reduce the redundancy in episodes.
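The itemset version of the partition model that the abstract mentions as background can be sketched as follows. This is a minimal illustration of comparing observed support with the support expected under independence of two subpatterns; it is not the episode model developed in the cited paper, and the function name and toy data are assumptions introduced here.

```python
def partition_supports(transactions, part_a, part_b):
    """Observed vs. expected support of the itemset part_a | part_b.

    Hypothetical helper, not taken from the cited paper. The expected support
    assumes the two subpatterns occur independently:
        expected = n * (supp_a / n) * (supp_b / n) = supp_a * supp_b / n
    """
    n = len(transactions)
    if n == 0:
        raise ValueError("need at least one transaction")
    a, b = set(part_a), set(part_b)
    supp_a = sum(1 for t in transactions if a <= set(t))
    supp_b = sum(1 for t in transactions if b <= set(t))
    observed = sum(1 for t in transactions if (a | b) <= set(t))
    expected = supp_a * supp_b / n
    return observed, expected


# Toy data: "a" and "b" always co-occur, while "c" is unrelated noise.
data = [{"a", "b", "c"}, {"a", "b"}, {"c"}, set()]

# Splitting {a, b} into {a} and {b}: observed support exceeds the expectation.
print(partition_supports(data, {"a"}, {"b"}))

# Splitting {a, b, c} into {a, b} and {c}: independence explains the support,
# so {a, b, c} behaves like a freerider pattern.
print(partition_supports(data, {"a", "b"}, {"c"}))
```

When the observed support is close to the expected support for some partition, that partition already explains the pattern, which is the sense in which the abstract calls a pattern uninteresting.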

