One in a million: picking the right patterns

Bringmann, Björn; Zimmermann, Albrecht

doi:10.1007/s10115-008-0136-4

Cited by 33 publications

(30 citation statements)

References 13 publications


“…Another exciting question is whether our results on the optimality of supervised feature selection can be transferred to techniques for unsupervised feature selection on frequent subgraphs [5]. We are positive that this is possible (S. Nijssen, personal communication (2008)).…”

Section: Discussion (mentioning)

confidence: 92%

Near-optimal supervised feature selection among frequent subgraphs

Thoma, Cheng, Gretton, et al. 2009

Proceedings of the 2009 SIAM International Conference on Data Mining


Graph classification is an increasingly important step in numerous application domains, such as function prediction of molecules and proteins, computerised scene analysis, and anomaly detection in program flows. Among the various approaches proposed in the literature, graph classification based on frequent subgraphs is a popular branch: Graphs are represented as (usually binary) vectors, with components indicating whether a graph contains a particular subgraph that is frequent across the dataset. On large graphs, however, one faces the enormous problem that the number of these frequent subgraphs may grow exponentially with the size of the graphs, but only few of them possess enough discriminative power to make them useful for graph classification. Efficient and discriminative feature selection among frequent subgraphs is hence a key challenge for graph mining. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
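The greedy selection idea mentioned in the abstract above can be illustrated with a minimal sketch. The quality criterion used here (the number of pairs of examples from different classes that have identical values on the selected features) is an assumption chosen for illustration and is not taken from this page; the function names `count_unresolved_pairs` and `greedy_select` and the toy data are likewise hypothetical. Graphs are assumed to be already encoded as binary vectors of subgraph occurrences, as the abstract describes.

```python
def count_unresolved_pairs(vectors, labels, selected):
    """Count pairs of examples from different classes that look
    identical when only the features in `selected` are considered.
    (Assumed criterion for illustration; lower is better.)"""
    groups = {}
    for vec, label in zip(vectors, labels):
        key = tuple(vec[f] for f in selected)
        counts = groups.setdefault(key, {})
        counts[label] = counts.get(label, 0) + 1
    total = 0
    for counts in groups.values():
        n = sum(counts.values())
        same_class = sum(c * c for c in counts.values())
        # Unordered pairs within the group whose labels differ.
        total += (n * n - same_class) // 2
    return total


def greedy_select(vectors, labels, k):
    """Greedily add up to k features, each time taking the feature
    that leaves the fewest unresolved cross-class pairs. Stops early
    when no remaining feature improves the score."""
    n_features = len(vectors[0])
    selected = []
    best = count_unresolved_pairs(vectors, labels, selected)
    for _ in range(k):
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        score, feature = min(
            (count_unresolved_pairs(vectors, labels, selected + [f]), f)
            for f in candidates
        )
        if score >= best:
            break
        selected.append(feature)
        best = score
    return selected, best


# Toy data: four graphs, three candidate subgraph features (hypothetical).
toy_vectors = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
toy_labels = [0, 0, 1, 1]
chosen, remaining = greedy_select(toy_vectors, toy_labels, k=2)
```

In this toy data only the first feature separates the two classes, so the loop stops after one pick. Whether a greedy pass of this kind carries a near-optimality guarantee depends on the criterion being submodular, which the abstract claims for its own criterion, not for this sketch.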


“…Top-K most similar pattern use for the pattern-based classification is a common strategy [1,14]. Other strategies, such as, similarity between patterns [2] or emerging patterns [4], are used for the optimal pattern selection. In this paper we define similarity metric for pattern and context which incorporates hierarchical attribute structure.…”

Section: Related Work (mentioning)

confidence: 99%

Using Closed n-set Patterns for Spatio-Temporal Classification

Samulevicius, Pitarch, Pedersen 2014

Data Warehousing and Knowledge Discovery


This is an author-deposited version published in OATAO: http://oatao.univ-toulouse.fr/ (Eprints ID: 15178). The contribution was presented at DaWaK 2014: http://www.dexa.org/dawak2014

Abstract. Today, huge volumes of sensor data are collected from many different sources. One of the most crucial data mining tasks considering this data is the ability to predict and classify data to anticipate trends or failures and take adequate steps. While the initial data might be of limited interest itself, the use of additional information, e.g., latent attributes, spatio-temporal details, etc., can add significant value and interestingness. In this paper we present a classification approach, called Closed n-set Spatio-Temporal Classification (CnSC), which is based on the use of latent attributes, pattern mining, and classification model construction. As the number of generated patterns is huge, we employ a scalable NoSQL-based graph database for efficient storage and retrieval. By considering hierarchies in the latent attributes, we define pattern and context similarity scores. The classification model for a specific context is constructed by aggregating the most similar patterns. The presented approach, CnSC, is evaluated on a real dataset and shows competitive results compared with other prediction strategies.


“…Pre-pruning relational data has been a very promising research topic (Bringmann and Zimmermann, 2009). Cohen (1995) introduced a method to filter irrelevant literals out of relational examples in a text mining context.…”

Section: Related Work (mentioning)

confidence: 99%

Reducing the size of databases for multirelational classification: a subgraph-based approach

2012


Guo, Hongyu; Viktor, Herna L.; Paquet, Eric. Journal of Intelligent Information Systems, November 2012. NRC Publications Archive. This publication could be one of several versions: author's original, accepted manuscript or the publisher's version. For the publisher's version, see http://dx.doi.org/10.1007/s10844-012-0229-0

Abstract. Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database often spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, financial management, and so on. When considering classification, different phases of the knowledge discovery process are affected by economic utility. For instance, in the data preprocessing process, one must consider the cost associated with acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. In order to address these utility-based issues, the paper presents an approach to create a pruned database for multirelational classification, while minimizing predictive performance loss on the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema, to use for training, and discards all others. The experiments performed show that our strategy is able to, without sacrificing predictive accuracy, significantly reduce the size of the databases, in terms of the number of relations, tuples, and attributes. The approach prunes the sizes of databases by as much as 94%. Such reduction also results in decreasing computational cost of the learning process. The method improves the multirelational learning algorithms' execution time by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
