Proceedings of the 2008 SIAM International Conference on Data Mining 2008
DOI: 10.1137/1.9781611972788.73

A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees

Abstract: Decision trees are among the most popular pattern types in data mining due to their intuitive representation. However, little attention has been given to the definition of measures of semantic similarity between decision trees. In this work, we present a general framework for similarity estimation that includes as special cases the estimation of semantic similarity between decision trees, as well as various forms of similarity estimation on classification datasets with respect to different probability distribu…
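The abstract frames decision-tree similarity semantically, i.e. in terms of what the trees predict over the attribute space rather than how they are shaped. Below is a minimal sketch of one such measure, assuming each tree is flattened into its leaf regions (hyper-rectangles carrying a predicted class) and that the attribute space is weighted uniformly over a bounded domain; the region representation and the uniform weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact formulation): two decision trees
# over the same numeric feature space are represented as lists of
# (hyper-rectangle, predicted class) pairs, i.e. the leaf regions they induce.
# Semantic similarity is taken here as the fraction of the feature-space
# volume on which the two trees predict the same class, assuming a uniform
# distribution over a bounded domain.

from itertools import product

Region = dict  # feature name -> (low, high) interval


def overlap_volume(r1: Region, r2: Region, domain: Region) -> float:
    """Volume of the intersection of two hyper-rectangles inside `domain`."""
    vol = 1.0
    for feat, (d_lo, d_hi) in domain.items():
        lo1, hi1 = r1.get(feat, (d_lo, d_hi))
        lo2, hi2 = r2.get(feat, (d_lo, d_hi))
        lo, hi = max(lo1, lo2, d_lo), min(hi1, hi2, d_hi)
        if hi <= lo:
            return 0.0
        vol *= hi - lo
    return vol


def semantic_similarity(tree_a: list, tree_b: list, domain: Region) -> float:
    """Share of the domain volume on which both trees predict the same class."""
    total = 1.0
    for d_lo, d_hi in domain.values():
        total *= d_hi - d_lo
    agree = 0.0
    for (ra, ca), (rb, cb) in product(tree_a, tree_b):
        if ca == cb:
            agree += overlap_volume(ra, rb, domain)
    return agree / total


# Toy example: one-dimensional trees splitting feature "x" at 0.5 vs 0.6.
domain = {"x": (0.0, 1.0)}
tree_a = [({"x": (0.0, 0.5)}, "neg"), ({"x": (0.5, 1.0)}, "pos")]
tree_b = [({"x": (0.0, 0.6)}, "neg"), ({"x": (0.6, 1.0)}, "pos")]
print(semantic_similarity(tree_a, tree_b, domain))  # -> 0.9
```

Because the leaves of each tree partition the domain, summing the overlap volumes of class-agreeing leaf pairs yields exactly the measure of the region where the two trees make identical predictions.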

Cited by 20 publications (18 citation statements) · References 10 publications · Citing publications span 2011–2023.

“…The stability of algorithms for statistical learning has been studied in statistics and machine learning for quite some time [see, e.g., Bousquet and Elisseeff, 2002, Breiman, 1996, Mukherjee et al., 2006, Poggio et al., 2004], mainly by labeling certain algorithms as being stable or not. Ideas for measuring the stability of results have been presented previously in Turney [1995], Lange et al. [2002], Lim and Yu [2016], Briand et al. [2009] and Ntoutsi et al. [2008], either for the classification or the regression case, but so far not for both.…”
Section: Related Work and Contribution
confidence: 99%
“…Ntoutsi et al [17] presented a general framework for similarity estimation that includes, as a special case, the estimation of semantic similarity between DTs. Zhang and Jiang [18] developed splitting criteria based on similarity.…”
Section: Related Work
confidence: 99%
“…While semantic similarity [17] measures the common feature subspace covered by two DTs for particular decision classes, structural similarity [22][23][24] compares two DTs structurally, focusing on their nodes, branches, total number of leaves, and tree depth. Dogra [7] provided a more complete similarity measure by taking both of these aspects into account.…”
Section: Related Work
confidence: 99%
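The distinction drawn above between semantic and structural similarity lends itself to a simple illustration. The sketch below compares two binary trees purely structurally, summarizing each by its node count, leaf count, and depth and turning the relative differences into a similarity score; the choice of summary features and the normalization are assumptions made for illustration, not a reconstruction of the cited measures.

```python
# Illustrative structural comparison: each tree is reduced to coarse shape
# statistics (number of nodes, number of leaves, depth) and the two summaries
# are compared via a normalized mean relative difference. The feature choice
# and the normalization are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def structure_summary(node: Node) -> tuple:
    """Return (number of nodes, number of leaves, depth) of a binary tree."""
    if node.is_leaf:
        return 1, 1, 1
    n_l, l_l, d_l = structure_summary(node.left) if node.left else (0, 0, 0)
    n_r, l_r, d_r = structure_summary(node.right) if node.right else (0, 0, 0)
    return 1 + n_l + n_r, l_l + l_r, 1 + max(d_l, d_r)


def structural_similarity(a: Node, b: Node) -> float:
    """1 minus the mean relative difference of the structural summaries."""
    sa, sb = structure_summary(a), structure_summary(b)
    diffs = [abs(x - y) / max(x, y) for x, y in zip(sa, sb)]
    return 1.0 - sum(diffs) / len(diffs)


# Toy example: a decision stump vs. a deeper, unbalanced tree.
stump = Node(Node(), Node())
deep = Node(Node(Node(), Node()), Node(Node(), Node(Node(), Node())))
print(structural_similarity(stump, deep))  # ~0.41
```

A semantic measure such as the one sketched earlier could judge these two trees as highly similar if their predictions coincide over most of the attribute space, even though this structural score is low, which is precisely the contrast the quoted passage describes.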
“…We can categorize big data approaches to decision tree induction as follows: building one big tree (Andrzejak et al., 2013; Panda et al., 2009; Ntoutsi et al., 2008; Zhang and Jiang, 2012; Pawlik and Augsten, 2011; Narlikar, 1998; Sreenivas et al., 2000; Goil and Choudhary, 2001; Amado et al., 2001; Domingos and Hulten, 2000; Dai and Ji, 2014), transferring all decision trees into one rule base and back into a decision tree, ensemble approaches (Louppe and Geurts, 2012; Hansen and Salamon, 1990; Sollich and Krogh, 1996; Breiman, 1999), and others (e.g., Kargupta and Park, 2004) that do not build a new tree and use a combination of tree results. According to Ben-Haim and Tom-Tov (2010), another way to categorize the different types of algorithms for handling large datasets is to divide them into the following two groups: pre-sorting of data and using approximate representations of data.…”
Section: Background and Related Work
confidence: 99%
“…This approach usually excels in accuracy but needs significant computing resources (Ben-Haim and Tom-Tov, 2010). The computing resources are needed for controlling the parallel stage and for dividing the database in a specific way (Panda et al., 2009), as well as for merging parts of trees in the post-processing phase (Andrzejak et al., 2013; Panda et al., 2009; Ntoutsi et al., 2008; Zhang and Jiang, 2012). The need for extensive computational resources and the long processing time are considered major disadvantages in cases where fast results are needed for decision making.…”
Section: Background and Related Work
confidence: 99%