Proceedings of the 2008 SIAM International Conference on Data Mining 2008
DOI: 10.1137/1.9781611972788.73

A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees

Abstract: Decision trees are among the most popular pattern types in data mining due to their intuitive representation. However, little attention has been given to the definition of measures of semantic similarity between decision trees. In this work, we present a general framework for similarity estimation that includes as special cases the estimation of semantic similarity between decision trees, as well as various forms of similarity estimation on classification datasets with respect to different probability distribu…
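The abstract frames decision-tree similarity semantically, i.e. in terms of what the trees predict over the attribute space rather than how they are shaped. Below is a minimal sketch of one such measure, assuming each tree is flattened into its leaf regions (hyper-rectangles carrying a predicted class) and that the attribute space is weighted uniformly over a bounded domain; the region representation and the uniform weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact formulation): two decision trees
# over the same numeric feature space are represented as lists of
# (hyper-rectangle, predicted class) pairs, i.e. the leaf regions they induce.
# Semantic similarity is taken here as the fraction of the feature-space
# volume on which the two trees predict the same class, assuming a uniform
# distribution over a bounded domain.

from itertools import product

Region = dict  # feature name -> (low, high) interval


def overlap_volume(r1: Region, r2: Region, domain: Region) -> float:
    """Volume of the intersection of two hyper-rectangles inside `domain`."""
    vol = 1.0
    for feat, (d_lo, d_hi) in domain.items():
        lo1, hi1 = r1.get(feat, (d_lo, d_hi))
        lo2, hi2 = r2.get(feat, (d_lo, d_hi))
        lo, hi = max(lo1, lo2, d_lo), min(hi1, hi2, d_hi)
        if hi <= lo:
            return 0.0
        vol *= hi - lo
    return vol


def semantic_similarity(tree_a: list, tree_b: list, domain: Region) -> float:
    """Share of the domain volume on which both trees predict the same class."""
    total = 1.0
    for d_lo, d_hi in domain.values():
        total *= d_hi - d_lo
    agree = 0.0
    for (ra, ca), (rb, cb) in product(tree_a, tree_b):
        if ca == cb:
            agree += overlap_volume(ra, rb, domain)
    return agree / total


# Toy example: one-dimensional trees splitting feature "x" at 0.5 vs 0.6.
domain = {"x": (0.0, 1.0)}
tree_a = [({"x": (0.0, 0.5)}, "neg"), ({"x": (0.5, 1.0)}, "pos")]
tree_b = [({"x": (0.0, 0.6)}, "neg"), ({"x": (0.6, 1.0)}, "pos")]
print(semantic_similarity(tree_a, tree_b, domain))  # -> 0.9
```

Because the leaves of each tree partition the domain, summing the overlap volumes of class-agreeing leaf pairs yields exactly the measure of the region where the two trees make identical predictions.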

Cited by 20 publications (18 citation statements) · References 10 publications · Citing publications span 2011–2023.

“…The stability of algorithms for statistical learning has been studied in statistics and machine learning for quite some time [see, e.g., Bousquet and Elisseeff, 2002, Breiman, 1996, Mukherjee et al., 2006, Poggio et al., 2004], mainly by labeling certain algorithms as being stable or not. Ideas for measuring the stability of results have been presented previously in Turney [1995], Lange et al. [2002], Lim and Yu [2016], Briand et al. [2009] and Ntoutsi et al. [2008], either for the classification or the regression case, but so far not for both.…”
Section: Related Work and Contribution
confidence: 99%
“…Ntoutsi et al [17] presented a general framework for similarity estimation that includes, as a special case, the estimation of semantic similarity between DTs. Zhang and Jiang [18] developed splitting criteria based on similarity.…”
Section: Related Work
confidence: 99%
“…While semantic similarity [17] measures the common feature subspace covered by two DTs for particular decision classes, structural similarity [22][23][24] compares two DTs structurally, focusing on their nodes, branches, total number of leaves, and tree depth. Dogra [7] provided a more complete similarity measure by taking both of these aspects into account.…”
Section: Related Work
confidence: 99%
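The distinction drawn above between semantic and structural similarity lends itself to a simple illustration. The sketch below compares two binary trees purely structurally, summarizing each by its node count, leaf count, and depth and turning the relative differences into a similarity score; the choice of summary features and the normalization are assumptions made for illustration, not a reconstruction of the cited measures.

```python
# Illustrative structural comparison: each tree is reduced to coarse shape
# statistics (number of nodes, number of leaves, depth) and the two summaries
# are compared via a normalized mean relative difference. The feature choice
# and the normalization are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def structure_summary(node: Node) -> tuple:
    """Return (number of nodes, number of leaves, depth) of a binary tree."""
    if node.is_leaf:
        return 1, 1, 1
    n_l, l_l, d_l = structure_summary(node.left) if node.left else (0, 0, 0)
    n_r, l_r, d_r = structure_summary(node.right) if node.right else (0, 0, 0)
    return 1 + n_l + n_r, l_l + l_r, 1 + max(d_l, d_r)


def structural_similarity(a: Node, b: Node) -> float:
    """1 minus the mean relative difference of the structural summaries."""
    sa, sb = structure_summary(a), structure_summary(b)
    diffs = [abs(x - y) / max(x, y) for x, y in zip(sa, sb)]
    return 1.0 - sum(diffs) / len(diffs)


# Toy example: a decision stump vs. a deeper, unbalanced tree.
stump = Node(Node(), Node())
deep = Node(Node(Node(), Node()), Node(Node(), Node(Node(), Node())))
print(structural_similarity(stump, deep))  # ~0.41
```

A semantic measure such as the one sketched earlier could judge these two trees as highly similar if their predictions coincide over most of the attribute space, even though this structural score is low, which is precisely the contrast the quoted passage describes.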
“…We can categorize big data approaches to decision tree induction as follows: building one big tree (Andrzejak et al., 2013; Panda et al., 2009; Ntoutsi et al., 2008; Zhang and Jiang, 2012; Pawlik and Augsten, 2011; Narlikar, 1998; Sreenivas et al., 2000; Goil and Choudhary, 2001; Amado et al., 2001; Domingos and Hulten, 2000; Dai and Ji, 2014), transferring all decision trees into one rule base and back into a decision tree, ensemble approaches (Louppe and Geurts, 2012; Hansen and Salamon, 1990; Sollich and Krogh, 1996; Breiman, 1999), and others (e.g., Kargupta and Park, 2004) that do not build a new tree and use a combination of tree results. According to Ben-Haim and Tom-Tov (2010), another way to categorize the different types of algorithms for handling large datasets is to divide them into the following two groups: pre-sorting of data and using approximate representations of data.…”
Section: Background and Related Work
confidence: 99%
“…This approach usually excels in accuracy but needs significant computing resources (Ben-Haim and Tom-Tov, 2010). The computing resources are needed for controlling the parallel stage and for dividing the database in a specific way (Panda et al., 2009), as well as for merging parts of trees in the post-processing phase (Andrzejak et al., 2013; Panda et al., 2009; Ntoutsi et al., 2008; Zhang and Jiang, 2012). The need for extensive computational resources and the long processing time are considered major disadvantages in cases where fast results are needed for decision making.…”
Section: Background and Related Work
confidence: 99%