When genetic algorithms are used to evolve decision trees, key tree quality parameters can be computed recursively and re-used across generations of partially similar decision trees. Storing instance indices at the leaves is sufficient for fitness to be computed piecewise in a lossless fashion. We derive the (substantial) expected speedup on two bounding-case problems and trace the attractive property of lossless fitness inheritance to the divide-and-conquer nature of decision trees. The theoretical results are supported by experimental evidence.
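The idea of piecewise, lossless fitness computation can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the `evaluate` and `replace_subtree` helpers, the toy binary dataset, and the choice of "number of correctly classified training instances" as the fitness measure are all assumptions made for illustration. Each leaf stores the indices of the training instances that reach it, and each node caches its subtree's correct count, so that swapping one subtree (as a crossover or mutation might do) only requires re-routing the instances that reached the replaced subtree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    feature: Optional[int] = None    # None marks a leaf (hypothetical convention)
    left: Optional["Node"] = None    # branch taken when the feature value is 0
    right: Optional["Node"] = None   # branch taken when the feature value is 1
    label: int = 0                   # predicted class, used by leaves only
    indices: List[int] = field(default_factory=list)  # instances stored at a leaf
    correct: int = 0                 # cached correct count for this subtree


def evaluate(node, X, y, indices):
    """Route `indices` through the subtree and cache per-subtree correct counts."""
    if node.feature is None:
        node.indices = list(indices)
        node.correct = sum(1 for i in indices if y[i] == node.label)
    else:
        left = [i for i in indices if X[i][node.feature] == 0]
        right = [i for i in indices if X[i][node.feature] == 1]
        node.correct = evaluate(node.left, X, y, left) + evaluate(node.right, X, y, right)
    return node.correct


def leaf_indices(node):
    """Collect the instance indices stored at the leaves of a subtree."""
    if node.feature is None:
        return list(node.indices)
    return leaf_indices(node.left) + leaf_indices(node.right)


def replace_subtree(root, path, new_subtree, X, y):
    """Swap in `new_subtree` at a non-empty `path` and patch the cached counts."""
    ancestors, node = [], root
    for step in path:                # path items are "left" or "right"
        ancestors.append(node)
        node = getattr(node, step)
    reached = leaf_indices(node)     # only these instances need re-routing
    delta = evaluate(new_subtree, X, y, reached) - node.correct
    setattr(ancestors[-1], path[-1], new_subtree)
    for ancestor in ancestors:       # inherited counts are adjusted, not recomputed
        ancestor.correct += delta
    return root.correct


def stump(feature):
    """A one-split tree predicting class 0 on the left and class 1 on the right."""
    return Node(feature=feature, left=Node(label=0), right=Node(label=1))


# Toy binary dataset (made up for this sketch).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 1]]
y = [0, 1, 1, 0, 0, 1]
all_rows = list(range(len(X)))

root = stump(0)
before = evaluate(root, X, y, all_rows)

# Incremental update: replace the left leaf with a split on feature 1.
incremental = replace_subtree(root, ["left"], stump(1), X, y)

# Reference: build the same tree from scratch and evaluate it on all rows.
reference = Node(feature=0, left=stump(1), right=Node(label=1))
full = evaluate(reference, X, y, all_rows)
```

In this sketch the incremental count equals the count from a full re-evaluation, because the instances reaching a subtree depend only on the tests above it, which the swap leaves untouched.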
Abstract. This paper focuses on preserving the privacy of sensitive patterns when inducing decision trees. We adopt a record augmentation approach for hiding sensitive classification rules in binary datasets. Such a hiding methodology is preferred over other heuristic solutions, like output perturbation or cryptographic techniques, which restrict the usability of the data, since the raw data itself is readily available for public use. We present some key lemmas related to the hiding process and demonstrate the methodology with an example and an indicative experiment using a prototype hiding tool.
INTRODUCTION
Privacy preserving data mining [1] is a fairly recent research area that tries to alleviate the problems that the use of data mining algorithms poses to the privacy of the data subjects recorded in the data and to the information or knowledge hidden in these piles of data. Agrawal and Srikant [2] were the first to consider the induction of decision trees from anonymized data, which had been sufficiently corrupted with noise to withstand privacy attacks. The generic strand of knowledge hiding research [3] has led to specific algorithms for hiding classification rules, such as noise addition by a data swapping process [4]. A key target area concerns individual data privacy and aims to protect the individual integrity of database records, to prevent the reidentification of individuals or characteristic groups of people through data inference attacks. Another key area is sensitive rule hiding, the subject of this paper, which deals with the protection of sensitive patterns that arise from the application of data mining techniques. Of course, all privacy preservation techniques strive to maintain data information quality.
The main representative of statistical approaches [5] adopts a parsimonious downgrading technique to determine whether the loss of functionality associated with not downgrading the data is worth the extra confidentiality.
Reconstruction techniques involve the redesign of the public dataset [6][7] from the non-sensitive rules produced by algorithms like C4.5 [8] and RIPPER [9]. Perturbation based techniques involve the modification of transactions to support only non-sensitive rules [10], the removal of tuples associated with sensitive rules [11], the suppression of certain attribute values [12] and the redistribution of tuples supporting sensitive patterns so as to maintain the ordering of the rules [13].
In this paper, we propose a series of techniques to efficiently prevent the disclosure of sensitive knowledge patterns in classification rule mining. We aim to hide sensitive rules without compromising the information value of the entire dataset. After an expert selects the sensitive rules, we modify class labels at the tree node corresponding to the tail of the sen...
1 School of Science and Technology, Hellenic Open University, Patras, Greece, email: kalles@eap.gr, verykios@eap.gr, georgios.feretzakis@ac.eap.gr
2 Epignosis Ltd, Athens, Greece, email: papagel@efrontlearning.net
This work deals with stability in incremental induction of decision trees. Stability problems arise when an induction algorithm must revise a decision tree very often and oscillations between similar concepts decrease learning speed. We introduce a heuristic and an algorithm with theoretical and experimental backing to tackle this problem.