Decision Tree Classification with Differential Privacy

Fletcher, Sam; Islam, Zahidul

doi:10.1145/3337064

Cited by 88 publications

(42 citation statements)

References 60 publications

“…The decision tree is a supervised machine learning method. Its basic idea is to classify samples layer by layer by selecting feature attributes and realize an agent based on feature judgment for data classification, feature selection, and other scenarios [38]. As shown in Figure 2, the decision tree algorithm will divide the samples layer by layer according to their attribute values and obtain the judgment results under different attribute combinations, thus forming a tree structure.…”

Section: Methods (mentioning)

confidence: 99%

Using the Machine Learning Method to Study the Environmental Footprints Embodied in Chinese Diet

Liang, Han, Chai, et al. 2020. IJERPH

The food system profoundly affects the sustainable development of the environment and resources. Numerous studies have shown that the food consumption patterns of Chinese residents will bring certain pressure to the environment. Food consumption patterns have individual differences. Therefore, reducing the pressure of food consumption patterns on the environment requires the precise positioning of people with high consumption tendencies. Based on the related concepts of the machine learning method, this paper designs an identification method of the population with a high environmental footprint by using a decision tree as the core and realizes the automatic identification of a large number of users. By using the microdata provided by CHNS (the China Health and Nutrition Survey), we study the relationship between residents’ dietary intake and environmental resource consumption. First, we find that the impact of residents’ food system on the environment shows a certain logistic normal distribution trend. Then, through the decision tree algorithm, we find that four demographic characteristics of gender, income level, education level, and region have the greatest impact on residents’ environmental footprint, where the consumption trends of different characteristics are also significantly different. At the same time, we also use the decision tree to identify the population characteristics with high consumption tendency. This method can effectively improve the identification coverage and accuracy rate and promotes the improvement of residents’ food consumption patterns.

“…A Classification and Regression Tree (CART) is a popular binary decision tree that can be used for classification or regression analysis [10], [33], [41]. This paper considers to reduce the complexity of the model while increasing the diversity of the model to maintain a certain degree of accuracy of the base classifier.…”

Section: B Classification and Regression Tree 1) Improved Cart (mentioning)

confidence: 99%

Research on an Ensemble Classification Algorithm Based on Differential Privacy

Jia, Qiu. 2020. IEEE Access

In the field of information security, privacy protection based on machine learning is currently a hot topic. Combining differential privacy protection with AdaBoost, a machine learning ensemble classification algorithm, this paper proposes a scheme under differential privacy named CART-DPsAdaBoost (CART-Differential privacy structure of AdaBoost). In the process of boosting, the algorithm combines the idea of bagging, and uses a classification and regression tree (CART) stump as the base learner for ensemble learning. Applying feature perturbation, based on a random subspace algorithm, the exponential mechanism is used to select the splitting point for continuous attributes. We use the Gini index to find the optimal binary partitioning point for discrete attributes and add noise according to the Laplace mechanism. Throughout the process, a privacy budget is allocated in order to meet the appropriate differential privacy protection needs for the current application. Unlike similar algorithms, this method does not require discretization during preprocessing of the data. Experimental results with the Census Income, Digit Recognizer, and Adult Data Set show that while protecting private information, the scheme has little impact on classification accuracy and can effectively address large-scale and high-dimensional data classification problems.
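
The abstract above mentions using the exponential mechanism to select a splitting point for continuous attributes. The following is a minimal sketch of that general idea, not the CART-DPsAdaBoost implementation. It assumes a hypothetical utility function (`majority_utility`, the number of records matched by the majority label on each side of the split, which changes by at most 1 when one record is added or removed) and a fixed, data-independent list of candidate thresholds. All function and variable names are illustrative.

```python
import math
import random
from collections import Counter


def majority_utility(values, labels, threshold):
    """Count the records that match the majority label on their side of the split."""
    left = Counter(l for v, l in zip(values, labels) if v <= threshold)
    right = Counter(l for v, l in zip(values, labels) if v > threshold)
    # An empty side contributes nothing to the score.
    return sum(max(side.values()) for side in (left, right) if side)


def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0, rng=None):
    """Pick one candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [utility(c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy data: one continuous attribute and a binary class label.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
# Assumption: candidate thresholds come from a public grid, not from the data.
candidates = [1.5, 2.5, 5.0, 7.5, 8.5]

chosen = exponential_mechanism(
    candidates,
    lambda t: majority_utility(values, labels, t),
    epsilon=1.0,
    rng=random.Random(0),
)
```

A smaller `epsilon` flattens the selection probabilities toward uniform, while a larger one concentrates them on the highest-scoring threshold.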

“…Additionally, some tree‐based differentially private classification algorithms have been proposed in the literature (Blum et al, 2005; Fletcher & Islam, 2017, 2019; Jagannathan et al, 2009; Jagannathan, Monteleoni, & Pillaipakkamnatt, 2013; Patil & Singh, 2014; Rana, Gupta, & Venkatesh, 2015). In 2005, a differentially private version of ID3 (Quinlan, 1992), in which the information gain is estimated with the help of output perturbation by adding noise drawn from Laplace distribution to the results of the count queries, has been proposed (Blum et al, 2005).…”

Section: Differentially Private Classification (mentioning)

confidence: 99%
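
The statement above describes estimating information gain from count queries perturbed with Laplace noise. The sketch below illustrates that general idea only and is not the algorithm of Blum et al. It assumes the attribute values and class labels form a known, public domain (passed in as `values` and `classes`), that each record falls into exactly one (value, class) cell so the noisy cell counts share one budget `epsilon`, and that negative noisy counts are clamped to zero. All names are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """A count query has sensitivity 1, so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def noisy_information_gain(rows, attr, values, classes, epsilon, rng):
    """Estimate the information gain of `attr` from noisy (value, class) counts."""
    cells = {(v, c): 0 for v in values for c in classes}
    for row in rows:
        cells[(row[attr], row["label"])] += 1
    # Perturb every cell count, then clamp negatives to zero (post-processing).
    noisy = {k: max(noisy_count(n, epsilon, rng), 0.0) for k, n in cells.items()}
    # Parent class counts are derived from the noisy cells, costing no extra budget.
    parent = [sum(noisy[(v, c)] for v in values) for c in classes]
    total = sum(parent)
    if total <= 0:
        return 0.0
    remainder = 0.0
    for v in values:
        child = [noisy[(v, c)] for c in classes]
        remainder += (sum(child) / total) * entropy(child)
    return entropy(parent) - remainder


rows = (
    [{"windy": "yes", "label": "stay"}] * 4
    + [{"windy": "no", "label": "play"}] * 4
)
gain = noisy_information_gain(
    rows, "windy", ["yes", "no"], ["stay", "play"], epsilon=1.0, rng=random.Random(1)
)
```

With a small `epsilon` the estimate can differ substantially from the true gain, especially at nodes holding few records.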

“…To provide data security, differential privacy adds random noise drawn from a distribution such as Laplace, to the functions running on sensitive data. There exist three ways to provide differential privacy guarantee: (a) input perturbation (Ji, Lipton, & Elkan, 2014; Mivule, Turner, & Ji, 2012; Sánchez, Domingo‐Ferrer, Martínez, & Soria‐Comas, 2016; Sarwate & Chaudhuri, 2013; Xu, Yang, & Bai, 2019), (b) objective perturbation (Chaudhuri & Monteleoni, 2008; Chaudhuri, Monteleoni, & Sarwate, 2011; Fukuchi, Tran, & Sakuma, 2017; Ji et al, 2014; Rubinstein, Bartlett, Huang, & Taft, 2009; Zhang, Zhang, Xiao, Yang, & Winslett, 2012), and (c) output perturbation (Bojarski, Choromanska, Choromanski, & LeCun, 2014; Fletcher & Islam, 2015, 2019; Friedman & Schuster, 2010; Gursoy, Inan, Nergiz, & Saygin, 2017; Xu et al, 2019). All the three methods add some random noise during the data analysis process to protect individual's privacy.…”

Section: Introduction (mentioning)

confidence: 99%
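
To make the distinction between input and output perturbation in the statement above concrete, here is a hedged sketch that privately estimates a mean both ways. It is an illustration under stated assumptions rather than any cited paper's method: values are clipped to a public range `[lo, hi]`, the number of records is treated as public, and neighboring datasets differ by replacing one record. Objective perturbation needs an optimization problem and is not shown. All names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clip(x, lo, hi):
    """Force a value into the public range [lo, hi]."""
    return min(max(x, lo), hi)


def mean_output_perturbation(data, lo, hi, epsilon, rng):
    """Compute the exact mean, then add noise once to the result."""
    clipped = [clip(x, lo, hi) for x in data]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


def mean_input_perturbation(data, lo, hi, epsilon, rng):
    """Add noise to every record first, then compute the mean of the noisy data."""
    # Each record alone can change by up to (hi - lo).
    scale = (hi - lo) / epsilon
    noisy = [clip(x, lo, hi) + laplace_noise(scale, rng) for x in data]
    return sum(noisy) / len(noisy)


ages = [23, 35, 47, 51, 62]
rng = random.Random(0)
out_estimate = mean_output_perturbation(ages, 0, 100, epsilon=1.0, rng=rng)
in_estimate = mean_input_perturbation(ages, 0, 100, epsilon=1.0, rng=rng)
```

For the same `epsilon`, perturbing every input record typically yields a noisier estimate than perturbing the single output, which is one reason the choice between the two matters for accuracy.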

“…In the literature, differentially private classification algorithms based on k‐NN, ID3, random decision trees, and forests, Naïve Bayes, SVM, and Holte's One Rule have been proposed (Bojarski et al, 2014; Fletcher & Islam, 2015, 2019; Friedman & Schuster, 2010; Gursoy et al, 2017; Rubinstein et al, 2009; Senekane, 2019; Vaidya, Shafiq, Basu, & Hong, 2013; Zhang, Hao, & Wang, 2019; Zorarpacı & Özel, 2020). Briefly overviewing the literature concerning classification with differential privacy, it can be seen that, the majority of the existing methods employ output perturbation technique from differential privacy.…”

Section: Introduction (mentioning)

confidence: 99%

Privacy preserving classification over differentially private data

Zorarpacı, Özel. 2020. WIREs Data Min & Knowl

Privacy preserving data classification is an important research area in data mining field. The goal of a privacy preserving classification algorithm is to protect the sensitive information as much as possible, while providing satisfactory classification accuracy. Differential privacy is a strong privacy guarantee that enables privacy of sensitive data stored in a database by determining the ratio of sensitive information leakage with respect to an ɛ parameter. In this study, our aim is to investigate the classification performance of the state‐of‐the‐art classification algorithms such as C4.5, Naïve Bayes, One Rule, Bayesian Networks, PART, Ripper, K*, IBk, and Random tree for performing privacy preserving classification. To preserve privacy of the data to be classified, we applied input perturbation technique coming from differential privacy, and observed the relationship between the ɛ parameter values and accuracy of the classifiers. To our best knowledge, this article is the first study that analyzes the performances of the well‐known classification algorithms over differentially private data, and discovers which datasets are more suitable for privacy preserving classification when input perturbation is applied to provide data privacy. The classification algorithms are compared by using the differentially private versions of the well‐known datasets from the UCI repository. According to the experimental results, we observed that, as ɛ parameter value increases, better classification accuracies are achieved with lower privacy levels. When the classifiers are compared, Naïve Bayes classifier is the most successful method. The ɛ parameter should be greater than or equal to 2 (i.e., ɛ ≥ 2) to achieve satisfactory classification accuracies. This article is categorized under: Commercial, Legal, and Ethical Issues > Security and Privacy Technologies > Classification
