Abstract. We examine a recent proposal for data-privatization by testing it against well-known attacks; we show that all of these attacks successfully retrieve a relatively large (and unacceptable) portion of the original data. We then indicate how the data-privatization method examined can be modified to withstand these attacks, and we compare the performance of the two approaches. We also show that the new method has better privacy and lower information loss than the former method.
Keywords: data-privatization, information loss, Chebyshev polynomial, Spectral Filtering, Bayes-Estimated Data Reconstruction, data mining.
1 Introduction and Background
Data-Privatization
Privacy preservation is an important issue in many data mining applications dealing with sensitive data such as health-care records. Privacy-preserving data mining (PPDM) has become an important enabling technology for integrating data and determining interesting patterns from private collections of databases, thus improving productivity and competitiveness for many businesses. PPDM requires data modification that limits information loss (thereby increasing utility), since it is intended that a legitimate receiver of the modified data be able to recover the original data needed for a response. Perturbation techniques have to manage the intrinsic trade-off between preserving data privacy and limiting information loss, as each affects the other. Several perturbation techniques [1]-[5] have been proposed for mining purposes, but none of these papers balances privacy and utility satisfactorily.
In the research literature, there are two general approaches to privacy-preserving data mining: the randomization approach [1] and the secure multi-party computation approach [6]. We focus only on the former because it can distort data more efficiently than the latter. There are two major randomization methods: Random Perturbation [2] and Randomized Response [5]. Random Perturbation deals mostly with numerical data, perturbing attribute by attribute and concentrating on a statistical analysis of the data; it is a well-studied sanitization method that allows access to the data by publishing them while preserving their privacy. Randomized Response perturbs multiple attributes at once rather than one at a time, so we do not consider it further.
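To make the attribute-by-attribute idea behind Random Perturbation concrete, the following is a minimal sketch and not the method of [2] or of the proposal examined in this paper. It assumes an additive model in which independent zero-mean Gaussian noise is added to each numerical attribute; the function names, the toy attributes (`age`, `systolic_bp`), and the noise levels are hypothetical choices made only for illustration.

```python
import random
import statistics


def perturb_attribute(values, noise_std, rng):
    """Return a copy of one numerical attribute with zero-mean Gaussian noise added.

    Assumption: additive noise model; noise_std is a hypothetical tuning parameter.
    """
    return [v + rng.gauss(0.0, noise_std) for v in values]


def perturb_table(columns, noise_std_by_column, seed=0):
    """Perturb each column independently, i.e. attribute by attribute."""
    rng = random.Random(seed)
    return {
        name: perturb_attribute(values, noise_std_by_column[name], rng)
        for name, values in columns.items()
    }


# Hypothetical toy data standing in for sensitive numerical records.
data_rng = random.Random(42)
original = {
    "age": [data_rng.uniform(20, 80) for _ in range(5000)],
    "systolic_bp": [data_rng.gauss(120, 15) for _ in range(5000)],
}

# Larger noise_std means more privacy but more information loss.
released = perturb_table(original, {"age": 10.0, "systolic_bp": 8.0}, seed=1)

for name in original:
    print(
        name,
        "original mean:", round(statistics.mean(original[name]), 2),
        "released mean:", round(statistics.mean(released[name]), 2),
    )
```

Under these assumptions, individual released values differ from the originals, while aggregate statistics such as the per-attribute mean remain close. This illustrates the trade-off noted above: increasing the noise level hides individual records better but also increases the variance of the released attribute, and hence the information loss.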