2008
DOI: 10.1016/j.ejor.2007.08.008
Adaptive data reduction for large-scale transaction data

Cited by 32 publications (8 citation statements)
References 22 publications
“…This means, in our problem context, as the amount and sensitivity of data increase, the data consumer’s utilities increase, but at a decreasing rate. This observation is well-grounded on the results of numerous prior analytical and empirical studies in the same or a similar context [23,24,9,19]. A typical example is the use of poll to estimate public opinion.…”
Section: The Proposed Pricing Scheme (mentioning)
confidence: 80%
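The diminishing-returns observation in the excerpt above can be made concrete with its own poll example: under simple random sampling, the 95% margin of error shrinks only with the square root of the sample size, so each additional respondent adds less precision than the last. A minimal sketch (the formula is the standard worst-case binomial margin of error, not taken from the cited paper):

```python
import math

# Worst-case (p = 0.5) 95% margin of error for a simple random poll:
# moe ≈ 1.96 * sqrt(0.25 / n) = 0.98 / sqrt(n).
# Quadrupling n only halves the margin: utility from extra data
# grows at a decreasing rate, as the citing papers note.
def margin_of_error(n: int) -> float:
    return 0.98 / math.sqrt(n)

precisions = {n: margin_of_error(n) for n in (100, 400, 1600, 6400)}
```

Each fourfold increase in respondents buys only a halving of the margin, which is the decreasing-rate utility pattern the pricing scheme builds on.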
“…Obviously, releasing a perturbed (or even unperturbed) sample has lower disclosure risk than releasing the complete data set, because less information is released. However, as far as data utility is concerned, data-mining results based on a sample, even unperturbed, could be substantially different from those based on the complete set (Li and Jacob 2005). In terms of methodology, the approach proposed by Gouweleeuw et al (1998) works essentially on individual or blocks of attributes independently, therefore, "the precise effect on more complicated analyses, such as regression models, can be difficult to assess" (Fienberg and McIntyre 2004, p. 24).…”
Section: Related Work (mentioning)
confidence: 96%
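The utility concern raised above, that mining results from a sample can differ from results on the complete data set, is easy to observe even for a simple statistic. An illustrative sketch (synthetic data, not from Li and Jacob 2005):

```python
import random
import statistics

# Illustrative only: a statistic estimated from a released sample
# deviates from the same statistic on the complete data set, which is
# the utility cost of sample-based release discussed in the excerpt.
random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(10_000)]
sample = random.sample(full, 100)

full_mean = statistics.mean(full)
sample_mean = statistics.mean(sample)
gap = abs(sample_mean - full_mean)  # nonzero in general
```

For more complicated analyses (e.g., regression models), such gaps compound, which is why the excerpt flags their effect as difficult to assess.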
“…In spite of this, there have been few studies focused on instance selection (or data reduction) for text classification. That is, if too many instances (i.e., documents) are adopted, it can result in large memory requirements and slow execution speed, and can cause over-sensitivity to noise [21,30]. Furthermore, one problem with using the original data points is that there may not be any located at the precise points that would make for the most accurate and concise concept description [23].…”
Section: Introduction (mentioning)
confidence: 99%
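The instance-selection idea in the excerpt above, keeping only the training documents needed for accurate classification, can be sketched with one classic data-reduction technique, Hart's Condensed Nearest Neighbor (CNN). The tiny two-cluster data set below is illustrative, not from the cited work:

```python
# Minimal sketch of instance selection via Condensed Nearest Neighbor:
# retain only instances that the current 1-NN subset misclassifies.
def condense(points, labels):
    keep = [0]  # seed the retained subset with the first instance
    changed = True
    while changed:
        changed = False
        for i, (p, y) in enumerate(zip(points, labels)):
            if i in keep:
                continue
            # 1-NN prediction using only the retained subset
            nearest = min(
                keep,
                key=lambda j: sum((a - b) ** 2 for a, b in zip(points[j], p)),
            )
            if labels[nearest] != y:  # misclassified: must be retained
                keep.append(i)
                changed = True
    return keep

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
lbs = [0, 0, 0, 1, 1]
subset = condense(pts, lbs)  # two instances suffice for these clusters
```

Here five instances reduce to two, one per cluster, which is the memory- and speed-saving effect the excerpt motivates for text classification.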