Scalable Distributed Data Anonymization for Large Datasets

Vimercati, Sabrina De Capitani di; Facchinetti, Dario; Foresti, Sara; Livraga, Giovanni; Oldani, Gianluca; Paraboschi, Stefano; Rossi, Matthew; Samarati, Pierangela

doi:10.1109/tbdata.2022.3207521

Cited by 1 publication

(5 citation statements)

References 36 publications

“…The H-PGPkAA algorithm showed good acceleration with slightly equal information loss as that of the GkAA algorithm. Additionally, we believe that the data reorganization used in these algorithms can be applied to improve data utility while achieving k-anonymity in other distributed parallel algorithms employing horizontal data partitioning, including the GCCG algorithm [10] or the extended Mondrian algorithm [31].…”

Section: Discussion (mentioning)

confidence: 99%

“…There has been extensive research conducted on distributed privacy preserving data publishing over the last twenty years [27]- [31]. Within this domain, the primary challenges revolve around data partitioning and aggregation strategies.…”

Section: A Related Work (mentioning)

confidence: 99%

“…In a later study conducted in 2022, researchers explored an approach to scalable data anonymization, as detailed in [31]. This approach was designed to address the challenges of anonymizing large datasets within a distributed memory environment, specifically utilizing Apache Hadoop.…”

Section: A Related Work (mentioning)

confidence: 99%

“…This challenge becomes particularly pronounced as the number of processors (or workers) and the value of k (the anonymity constraint) increase, or when dealing with relatively small datasets. The earlier study conducted and detailed in [31] was restricted to a mere 12 workers. Consequently, this limitation failed to provide a comprehensive view of the information loss, particularly for higher k values, such as k = 100, resulting from excessive dataset partitioning.…”

Section: B Our Contribution (mentioning)

confidence: 99%

“…It is important to acknowledge that the partitioning process may potentially introduce some disruption to the obtained results, with a slight increase in terms of information loss [10], [31] compared to the centralized GkAA algorithm [12]. However, we anticipate that the time savings achieved through our method will outweigh this trade-off.…”

Section: B Pgpkaa: Cgm-based Parallel Algorithm Based On the Partitio... (mentioning)

confidence: 99%

Achieving k-anonymity of a large-scale database in a distributed memory environment

Vadèle, Nanfack, Mahec et al. 2024

Preprint

k-anonymity, introduced by Samarati and Sweeney in 1998, guarantees that each user's data cannot be distinguished from that of at least (k − 1) others in the same database. The methods used to achieve k-anonymity cause information loss, because the stored data is made less accurate through generalization or micro-aggregation. Mauger et al. proposed an O(n²)-time sequential algorithm that gives good results while minimizing the information loss under their own metrics. However, their solution is very time-consuming and therefore unsuitable for large-scale databases. In this paper, we tackle this problem using parallelism. We propose three coarse-grained parallel algorithms to solve the k-anonymity problem. The first is the straightforward algorithm, which runs in O(n²/p) execution time with O(n) communication rounds, where n is the number of lines in the database and p is the number of processors. The second runs in O(n²/p²) execution time with O²(p) communication rounds. The third runs in O(n²/plog2(p)) execution time with O(np/(log2(p))²) communication rounds. For the latter two algorithms, we introduce the concept of data reorganization to minimize the information loss when data are partitioned. Experimental results show that for a database of size n = 10^6, p = 2^7, and k = 10^2, the second and third parallel algorithms are respectively 1127.59× and 41.13× faster than the sequential algorithm while achieving anonymity with 4.03% and 2.62% information loss.
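The k-anonymity guarantee described in the abstract can be illustrated with a minimal Python sketch. It is not taken from either paper: the function name `is_k_anonymous`, the column names, and the sample records are all hypothetical, and the check only tests whether an already generalized table satisfies the property. It does not perform the generalization or micro-aggregation itself.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    present in `rows` is shared by at least k records."""
    # Count how many records fall into each quasi-identifier group.
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized records (illustrative only).
rows = [
    {"age": "30-39", "zip": "201**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "201**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "202**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "202**", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(rows, ["age", "zip"], 2)
three_anonymous = is_k_anonymous(rows, ["age", "zip"], 3)
```

In this sample each (age, zip) group holds exactly two records, so the table passes the check for k = 2 and fails it for k = 3.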
