Towards Semantic Microaggregation of Categorical Data for Confidential Documents

Abril, Daniel; Navarro‐Arribas, Guillermo; Torra, Vicenç

doi:10.1007/978-3-642-16292-3_26

Cited by 16 publications

(16 citation statements)

References 8 publications


“…However, the accurate centroid calculus for non-numerical data is challenging due to the lack of semantic aggregation operators and the necessity of considering a discrete set of centroid values. Related works propose methods to compute centroids for non-numerical data either relying on the distributional features of data, where the centroid is the modal value [23], or on background semantic, where the centroid is the term that generalises all aggregated values in a taxonomy [24]. Since only one dimension of data (distribution or semantics) is considered, both approaches result in suboptimal results [25].…”

Section: The Centroid Of Categorical Values (mentioning)

confidence: 98%
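The two centroid strategies contrasted in the statement above can be illustrated with a minimal sketch. The taxonomy `PARENT` and the function names are hypothetical, made up for illustration only; they are not taken from the cited works, which rely on real ontologies.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); an assumption for illustration.
PARENT = {
    "trekking": "outdoor sport",
    "jogging": "outdoor sport",
    "outdoor sport": "sport",
    "classical dance": "dance",
    "dance": "activity",
    "sport": "activity",
    "activity": None,
}

def modal_centroid(values):
    """Distribution-based centroid: the most frequent value."""
    return Counter(values).most_common(1)[0][0]

def ancestors(term):
    """Path from a term up to the taxonomy root, including the term itself."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def taxonomy_centroid(values):
    """Semantics-based centroid: the most specific term generalising all values."""
    common = set(ancestors(values[0]))
    for v in values[1:]:
        common &= set(ancestors(v))
    # Walking up from any value, the first shared ancestor is the most specific one.
    return next(t for t in ancestors(values[0]) if t in common)

hobbies = ["trekking", "jogging", "jogging", "classical dance"]
print(modal_centroid(hobbies))     # ignores meaning, uses frequency only
print(taxonomy_centroid(hobbies))  # ignores frequency, uses the taxonomy only
```

Each function looks at a single dimension of the data (distribution or semantics), which is the limitation the quoted statement points out.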

“…Since arithmetic functions cannot be applied to this kind of data, a straightforward way to apply MDAV to categorical data consists on using Boolean equality/inequality operators [18,23] or to use the common abstraction of a set of values in an ontology as the centroid [24].…”

Section: Semantic Microaggregation (mentioning)

confidence: 99%
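As a rough sketch of what a Boolean equality/inequality comparison between categorical records could look like, the snippet below counts mismatching attributes. The function names and the sample records are assumptions for illustration; this is not the full MDAV procedure, only the distance it would need.

```python
def boolean_distance(record_a, record_b):
    """Count the attributes on which two categorical records differ (0 or 1 per attribute)."""
    if len(record_a) != len(record_b):
        raise ValueError("records must have the same number of attributes")
    return sum(a != b for a, b in zip(record_a, record_b))

def farthest_record(records, reference):
    """Return the record most distant from `reference` under the Boolean distance."""
    return max(records, key=lambda r: boolean_distance(r, reference))

# Hypothetical records: (occupation, hobby)
records = [("nurse", "jogging"), ("nurse", "trekking"), ("lawyer", "classical dance")]
print(farthest_record(records, ("nurse", "jogging")))
```

Under this distance "jogging" is as far from "trekking" as from "classical dance", which is why the citing works argue for semantically grounded alternatives.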

“…This is related to the fact that, unlike other masking methods [22][23][24], the Heer's approach was designed without considering k-anonymity (formalized years later in [13]). Hence, resampled results cannot guarantee an a priori level of privacy.…”

Section: Semantic Resampling (mentioning)

confidence: 99%


Semantic Anonymisation of Categorical Datasets

Martínez; Valls; Sánchez

2014, Studies in Computational Intelligence


The exploitation of microdata compiled by statistical agencies is of great interest for the data mining community. However, such data often include sensitive information that can be directly or indirectly related to individuals. Hence, an appropriate anonymisation process is needed to minimise the risk of disclosing identities and/or confidential data. In the past, many anonymisation methods have been developed to deal with numerical data, but approaches tackling the anonymisation of non-numerical values (e.g. categorical, textual) are scarce and shallow. Since the utility of this kind of information is closely related to the preservation of its meaning, in this work, the notion of semantic similarity is used to enable a semantically coherent anonymisation. The knowledge modelled in ontologies is used as the basic pillar to propose semantic operators that enable an accurate management and transformation of categorical attributes. These operators are then used in three anonymisation mechanisms: Semantic Recoding, Semantic and Adaptive Microaggregation and Semantic Resampling. The three algorithms are compared in terms of semantic utility, privacy disclosure risk and runtime, with encouraging results.


“…In [1] authors use the WordNet structured thesaurus [37] as ontology to assist the classification and masking of confidential textual documents. WordNet models and semantically interlinks more than 100,000 concepts referred by means of English textual labels.…”

Section: Related Work (mentioning)

confidence: 99%

“…Hence, when constructing the centroid of a dataset with textual attributes, the similarities between their meanings (evaluated at a conceptual level) should be taken into consideration (e.g., for hobbies attribute, "trekking" value is more similar to "jogging" than to "classical dance"). Related works constructing centroids for textual attribute values typically omit [16,46] or slightly consider [1,22] data semantics during their analysis, hampering the quality of the results [6,33].…”

Section: Introduction (mentioning)

confidence: 99%