Proceedings of the ACM Web Conference 2022
DOI: 10.1145/3485447.3512232
DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders

Abstract: While vast amounts of personal data are shared daily on public online platforms and used by companies and analysts to gain valuable insights, privacy concerns are also on the rise: Modern authorship attribution techniques have proven effective at identifying individuals from their data, such as their writing style or their behavior when picking and judging movies. It is hence crucial to develop data sanitization methods that allow sharing of users' data while protecting their privacy and preserving quality and content…

Cited by 4 publications (3 citation statements)
References 47 publications
“…Additionally, since multiple features can be created for each of the k occurrences of a term, we define two data sets as being adjacent if they differ by all K features associated with a given term, |x| − |x′| = K, K ≥ k. Three previous works have explored providing actual user-level differential privacy against text-based linkage attacks [24,25,26], also known as authorship attribution attacks. At first glance this may sound similar to our work here; however, there is a key difference: these works focus on protecting the privacy of the people providing the text, rather than protecting the confidentiality of the text itself.…”
Section: User-Level Privacy as Term-Level Privacy
confidence: 99%
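The adjacency condition |x| − |x′| = K, K ≥ k quoted above can be illustrated as a small check over feature sets. This is a minimal sketch under assumptions: the set-of-features encoding, the function name, and the inability to verify that the K differing features all belong to one term are all illustrative, not from the cited work.

```python
def are_adjacent(x, x_prime, k):
    """Term-level adjacency sketch: one data set equals the other with all
    K features of a single term removed, where K >= k.

    x, x_prime: sets of extracted features (hypothetical encoding).
    Note: whether the K missing features all come from one term cannot be
    checked from plain sets and is assumed here.
    """
    larger, smaller = (x, x_prime) if len(x) >= len(x_prime) else (x_prime, x)
    K = len(larger) - len(smaller)
    # The smaller set must be the larger set minus exactly K features.
    return K >= k and smaller <= larger
```

For example, with k = 2, removing both features associated with a term from a data set yields an adjacent data set, while removing only one does not.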
“…To generate human-readable text, Bo et al. (2021) employ an encoder-decoder model similar to ours, but without paraphrasing, and sample output words using (a two-set variant of) the exponential mechanism (McSherry and Talwar, 2007). Weggenmann et al. (2022) propose a differentially private variant of the variational autoencoder and use it as a sequence-to-sequence architecture for text anonymization.…”
Section: Related Work
confidence: 99%
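The word-sampling step described in this statement can be sketched with the standard exponential mechanism, which selects a candidate with probability proportional to exp(ε·u/2Δ). This is a generic illustration: the utility scores, sensitivity, and candidate vocabulary are placeholders, and it does not reproduce the two-set variant used by Bo et al.

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = [utility[c] for c in candidates]
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity)) for s in scores]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r < acc:
            return c
    return candidates[-1]
```

With large ε the highest-utility word dominates; with ε near zero the choice approaches uniform sampling, which is the privacy/utility trade-off the mechanism encodes.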
“…We furthermore use a domain-independent LDP mechanism designed specifically for VAEs, which we refer to as VAE-LDP. VAE-LDP by Weggenmann et al. [37] allows a data scientist to use a VAE as an LDP mechanism to perturb data. This is achieved by bounding the encoder's mean and adding noise to the encoder's standard deviation before sampling the latent code z during training.…”
Section: Differential Privacy
confidence: 99%