Abstract-We focus on the privacy-utility trade-off encountered by users who wish to disclose some information to an analyst, that is correlated with their private data, in the hope of receiving some utility. We rely on a general privacy statistical inference framework, under which data is transformed before it is disclosed, according to a probabilistic privacy mapping. We show that when the log-loss is introduced in this framework in both the privacy metric and the distortion metric, the privacy leakage and the utility constraint can be reduced to the mutual information between private data and disclosed data, and between non-private data and disclosed data respectively. We justify the relevance and generality of the privacy metric under the log-loss by proving that the inference threat under any bounded cost function can be upperbounded by an explicit function of the mutual information between private data and disclosed data. We then show that the privacy-utility tradeoff under the log-loss can be cast as the non-convex Privacy Funnel optimization, and we leverage its connection to the Information Bottleneck, to provide a greedy algorithm that is locally optimal. We evaluate its performance on the US census dataset.
We propose a practical methodology to protect a user's private data, when he wishes to publicly release data that is correlated with his private data, in the hope of getting some utility. Our approach relies on a general statistical inference framework that captures the privacy threat under inference attacks, given utility constraints. Under this framework, data is distorted before it is released, according to a privacy-preserving probabilistic mapping. This mapping is obtained by solving a convex optimization problem, which minimizes information leakage under a distortion constraint. We address practical challenges encountered when applying this theoretical framework to real world data. On one hand, the design of optimal privacypreserving mechanisms requires knowledge of the prior distribution linking private data and data to be released, which is often unavailable in practice. On the other hand, the optimization may become untractable and face scalability issues when data assumes values in large size alphabets, or is high dimensional. Our work makes three major contributions. First, we provide bounds on the impact on the privacy-utility tradeoff of a mismatched prior. Second, we show how to reduce the optimization size by introducing a quantization step, and how to generate privacy mappings under quantization. Third, we evaluate our method on three datasets, including a new dataset that we collected, showing correlations between political convictions and TV viewing habits. We demonstrate that good privacy properties can be achieved with limited distortion so as not to undermine the original purpose of the publicly released data, e.g. recommendations.
In September 2017, McAffee Labs quarterly report [2] estimated that brute force attacks represent 20% of total network attacks, making them the most prevalent type of attack ex-aequo with browser based vulnerabilities. These attacks have sometimes catastrophic consequences, and understanding their fundamental limits may play an important role in the risk assessment of password-secured systems, and in the design of better security protocols. While some solutions exist to prevent online brute-force attacks that arise from one single IP address, attacks performed by botnets are more challenging. In this paper, we analyze these distributed attacks by using a simplified model. Our aim is to understand the impact of distribution and asynchronization on the overall computational effort necessary to breach a system. Our result is based on Guesswork, a measure of the number of queries (guesses) required of an adversary before a correct sequence, such as a password, is found in an optimal attack. Guesswork is a direct surrogate for time and computational effort of guessing a sequence from a set of sequences with associated likelihoods. We model the lack of synchronization by a worst-case optimization in which the queries made by multiple adversarial agents are received in the worst possible order for the adversary, resulting in a min-max formulation. We show that, even without synchronization, and for sequences of growing length, the asymptotic optimal performance is achievable by using randomized guesses drawn from an appropriate distribution. Therefore, randomization is key for distributed asynchronous attacks. In other words, asynchronous guessers can asymptotically perform brute-force attacks as efficiently as synchronized guessers. I. INTRODUCTIONFrom online banking [3] and bitcoin wallets [4], to secure shell (SSH), file transfer protocol (ftp), and telnet servers [5], and passing by governmental institutions [6], brute-force attacks have shown to be one of the major threats to network security. Despite the computational burden on the attacker, brute-force attacks are prevalent. This can be explained through multiple points of view. First, passwords are often weaker than what they ought to be, meaning that attackers can hope to find the correct password well before they query a significant portion of the possible password strings. Next, attacks through huge networks of compromised computers (botnets) are now more common, giving access to significant computational resources for the attacker. More critically, these botnets help to disguise the attack by distributing it. Indeed, a main solution to the threat of online brute-force attacks is to setup a system that detects and prevents too many queries from any one user, as determined by IP addresses. As such, an attacker which uses only a single IP address would be limited to a fixed number of guesses. In recent years, however, this defense was circumvented by using massive botnets, each bot querying potential passwords. In this situation, it is hard to detect legitimat...
Given a pair of random variables (X, Y ) ∼ PXY and two convex functions f1 and f2, we introduce two bottleneck functionals as the lower and upper boundaries of the two-dimensional convex set that consists of the pairs, where I f denotes f -information and W varies over the set of all discrete random variables satisfying the Markov condition W → X → Y . Applying Witsenhausen and Wyner's approach, we provide an algorithm for computing boundaries of this set for f1, f2, and discrete PXY . In the binary symmetric case, we fully characterize the set when (i) f1(t) = f2(t) = t log t, (ii) f1(t) = f2(t) = t 2 − 1, and (iii) f1 and f2 are both ℓ β norm function for β ≥ 2. We then argue that upper and lower boundaries in (i) correspond to Mrs. Gerber's Lemma and its inverse (which we call Mr. Gerber's Lemma), in (ii) correspond to estimation-theoretic variants of Information Bottleneck and Privacy Funnel, and in (iii) correspond to Arimoto Information Bottleneck and Privacy Funnel.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.