Abstract:
Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to provide a means of obstructing such breaches by framing it as a multi-class classification problem. Our system gives users fine-grained control ove…
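The abstract frames redaction as a multi-class classification problem over text spans. As a minimal sketch of that framing (the training snippets, class labels, and scoring rule below are invented for illustration; the paper's actual features and model are not specified in this excerpt), one can score a span's context against per-class word statistics:

```python
from collections import Counter, defaultdict

# Toy labeled context snippets; labels are illustrative, not the
# paper's actual sensitivity taxonomy.
TRAIN = [
    ("patient diagnosed with diabetes", "medical"),
    ("account number ending 4821", "financial"),
    ("meeting scheduled for tuesday", "public"),
    ("prescribed insulin twice daily", "medical"),
    ("wire transfer to account", "financial"),
    ("lunch menu posted in hallway", "public"),
]

def train(examples):
    """Count word occurrences per class."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def classify(text, counts):
    """Pick the class whose training vocabulary best overlaps the
    input -- a crude stand-in for a learned multi-class model."""
    words = text.split()
    return max(counts, key=lambda lbl: sum(counts[lbl][w] for w in words))
```

Spans classified into a sensitive class would then be candidates for redaction, which is where the fine-grained user control described in the abstract would apply.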
“…Cumby and Ghani propose a sensitive data recognition technique based on machine learning that utilizes contextual semantic information to identify and detect sensitive content. 7 Chen et al propose a non-parametric Bayesian hidden Markov model based on a Dirichlet process for medical record de-identification. Without manual task-specific feature engineering, the model can perform as accurately as conditional random field (CRF) models in several categories.…”
Section: Data Identification and Desensitization
mentioning
confidence: 99%
Sensitive data identification is a prerequisite for protecting critical user and business data. Traditional methods usually target only a certain type of application scenario or a certain type of data, making it difficult to meet the needs of enterprise‐level data protection. This paper introduces the end‐to‐end sensitive data identification system of Beike Inc. The system consists of a data identification and annotation platform, a dataset management platform, and a sensitive data identification model, which apply different governance methods to batch data and streaming data respectively. Specifically, we propose a sliding‐window‐based identification method for long text to improve the identification of streaming data. Evaluation results show that this method improves the identification of sensitive data in long text without losing accuracy on short text; on the open‐source test dataset, the score reaches up to 94.15, so it is applicable in diverse scenarios.
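The sliding-window idea above can be sketched in outline. The window size, stride, and detector below are placeholders (the abstract does not state the actual model or parameters); the point of overlapping windows is that an entity falling near a chunk boundary still appears intact in at least one window:

```python
import re

def windows(tokens, size, stride):
    """Yield overlapping token windows covering the whole sequence,
    including the tail."""
    if len(tokens) <= size:
        yield 0, tokens
        return
    start = 0
    while True:
        yield start, tokens[start:start + size]
        if start + size >= len(tokens):
            break
        start += stride

def contains_sensitive(text):
    # Stand-in detector (hypothetical): flags 11-digit phone-like numbers.
    return re.search(r"\b\d{11}\b", text) is not None

def scan_long_text(text, size=16, stride=8):
    """Run the detector over each window and collect hit offsets."""
    tokens = text.split()
    return [start for start, win in windows(tokens, size, stride)
            if contains_sensitive(" ".join(win))]
```

With a stride smaller than the window size, every boundary region is seen by two windows, which is the property that lets long-text identification match short-text accuracy.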
“…The second type of text anonymization methods relies on privacy-preserving data publishing (PPDP). In contrast to NLP approaches, PPDP methods (Chakaravarthy et al 2008; Cumby and Ghani 2011; Anandan et al 2012; Batet 2016, 2017) operate with an explicit account of disclosure risk and anonymize documents by enforcing a privacy model. As a result, PPDP approaches are able to consider any term that may re-identify a certain entity to protect (a human subject or an organization), either individually for direct identifiers (such as the person's name or a passport) or in aggregate for quasi-identifiers (such as the combination of age, profession and postal code).…”
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected.
Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics specifically tailored to measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models, is available at: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
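The abstract pairs privacy protection with utility preservation. As a hedged illustration of that pairing (these are not TAB's actual metric definitions, which are given in the repository above), one can compute an all-or-nothing recall over gold sensitive spans alongside a preservation rate over harmless tokens:

```python
def recall_on_spans(gold_spans, masked_spans):
    """Fraction of gold sensitive spans fully covered by some system
    mask (all-or-nothing recall; spans are (start, end) offsets).
    A proxy for privacy protection."""
    covered = sum(
        1 for g in gold_spans
        if any(m[0] <= g[0] and g[1] <= m[1] for m in masked_spans)
    )
    return covered / len(gold_spans) if gold_spans else 1.0

def token_preservation(n_tokens, masked_token_ids, gold_token_ids):
    """Fraction of harmless (non-sensitive) tokens left unmasked.
    A crude proxy for utility preservation."""
    harmless = set(range(n_tokens)) - set(gold_token_ids)
    kept = harmless - set(masked_token_ids)
    return len(kept) / len(harmless) if harmless else 1.0
```

The tension the benchmark measures is visible here: masking everything drives span recall to 1.0 but token preservation to 0, and vice versa.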
“…Other works consider the problem of document sanitization and security [26, 38–40]. Researchers have developed methods for encoding cryptographic signature schemes into PDF content and analyzing text to find semantically similar content to content marked for redaction.…”
In the past, redaction involved the use of black or white markers or paper cut-outs to obscure content on physical paper. Today many redactions take place on digital PDF documents, and redaction is often performed by software tools. Typical redaction tools remove text from PDF documents and draw a black or white rectangle in its place, mimicking a physical redaction. This practice is thought to be secure when the redacted text is removed and cannot be "copy-pasted" from the PDF document. We find this common conception is false: existing PDF redactions can be broken by precise measurements of non-redacted character positioning information.

We develop a deredaction tool for automatically finding and breaking these vulnerable redactions. We report on 11 different redaction tools, finding the majority do not remove redaction-breaking information, including some Adobe Acrobat workflows. We empirically measure the information leaks, finding some redactions leak upwards of 15 bits of information, creating a 32,768-fold reduction in the space of potential redacted texts. We demonstrate a lower bound on the impact of these leaks via a 22,120-document study, including 18,975 Office of the Inspector General (OIG) investigation reports, where we find 769 vulnerable named-entity redactions. We find leaked information reduces the contents of 164 of these redacted names to fewer than 494 possibilities from a 7-million-name dictionary. We show the impact of these findings by breaking redactions from the Epstein/Maxwell case, the Manafort case, and a released Snowden document. Moreover, we develop an efficient algorithm for locating copy-pastable redactions and find over 100,000 poorly redacted words in US court documents. Current PDF text redaction methods are insufficient for named-entity protection.
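The dictionary-reduction attack described above can be sketched in miniature. The per-character widths below are invented toy metrics; a real attack reads glyph advance widths from the PDF's embedded fonts and measures the exact horizontal gap left by the redaction rectangle:

```python
# Hypothetical fixed per-character advance widths (arbitrary units);
# real font metrics come from the PDF itself.
WIDTHS = {c: 500 + (ord(c) % 13) * 10 for c in "abcdefghijklmnopqrstuvwxyz "}

def text_width(s):
    """Rendered width of a string under the toy font metrics."""
    return sum(WIDTHS[c] for c in s.lower())

def candidates(names, box_width, tol=5):
    """Keep only dictionary entries whose rendered width matches the
    measured redaction gap to within the measurement tolerance."""
    return [n for n in names if abs(text_width(n) - box_width) <= tol]
```

Because rendered widths are fine-grained, a single width measurement can shrink a multi-million-name dictionary to a handful of candidates, which is the effect the paper quantifies in bits of leaked information.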