Abstract:
Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to provide a means of obstructing such breaches by framing it as a multi-class classification problem. Our system gives users fine-grained control ove…
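The abstract frames redaction as a multi-class classification problem over text spans. As a minimal sketch of that framing (the training snippets, class labels, and scoring rule below are invented for illustration; the paper's actual features and model are not specified in this excerpt), one can score a span's context against per-class word statistics:

```python
from collections import Counter, defaultdict

# Toy labeled context snippets; labels are illustrative, not the
# paper's actual sensitivity taxonomy.
TRAIN = [
    ("patient diagnosed with diabetes", "medical"),
    ("account number ending 4821", "financial"),
    ("meeting scheduled for tuesday", "public"),
    ("prescribed insulin twice daily", "medical"),
    ("wire transfer to account", "financial"),
    ("lunch menu posted in hallway", "public"),
]

def train(examples):
    """Count word occurrences per class."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def classify(text, counts):
    """Pick the class whose training vocabulary best overlaps the
    input -- a crude stand-in for a learned multi-class model."""
    words = text.split()
    return max(counts, key=lambda lbl: sum(counts[lbl][w] for w in words))
```

Spans classified into a sensitive class would then be candidates for redaction, which is where the fine-grained user control described in the abstract would apply.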
“…Cumby and Ghani propose a sensitive data recognition technique based on machine learning that utilizes contextual semantic information to identify and detect sensitive content. 7 Chen et al propose a non-parametric Bayesian hidden Markov model based on a Dirichlet process for medical record de-identification. Without manual task-specific feature engineering, the model can perform as accurately as conditional random field (CRF) models in several categories.…”
Section: Data Identification and Desensitization
mentioning
confidence: 99%
Sensitive data identification is a prerequisite for protecting critical user and business data. Traditional methods usually target only a certain type of application scenario or a certain type of data, making it difficult to meet the needs of enterprise‐level data protection. This paper introduces the end‐to‐end sensitive data identification system of Beike Inc. The system consists of a data identification and annotation platform, a dataset management platform, and a sensitive data identification model, which apply different governance methods to batch data and streaming data respectively. Specifically, we propose a sliding‐window‐based identification method for long text to improve the identification of streaming data. Evaluation results show that this method improves the identification of sensitive data in long text without losing accuracy on short text; on the open‐source test dataset, the score reaches up to 94.15, so it is applicable in diverse scenarios.
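The sliding-window idea above can be sketched in outline. The window size, stride, and detector below are placeholders (the abstract does not state the actual model or parameters); the point of overlapping windows is that an entity falling near a chunk boundary still appears intact in at least one window:

```python
import re

def windows(tokens, size, stride):
    """Yield overlapping token windows covering the whole sequence,
    including the tail."""
    if len(tokens) <= size:
        yield 0, tokens
        return
    start = 0
    while True:
        yield start, tokens[start:start + size]
        if start + size >= len(tokens):
            break
        start += stride

def contains_sensitive(text):
    # Stand-in detector (hypothetical): flags 11-digit phone-like numbers.
    return re.search(r"\b\d{11}\b", text) is not None

def scan_long_text(text, size=16, stride=8):
    """Run the detector over each window and collect hit offsets."""
    tokens = text.split()
    return [start for start, win in windows(tokens, size, stride)
            if contains_sensitive(" ".join(win))]
```

With a stride smaller than the window size, every boundary region is seen by two windows, which is the property that lets long-text identification match short-text accuracy.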
“…The second type of text anonymization methods relies on privacy-preserving data publishing (PPDP). In contrast to NLP approaches, PPDP methods (Chakaravarthy et al 2008; Cumby and Ghani 2011; Anandan et al 2012; Batet 2016, 2017) operate with an explicit account of disclosure risk and anonymize documents by enforcing a privacy model. As a result, PPDP approaches are able to consider any term that may re-identify a certain entity to protect (a human subject or an organization), either individually for direct identifiers (such as the person's name or a passport) or in aggregate for quasi-identifiers (such as the combination of age, profession and postal code).…”
We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected.
Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics specifically tailored to measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus, along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models, is available at: https://github.com/NorskRegnesentral/text-anonymization-benchmark.
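The abstract pairs privacy protection with utility preservation. As a hedged illustration of that pairing (these are not TAB's actual metric definitions, which are given in the repository above), one can compute an all-or-nothing recall over gold sensitive spans alongside a preservation rate over harmless tokens:

```python
def recall_on_spans(gold_spans, masked_spans):
    """Fraction of gold sensitive spans fully covered by some system
    mask (all-or-nothing recall; spans are (start, end) offsets).
    A proxy for privacy protection."""
    covered = sum(
        1 for g in gold_spans
        if any(m[0] <= g[0] and g[1] <= m[1] for m in masked_spans)
    )
    return covered / len(gold_spans) if gold_spans else 1.0

def token_preservation(n_tokens, masked_token_ids, gold_token_ids):
    """Fraction of harmless (non-sensitive) tokens left unmasked.
    A crude proxy for utility preservation."""
    harmless = set(range(n_tokens)) - set(gold_token_ids)
    kept = harmless - set(masked_token_ids)
    return len(kept) / len(harmless) if harmless else 1.0
```

The tension the benchmark measures is visible here: masking everything drives span recall to 1.0 but token preservation to 0, and vice versa.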
“…Other works consider the problem of document sanitization and security [26, 38–40]. Researchers have developed methods for encoding cryptographic signature schemes into PDF content and analyzing text to find semantically similar content to content marked for redaction.…”
In the past, redaction involved the use of black or white markers or paper cut-outs to obscure content on physical paper. Today many redactions take place on digital PDF documents, and redaction is often performed by software tools. Typical redaction tools remove text from PDF documents and draw a black or white rectangle in its place, mimicking a physical redaction. This practice is thought to be secure when the redacted text is removed and cannot be "copy-pasted" from the PDF document. We find this common conception is false: existing PDF redactions can be broken by precise measurements of non-redacted character positioning information.

We develop a deredaction tool for automatically finding and breaking these vulnerable redactions. We report on 11 different redaction tools, finding the majority do not remove redaction-breaking information, including some Adobe Acrobat workflows. We empirically measure the information leaks, finding some redactions leak upwards of 15 bits of information, creating a 32,768-fold reduction in the space of potential redacted texts. We demonstrate a lower bound on the impact of these leaks via a 22,120-document study, including 18,975 Office of the Inspector General (OIG) investigation reports, where we find 769 vulnerable named-entity redactions. We find leaked information reduces the contents of 164 of these redacted names to fewer than 494 possibilities from a 7-million-name dictionary. We show the impact of these findings by breaking redactions from the Epstein/Maxwell case, the Manafort case, and a released Snowden document. Moreover, we develop an efficient algorithm for locating copy-pastable redactions and find over 100,000 poorly redacted words in US court documents. Current PDF text redaction methods are insufficient for named-entity protection.
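The dictionary-reduction attack described above can be sketched in miniature. The per-character widths below are invented toy metrics; a real attack reads glyph advance widths from the PDF's embedded fonts and measures the exact horizontal gap left by the redaction rectangle:

```python
# Hypothetical fixed per-character advance widths (arbitrary units);
# real font metrics come from the PDF itself.
WIDTHS = {c: 500 + (ord(c) % 13) * 10 for c in "abcdefghijklmnopqrstuvwxyz "}

def text_width(s):
    """Rendered width of a string under the toy font metrics."""
    return sum(WIDTHS[c] for c in s.lower())

def candidates(names, box_width, tol=5):
    """Keep only dictionary entries whose rendered width matches the
    measured redaction gap to within the measurement tolerance."""
    return [n for n in names if abs(text_width(n) - box_width) <= tol]
```

Because rendered widths are fine-grained, a single width measurement can shrink a multi-million-name dictionary to a handful of candidates, which is the effect the paper quantifies in bits of leaked information.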