Effects of personal identifier resynthesis on clinical text de-identification

Yeniterzi, Reyyan; Aberdeen, John S.; Bayer, Samuel; Wellner, Benjamin; Hirschman, Lynette; Malin, Bradley

doi:10.1136/jamia.2009.002212

Cited by 41 publications

(30 citation statements)

References 15 publications

“…There are variations in resynthesis processes used to replace PHI in corpora [3,11,12]. For PHI involving numerical values such as dates, ids and phone numbers, approaches are usually based on digit replacement strategies.…”

Section: Introduction (mentioning)

confidence: 99%

“…Approaches for names are less similar. Uzuner et al [3] focused on generating a majority of out-of-vocabulary names, while Yeniterzi et al [11] used names from a dictionary. In their work on Swedish clinical notes, Alfalahi et al [12] used names from dictionaries while also introducing some letter variations to allow for misspelled names, and kept the gender intact in first names.…”

Section: Introduction (mentioning)

confidence: 99%

“…Yeniterzi et al examined the effect and potential bias that a corpus with surrogate PHI can have on clinical text de-identification [11]. They built a corpus (not shared externally) composed of four classes of clinical narrative texts (laboratory reports, medication orders, discharge summaries and physician letters) and replaced the original PHI with synthetic PHI using the resynthesis engine of the MITRE Identification Scrubber Toolkit [13].…”

Section: Introduction (mentioning)

confidence: 99%

Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research

Deléger, Lingren, et al. 2014

Journal of Biomedical Informatics

Objective: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus with Data Use Agreement we would like to encourage other Computational Linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; (3) and the tests to show the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development.

Material and Methods: In a previous study we built an original de-identification gold standard corpus annotated with true Protected Health Information (PHI) from 3,503 randomly selected clinical notes for the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. Finally, we evaluated the research value of this new dataset by comparing the performance of an existing published in-house de-identification system, when trained on the new de-identification gold standard corpus, with the performance of the same system, when trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.

Results: Performance of the de-identification system using the new gold standard corpus as a training set was very close to training on the original corpus (92.56 vs. 93.48 overall F-measures). Best i2b2/PhysioNet/CCHMC cross-training performances were obtained when training on the new shared CCHMC gold standard corpus, although performances were still lower than corpus-specific trainings.

Discussion and conclusion: We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus with limited drop in machine learning de-identification performance.

“…Yeniterzi et al [19] evaluated the effectiveness and examined possible bias introduced by re-synthesis on de-identification software. The main research motivation was that real medical records for the development and evaluation of de-identification software are hardly available.…”

Section: Related Work (mentioning)

confidence: 99%

De-identification of unstructured paper-based health records for privacy-preserving secondary use

Fenz, Heurix, Neubauer, et al. 2014

Journal of Medical Engineering & Technology

Whenever personal data is processed, privacy is a serious issue. Especially in the document-centric e-health area, the patients' privacy must be preserved in order to prevent any negative repercussions for the patient. Clinical research, for example, demands structured health records to carry out efficient clinical trials, whereas legislation (e.g. HIPAA) regulates that only de-identified health records may be used for research. However, unstructured and often paper-based data dominates information technology, especially in the healthcare sector. Existing approaches are geared towards data in English-language documents only and have not been designed to handle the recognition of erroneous personal data which is the result of the OCR-based digitization of paper-based health records.

“…[5], [22]). In [20], pseudonymization is achieved by first separating the identification data from the anamnesis data which is then stored in a separate database referenced with so called unique data identification codes (DIC) as pseudonyms.…”

Section: Pseudonymization (mentioning)

confidence: 99%

A Hybrid Approach Integrating Encryption and Pseudonymization for Protecting Electronic Health Records

Heurix, Karlinger, Schrefl, et al. 2010

Biomedical Engineering

Federated Health Information Systems (FHIS) integrate autonomous information systems of participating health care providers to facilitate the exchange of Electronic Health Records (EHR), which improve the quality and efficiency of patients' care. However, the main problem with collecting and maintaining the sensitive data in electronic form is the issue of preserving data confidentiality and patients' privacy. Although multiple technical measures to restrict access to only authorized persons are implemented, they are usually aimed against external attackers. In this work, we propose to integrate pseudonymization and encryption to a hybrid approach which not only protects against external attackers, but also ensures that even potential internal attackers with full data access, like administrators, cannot gain any useful information.
