Abstract: The de-identification tool achieves high accuracy when the training and test sets are homogeneous (i.e., both real or both resynthesized records). The resynthesis component regularizes the data, making them less "realistic," which results in a loss of performance, particularly when training on resynthesized data and testing on real data.
“…There are variations in resynthesis processes used to replace PHI in corpora [3,11,12]. For PHI involving numerical values such as dates, ids and phone numbers, approaches are usually based on digit replacement strategies.…”
Section: Introduction (citation type: mentioning)
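The digit replacement strategy mentioned in the statement above can be illustrated with a minimal Python sketch. This is not the method of any of the cited works: the function name `replace_digits` and the sample strings are hypothetical, and real systems typically add constraints (for example, keeping dates valid or shifting them consistently) that are not shown here.

```python
import random

ASCII_DIGITS = "0123456789"

def replace_digits(text, rng):
    """Replace each ASCII digit with a random digit; keep every other character.

    Punctuation, spacing and length are preserved, so the surrogate keeps
    the surface format of the original phone number or identifier.
    """
    return "".join(
        rng.choice(ASCII_DIGITS) if ch in ASCII_DIGITS else ch
        for ch in text
    )

# Hypothetical inputs, for illustration only.
rng = random.Random(0)
surrogate_phone = replace_digits("(513) 555-0142", rng)
surrogate_id = replace_digits("MRN 00482913", rng)
```

Because only digits change, a pattern-based recognizer still sees the same layout, which is one reason numerical PHI is usually considered the easier case for resynthesis.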
“…Approaches for names are less similar. Uzuner et al [3] focused on generating a majority of out-of-vocabulary names, while Yeniterzi et al [11] used names from a dictionary. In their work on Swedish clinical notes, Alfalahi et al [12] used names from dictionaries while also introducing some letter variations to allow for misspelled names, and kept the gender intact in first names.…”
Section: Introduction (citation type: mentioning)
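A dictionary-based name replacement that keeps gender intact and occasionally introduces a letter variation, in the spirit of the approach attributed to Alfalahi et al in the statement above, might look like the following sketch. The tiny name lists, the `typo_rate` parameter and the choice of an adjacent-letter swap as the "letter variation" are all assumptions made for illustration; the statement does not say how the variations were actually produced.

```python
import random

# Hypothetical, tiny gendered dictionaries; a real system would use large lists.
FIRST_NAMES = {
    "F": ["Anna", "Maria", "Karin"],
    "M": ["Erik", "Lars", "Johan"],
}

def surrogate_first_name(gender, rng, typo_rate=0.1):
    """Pick a dictionary name of the same gender, sometimes with a letter swap.

    With probability `typo_rate`, two adjacent letters after the first one
    are swapped to imitate a misspelled name.
    """
    name = rng.choice(FIRST_NAMES[gender])
    if len(name) > 2 and rng.random() < typo_rate:
        i = rng.randrange(1, len(name) - 1)  # never moves the first letter
        name = name[:i] + name[i + 1] + name[i] + name[i + 2:]
    return name

rng = random.Random(0)
clean_name = surrogate_first_name("F", rng, typo_rate=0.0)
noisy_name = surrogate_first_name("M", rng, typo_rate=1.0)
```

Setting `typo_rate=0.0` reduces this to plain dictionary lookup, which corresponds to the simpler strategy the statement attributes to Yeniterzi et al.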
“…Yeniterzi et al examined the effect and potential bias that a corpus with surrogate PHI can have on clinical text de-identification [11]. They built a corpus (not shared externally) composed of four classes of clinical narrative texts (laboratory reports, medication orders, discharge summaries and physician letters) and replaced the original PHI with synthetic PHI using the resynthesis engine of the MITRE Identification Scrubber Toolkit [13].…”
Objective
The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we would like to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests that show the effectiveness of these efforts in preserving the value of the modified dataset for machine learning model development.
Material and Methods
In a previous study, we built an original de-identification gold standard corpus annotated with true Protected Health Information (PHI) from 3,503 randomly selected clinical notes covering the 22 most frequent clinical note types at our institution. In the current study, we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. We then evaluated the research value of this new dataset by comparing the performance of an existing, published in-house de-identification system when trained on the new gold standard corpus with the performance of the same system when trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets, which were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.
Results
Performance of the de-identification system when trained on the new gold standard corpus was very close to its performance when trained on the original corpus (overall F-measures of 92.56 vs. 93.48). The best i2b2/PhysioNet/CCHMC cross-training performance was obtained when training on the new shared CCHMC gold standard corpus, although it was still lower than that of corpus-specific training.
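For readers unfamiliar with the metric, the sketch below shows how an F-measure is commonly computed from counts of correctly found PHI (true positives), spurious detections (false positives) and missed PHI (false negatives). It assumes the balanced F1 variant, which the abstract does not state explicitly, and the counts used are invented for illustration; they are not the study's data.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from raw counts.

    Assumes the balanced F-measure (F1), the harmonic mean of
    precision and recall. Returns 0.0 for any undefined ratio.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts, for illustration only.
precision, recall, f1 = f_measure(true_pos=90, false_pos=10, false_neg=10)
```

Abstracts in this area typically report the F-measure scaled to a percentage, so a value of 0.9256 would appear as 92.56.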
Discussion and conclusion
We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus, with only a limited drop in machine learning de-identification performance.
“…Yeniterzi et al [19] evaluated the effectiveness and examined possible bias introduced by re-synthesis on de-identification software. The main research motivation was that real medical records for the development and evaluation of de-identification software are hardly available.…”
Whenever personal data is processed, privacy is a serious issue. Especially in the document-centric e-health area, the patients' privacy must be preserved in order to prevent any negative repercussions for the patient. Clinical research, for example, demands structured health records to carry out efficient clinical trials, whereas legislation (e.g., HIPAA) requires that only de-identified health records be used for research. However, unstructured and often paper-based data dominates information technology, especially in the healthcare sector. Existing approaches are geared towards English-language documents only and have not been designed to recognize erroneous personal data resulting from the OCR-based digitization of paper-based health records.
“…[5], [22]). In [20], pseudonymization is achieved by first separating the identification data from the anamnesis data which is then stored in a separate database referenced with so called unique data identification codes (DIC) as pseudonyms.…”
Federated Health Information Systems (FHIS) integrate the autonomous information systems of participating health care providers to facilitate the exchange of Electronic Health Records (EHR), which improves the quality and efficiency of patient care. However, the main problem with collecting and maintaining sensitive data in electronic form is preserving data confidentiality and patients' privacy. Although multiple technical measures that restrict access to authorized persons are implemented, they are usually aimed at external attackers. In this work, we propose to integrate pseudonymization and encryption into a hybrid approach that not only protects against external attackers, but also ensures that even potential internal attackers with full data access, such as administrators, cannot gain any useful information.