De-identification of health records using Anonym: Effectiveness and robustness across datasets

Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

doi:10.1016/j.artmed.2014.03.006

Cited by 22 publications

(10 citation statements)

References 14 publications

“…Some are more generalizable than others, and certain methods perform better with some types of PHI than others [71,72]. Recent examples such as MIST [73], BoB [74], Anonym [75], and several systems developed for the i2b2 NLP challenges [76,77], allow for good accuracy and very limited impact on clinical information. [78] Replacing PHI with realistic surrogates [79] and adding biomedical scientific literature text [80] allowed for improved performance.…”

Section: Definitions (mentioning)

confidence: 99%

Clinical Data Reuse or Secondary Use: Current Status and Potential Future Progress

Meystre, Lovis, Bürkle et al. 2017. Yearb Med Inform

Summary. Objective: To perform a review of recent research in clinical data reuse or secondary use, and envision future advances in this field. Methods: The review is based on a large literature search in MEDLINE (through PubMed), conference proceedings, and the ACM Digital Library, focusing only on research published between 2005 and early 2016. Each selected publication was reviewed by the authors, and a structured analysis and summarization of its content was developed. Results: The initial search produced 359 publications, reduced after a manual examination of abstracts and full publications. The following aspects of clinical data reuse are discussed: motivations and challenges, privacy and ethical concerns, data integration and interoperability, data models and terminologies, unstructured data reuse, structured data mining, clinical practice and research integration, and examples of clinical data reuse (quality measurement and learning healthcare systems). Conclusion: Reuse of clinical data is a fast-growing field recognized as essential to realize the potentials for high quality healthcare, improved healthcare management, reduced healthcare costs, population health management, and effective clinical research.

“…However, most existing studies on de-identification of clinical text were conducted in a single-institute setting, where the training data and test data were from the same institution. Up until now, there is limited study to explore automated de-identification of clinical notes under cross-institute settings [11–13].…”

Section: Introduction (mentioning)

confidence: 99%

A study of deep learning methods for de-identification of clinical notes in cross-institute settings

Yang, Lyu et al. 2019. BMC Med Inform Decis Mak

Background: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great efforts in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often utilized training and test data collected from the same institution. There are few studies to explore automated de-identification under cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions. Methods: We created a de-identification corpus using a total of 500 clinical notes from the University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated the performance using the UF corpus. We compared five different word embeddings trained from general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies to customize the deep learning models using UF notes and resources. Results: Pre-trained word embeddings using a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features could further improve the performance of de-identification in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively. Conclusions: It is necessary to customize de-identification models using local clinical text and other resources when applied in cross-institute settings. Fine-tuning is a potential solution to re-use pre-trained parameters and reduce the training time to customize deep learning-based de-identification models trained using a clinical corpus from a different institution.

“…Even more striking differences may be noticed with respect to definitions of de-identification provided by NIST [7] and Zuccona et al [36]. Similar issues exist for definitions of pseudonymisation in the GDPR [33] and by NIST [7].…”

Section: Terminology and Concepts (mentioning)

confidence: 97%

A Data Utility-Driven Benchmark for De-identification Methods

Tomashchuk, Landuyt, Pletea et al. 2019. Trust, Privacy and Security in Digital Business

De-identification is the process of removing the associations between data and identifying elements of individual data subjects. Its main purpose is to allow use of data while preserving the privacy of individual data subjects. It is thus an enabler for compliance with legal regulations such as the EU's General Data Protection Regulation. While many de-identification methods exist, the required knowledge regarding technical implications of different de-identification methods is largely missing. In this paper, we present a data utility-driven benchmark for different de-identification methods. The proposed solution systematically compares de-identification methods while considering their nature, context and de-identified data set goal in order to provide a combination of methods that satisfies privacy requirements while minimizing losses of data utility. The benchmark is validated in a prototype implementation which is applied to a real life data set.
