De-identification of medical records using conditional random fields and long short-term memory networks

Jiang, Zhipeng; Zhao, Chao; He, Bin; Guan, Yi; Jiang, Junqiu

doi:10.1016/j.jbi.2017.10.003

Cited by 29 publications

(21 citation statements)

References 29 publications

(47 reference statements)

“…To incorporate features from local vocabulary, we utilized a feature embedding layer to incorporate linguistic and knowledge-based features with character and word embeddings [25]. We extracted two most important linguistic features, part-of-speech and word shape, according to previous works [27, 30, 31]. Knowledge-based features are derived from local vocabulary, which is different from the word embeddings that derived from unlabeled clinical text.…”

Section: Methods (mentioning)

confidence: 99%
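
The word-shape feature named in the citation statement above can be illustrated with a minimal sketch. The mapping used here (uppercase to `A`, lowercase to `a`, digit to `9`, other characters kept, with optional collapsing of repeated symbols) is a common convention and is an assumption; the cited works may define the feature differently. The function name `word_shape` is hypothetical.

```python
import re


def word_shape(token: str, collapse: bool = False) -> str:
    """Map a token to a coarse shape string (assumed convention, not the cited papers')."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("A")   # uppercase letter
        elif ch.islower():
            shape.append("a")   # lowercase letter
        elif ch.isdigit():
            shape.append("9")   # digit
        else:
            shape.append(ch)    # punctuation and other characters are kept as-is
    result = "".join(shape)
    if collapse:
        # Collapse runs of the same symbol, e.g. "Aaaaaa" becomes "Aa".
        result = re.sub(r"(.)\1+", r"\1", result)
    return result
```

Shapes like these help a tagger recognise patterned PHI such as dates or record numbers even when the exact token was never seen in training.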

“…Our previous study [25] has proved that the knowledge-based feature embedding layer improved the performance of clinical NER by integrating knowledge features with word embeddings. Chen et al [27] and Jiang et al [30] both showed that the knowledge-based features as complimentary resources to word embeddings improved the performance of identifying PHIs.…”

Section: Methods (mentioning)

confidence: 99%

A study of deep learning methods for de-identification of clinical notes in cross-institute settings

Yang, Lyu, et al. 2019. BMC Med Inform Decis Mak

Background: De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great efforts in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often utilized training and test data collected from the same institution. There are few studies to explore automated de-identification under cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods at a cross-institute setting, identify the bottlenecks, and provide potential solutions. Methods: We created a de-identification corpus using a total 500 clinical notes from the University of Florida (UF) Health, developed deep learning-based de-identification models using 2014 i2b2/UTHealth corpus, and evaluated the performance using UF corpus. We compared five different word embeddings trained from the general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies to customize the deep learning models using UF notes and resources. Results: Pre-trained word embeddings using a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only i2b2 corpus significantly dropped (strict and relax F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features could further improve the performance of de-identification in cross-institute settings. After customizing the models using UF notes and resource, the best model achieved the strict and relaxed F1 scores of 0.9288 and 0.9584, respectively. Conclusions: It is necessary to customize de-identification models using local clinical text and other resources when applied in cross-institute settings. Fine-tuning is a potential solution to re-use pre-trained parameters and reduce the training time to customize deep learning-based de-identification models trained using clinical corpus from a different institution.
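
The abstract above reports both strict and relaxed F1 scores. As a minimal sketch of how the two can differ, the code below assumes that a strict match requires identical offsets and PHI type, while a relaxed match only requires the same type and overlapping offsets. These definitions, the `Span` layout, and the function names are assumptions for illustration; the study's exact evaluation criteria may differ.

```python
from typing import List, Tuple

# (start offset, end offset exclusive, PHI type); hypothetical layout
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def _f1(tp_pred: int, tp_gold: int, n_pred: int, n_gold: int) -> float:
    precision = tp_pred / n_pred if n_pred else 0.0
    recall = tp_gold / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def strict_and_relaxed_f1(gold: List[Span], pred: List[Span]) -> Tuple[float, float]:
    """Return (strict F1, relaxed F1); assumes no duplicate spans in either list."""
    # Strict: predicted span must equal a gold span exactly.
    strict_tp = len(set(gold) & set(pred))
    strict = _f1(strict_tp, strict_tp, len(pred), len(gold))
    # Relaxed: any same-type overlap counts as a hit.
    hit_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
    hit_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    relaxed = _f1(hit_pred, hit_gold, len(pred), len(gold))
    return strict, relaxed
```

For example, with gold spans `[(0, 5, "NAME"), (10, 20, "DATE")]` and predictions `[(0, 5, "NAME"), (12, 20, "DATE")]`, the second prediction misses the start boundary, so under these assumed definitions the strict F1 is 0.5 while the relaxed F1 is 1.0.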

“…An estimated 80% of all data in EHRs reside in clinical notes [22,23] and are a rich source of data, but their unstructured format makes them complex and difficult to de-identify. Recent methods for identification of the clinical notes have achieved above 90% in accuracy and F1 scores [24][25][26]. However, this does not constitute as fully PHI-free data and poses a barrier for health systems to share data legally.…”

Section: Discussion (mentioning)

confidence: 99%

Publicly available machine learning models for identifying opioid misuse from the clinical notes of hospitalized patients

Sharma, Dligach, Swope, et al. 2020. BMC Med Inform Decis Mak

Background: Automated de-identification methods for removing protected health information (PHI) from the source notes of the electronic health record (EHR) rely on building systems to recognize mentions of PHI in text, but they remain inadequate at ensuring perfect PHI removal. As an alternative to relying on de-identification systems, we propose the following solutions: (1) Mapping the corpus of documents to standardized medical vocabulary (concept unique identifier [CUI] codes mapped from the Unified Medical Language System) thus eliminating PHI as inputs to a machine learning model; and (2) training character-based machine learning models that obviate the need for a dictionary containing input words/n-grams. We aim to test the performance of models with and without PHI in a use-case for an opioid misuse classifier. Methods: An observational cohort sampled from adult hospital inpatient encounters at a health system between 2007 and 2017. A case-control stratified sampling (n = 1000) was performed to build an annotated dataset for a reference standard of cases and non-cases of opioid misuse. Models for training and testing included CUI codes, character-based, and n-gram features. Models applied were machine learning with neural network and logistic regression as well as expert consensus with a rule-based model for opioid misuse. The area under the receiver operating characteristic curves (AUROC) were compared between models for discrimination. The Hosmer-Lemeshow test and visual plots measured model fit and calibration. Results: Machine learning models with CUI codes performed similarly to n-gram models with PHI. The top performing models with AUROCs > 0.90 included CUI codes as inputs to a convolutional neural network, max pooling network, and logistic regression model. The top calibrated models with the best model fit were the CUI-based convolutional neural network and max pooling network. The top weighted CUI codes in logistic regression has the related terms 'Heroin' and 'Victim of abuse'.

“…De-identification system Machine learning S1 (Zhao, Zhang, Ma, and Li (2018)), S2 (Chen, Cullen, and Godwin (2015)) S3 (Dernoncourt, Lee, Uzuner, and Szolovits (2017)), S4 (Yadav, Ekbal, Saha, Pathak, and Bhattacharyya (2017)), S5 ), S6 ) Hybrid S7 (Yang and Garibaldi (2015)) S8 (Liu, Tang, Wang, and Chen (2017)) S9 (Lee, Dernoncourt, Uzuner, and Szolovits (2016)) S10 (Dehghan, Kovacevic, Karystianis, Keane, and Nenadic (2015)) S11 (Yang and Garibaldi (2015)) S12 (He, Guan, Cheng, Cen, and Hua (2015)) S13 (Liu, Chen, Tang, Wang, Chen, Li, Wang, Deng, and Zhu (2015)) S14 (Phuong and Chau (2016)) S15 (Bui, Wyatt, and Cimino (2017a)) S16 (Jiang, Zhao, He, Guan, and Jiang (2017)) S17 (Lee, Wu, Zhang, Xu, Xu, and Roberts (2017)) S18 (Shweta, Kumar, Ekbal, Saha, and Bhattacharyya (2016)) In this section, we outline the most significant achievement of automating end-to-end de-identification system: improving accuracy. It has been argued that as far as de-identification is concerned, perfection cannot be achieved; however, 95% accuracy is considered to be the rule of thumb and universally accepted value ; ).…”

Section: Architecture (mentioning)

confidence: 99%

A review of Automatic end-to-end De-Identification: Is High Accuracy the Only Metric?

Yogarajan, Pfahringer, Mayo. 2020. Applied Artificial Intelligence

De-identification of electronic health records (EHR) is a vital step towards advancing health informatics research and maximising the use of available data. It is a two-step process where step one is the identification of protected health information (PHI), and step two is replacing such PHI with surrogates. Despite the recent advances in automatic de-identification of EHR, significant obstacles remain if the abundant health data available are to be used to the full potential. Accuracy in de-identification could be considered a necessary, but not sufficient condition for the use of EHR without individual patient consent. We present here a comprehensive review of the progress to date, both the impressive successes in achieving high accuracy and the significant risks and challenges that remain. To best of our knowledge, this is the first paper to present a complete picture of end-to-end automatic de-identification. We review 18 recently published automatic de-identification systems, designed to de-identify EHR in the form of free text, to show the advancements made in improving the overall accuracy of the system, and in identifying individual PHI. We argue that despite the improvements in accuracy there remain challenges in surrogate generation and replacements of identified PHIs, and the risks posed to patient protection and privacy.
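
The second step of the two-step process described above, replacing identified PHI with surrogates, can be sketched as follows. This is a minimal illustration that assumes step one has already produced non-overlapping character spans; the `Span` layout, the `replace_phi` function, and the surrogate table are all hypothetical and are not taken from any of the reviewed systems.

```python
from typing import Dict, List, Tuple

# (start offset, end offset exclusive, PHI type); hypothetical layout
Span = Tuple[int, int, str]


def replace_phi(text: str, spans: List[Span], surrogates: Dict[str, str]) -> str:
    """Replace each PHI span with a surrogate, or a [TYPE] tag if none is given.

    Assumes spans do not overlap. Spans are processed from right to left so
    that earlier offsets stay valid after each substitution.
    """
    out = text
    for start, end, phi_type in sorted(spans, reverse=True):
        replacement = surrogates.get(phi_type, f"[{phi_type}]")
        out = out[:start] + replacement + out[end:]
    return out
```

A fixed surrogate table like this one is the simplest possible choice; realistic surrogate generation, which the review identifies as an open challenge, would need to keep replacements consistent and plausible across a record.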
