2019
DOI: 10.1186/s12911-019-0935-4
A study of deep learning methods for de-identification of clinical notes in cross-institute settings

Abstract: Background: De-identification is a critical technology for enabling the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for the de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often utilized training and test data coll…

Cited by 55 publications (40 citation statements)
References 25 publications
“…The model utilizes a CRF layer to decode the LSTM hidden states into BIO tags. We screened four different word embeddings, following a procedure similar to that reported in our previous study [46], and found that the Common Crawl embeddings—released by Facebook and trained with fastText on the Common Crawl data set [47]—achieved better performance than the other embeddings on a validation data set. Thus, we used the Common Crawl embeddings for all LSTM-CRFs models.…”
Section: Methods
confidence: 99%
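The BIO tagging scheme mentioned above can be sketched with a minimal helper that converts token-level PHI spans into BIO tags. The `to_bio` function, the example tokens, and the PHI labels here are illustrative assumptions, not the paper's actual annotation pipeline:

```python
def to_bio(tokens, spans):
    """Convert token-index PHI spans to BIO tags.

    tokens: list of token strings
    spans:  list of (start_idx, end_idx_exclusive, label) over token indices
    """
    tags = ["O"] * len(tokens)  # default: outside any PHI entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # B- marks the entity's first token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # I- marks continuation tokens
    return tags

tokens = ["John", "Smith", "visited", "UF", "Health", "on", "May", "3"]
spans = [(0, 2, "NAME"), (3, 5, "HOSPITAL"), (6, 8, "DATE")]
print(to_bio(tokens, spans))
# ['B-NAME', 'I-NAME', 'O', 'B-HOSPITAL', 'I-HOSPITAL', 'O', 'B-DATE', 'I-DATE']
```

In the cited model, a CRF layer scores whole tag sequences so that invalid transitions (e.g., `O` directly followed by `I-NAME`) are discouraged, which is why decoding goes through the CRF rather than taking per-token argmax.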
“…We compared two training strategies: fine-tuning and training from scratch. For the fine-tuning approach, a deep learning model was first pre-trained on the de-identification dataset curated in the 2014 i2b2 challenge [25] to produce a base checkpoint. We then continuously fine-tuned this checkpoint (i.e., initialized new models with the weights from this checkpoint and used the same model settings) on the local UF datasets (i.e., different numbers of notes) developed in this study.…”
Section: Models and Training Strategies
confidence: 99%
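The warm-start idea behind fine-tuning can be sketched on a toy one-dimensional model: train on a "pre-training" corpus, then continue from that checkpoint on a nearby "local" task, versus starting from scratch. The `train` function and the synthetic datasets are illustrative stand-ins, not the paper's LSTM-CRF setup:

```python
def train(w, data, epochs=50, lr=0.1):
    # Plain SGD on a 1-D linear model y = w * x; a stand-in for
    # continuing optimization from a given weight initialization.
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

# "Pre-training" corpus (analogous to the 2014 i2b2 data): y = 3.0 * x
pretrain_data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5)]
# "Local" corpus (analogous to the UF notes): y = 3.2 * x, a nearby task
local_data = [(x, 3.2 * x) for x in (0.5, 1.0, 1.5)]

checkpoint = train(0.0, pretrain_data)                 # base checkpoint (w ~ 3.0)
fine_tuned = train(checkpoint, local_data, epochs=5)   # warm start from checkpoint
from_scratch = train(0.0, local_data, epochs=5)        # cold start, same budget
```

With an equal fine-tuning budget, the warm-started model ends closer to the local optimum than the cold-started one, which is the intuition behind reusing the i2b2-trained checkpoint when only a small number of local notes is available.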
“…We adopted the LSTM-CRFs model developed in our previous works [25,28] using TensorFlow [29]. We trained models using the short-training sets and selected the optimized model checkpoints according to the performances on the validation sets.…”
Section: Experiments and Evaluation
confidence: 99%
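Selecting the optimized checkpoint by validation performance, as described above, amounts to keeping the saved model with the best held-out score. A minimal sketch follows; the `select_best_checkpoint` helper and the example scores are hypothetical, not the authors' code:

```python
def select_best_checkpoint(checkpoints, validate):
    """Return the (name, score) of the checkpoint with the highest validation score.

    checkpoints: iterable of (name, model) pairs saved during training
    validate:    callable scoring a model on the held-out validation set
    """
    best_name, best_score = None, float("-inf")
    for name, model in checkpoints:
        score = validate(model)
        if score > best_score:          # keep the best-scoring checkpoint so far
            best_name, best_score = name, score
    return best_name, best_score

# Illustrative validation scores standing in for entity-level F1 per checkpoint
scores = {"epoch-10": 0.91, "epoch-20": 0.945, "epoch-30": 0.938}
best = select_best_checkpoint(scores.items(), lambda f1: f1)
print(best)  # ('epoch-20', 0.945)
```

In practice the `validate` callable would run the tagger over the validation notes and compute F1, but the selection logic is the same.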
“…De-identification of clinical notes is one of the most crucial prerequisites for utilizing clinical notes in other downstream biomedical informatics studies. Yang et al [5] explored de-identification in cross-institute settings using deep learning-based approaches: fine-tuning and pre-training. They pre-trained de-identification models, LSTM-CRF, on the University of Florida (UF) Health corpus and fine-tuned the models on i2b2 datasets.…”
Section: Topics
confidence: 99%