2022
DOI: 10.3233/shti220153

Protected Health Information Recognition of Unstructured Code-Mixed Electronic Health Records in Taiwan

Abstract: Electronic health records (EHRs) at medical institutions are valuable sources for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, protected health information (PHI) mentioned in the unstructured text must be removed. In Taiwan's EHR systems, unstructured text is usually written in a mix of English and Chinese, which poses challenges for de-identification. This paper presents the first study, to the best of our k…

Cited by 5 publications (6 citation statements). References 0 publications.
“…Neural networks are advantageous because they can be initialized with PLMs acquired from extensive unlabeled data, resulting in faster optimization and superior performance. BERT (Bidirectional Encoder Representations from Transformers) pretrained on English corpora (EN-BERT) [28] is one such example of a monolingual transformer model pretrained on the BookCorpus [29] and English Wikipedia in a self-supervised fashion, which has achieved exceptional precision in various NLP tasks including the deidentification task [8,30].…”
Section: Deidentification Methods and Approaches for Tackling Code-Mi…
confidence: 99%
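The statement above describes fine-tuning a monolingual English BERT for deidentification, which is a token-classification (NER-style) task. A minimal sketch using the Hugging Face transformers API is shown below; the bert-base-uncased checkpoint, the PHI label set, and the example sentence are illustrative assumptions, not the cited authors' actual configuration.

```python
# Minimal sketch: NER-style PHI tagging with a pretrained English BERT.
# The label set and example text are hypothetical, not the authors' schema.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]  # hypothetical PHI tags

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

text = "Patient John Smith was admitted on 2021-03-05."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Before fine-tuning, these predictions are meaningless; in practice the
# classification head would be trained on annotated discharge summaries.
predictions = logits.argmax(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0))
for token, pred in zip(tokens, predictions):
    print(f"{token}\t{labels[int(pred)]}")
```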
“…1. Unique code-mixed deidentification data set: We significantly extended our original corpus compiled in our previous work [8] by incorporating an additional 900 discharge summaries. Furthermore, we created a manipulated subset of the resynthesized data set available for research purposes.…”
Section: Goal of This Study
confidence: 99%
“…Supervised machine learning methods can learn complex structures from a training data set and apply that knowledge to predict outcomes for situations that have not been observed [14]. Artificial neural networks are a form of supervised machine learning [15]. A network simulates the structure and function of the nervous system, acquiring knowledge by detecting patterns and relationships in data and learning from experience [11].…”
Section: Introduction
confidence: 99%
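The statement above summarizes supervised learning with artificial neural networks: a model learns patterns from labeled training data and applies them to predict labels for unseen inputs. A minimal sketch with scikit-learn is given below; the synthetic data set and network size are illustrative assumptions.

```python
# Minimal sketch of supervised learning with a small neural network:
# learn patterns from labeled training data, then predict on unseen data.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic labeled data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A feed-forward ANN learns relationships between features and labels.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

# Apply the learned knowledge to predict outcomes for unobserved examples.
print("Held-out accuracy:", clf.score(X_test, y_test))
```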