2022
DOI: 10.1109/access.2022.3148396

Principle-Based Approach for the De-Identification of Code-Mixed Electronic Health Records

Abstract: Code-mixing is a phenomenon in which at least two languages are combined in a hybrid way within a single conversation. The use of mixed language is widespread in multilingual and multicultural countries and poses significant challenges for the development of automated language-processing tools. In Taiwan's electronic health record (EHR) systems, unstructured EHR texts are usually written in a mix of English and Chinese, resulting in difficulty for de-identification and synthetizati…
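To make the de-identification challenge concrete, here is a minimal rule-based sketch over an invented code-mixed note; the regular-expression patterns, PHI categories, and example sentence are illustrative assumptions, not the paper's actual principle set.

```python
import re

# Hypothetical PHI patterns for a Chinese-English code-mixed note;
# a real system needs far richer rules or a trained sequence labeler.
PHI_PATTERNS = {
    "DATE": r"\d{4}[-/]\d{1,2}[-/]\d{1,2}",            # e.g. 2021/03/15
    "ID":   r"\b[A-Z]{2}\d{6,}\b",                     # e.g. chart number MR123456
    "NAME": r"(?:病人|患者)[\u4e00-\u9fff]{2,3}",       # Chinese name after "patient"
}

def deidentify(text: str) -> str:
    """Replace each matched PHI span with its category tag."""
    for label, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"<{label}>", text)
    return text

# A code-mixed example: English clinical terms embedded in a Chinese sentence.
note = "病人王小明 was admitted on 2021/03/15, chart no. MR123456, for CABG."
print(deidentify(note))
# -> "<NAME> was admitted on <DATE>, chart no. <ID>, for CABG."
```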

Citations: cited by 5 publications (2 citation statements)
References: 16 publications
“…For example, Tang et al [15] fine-tuned M-BERT on their English-simplified Chinese social media data set for multi-label sentiment analysis, and obtained an F-score of 0.69, which was 15% higher than that for the model without M-BERT. While limited research exists on using BERT-based models or other LLMs for code-mixing deidentification in clinical data sets, our previous work [38] suggested the potential benefits of incorporating M-BERT to disambiguate PHI categories in code-mixed sentences. In this study, we focused on comprehending how BERT-family PLMs handle Chinese-English mixed issues that arise in actual clinical text and assessed the feasibility of using state-of-the-art LLMs to recognize PHI from sampled clinical text.…”
Section: Deidentification Methods and Approaches For Tackling Code-mi...
confidence: 99%
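As a rough illustration of the M-BERT setup the citing authors describe, the sketch below loads the public bert-base-multilingual-cased checkpoint with a token-classification head; the PHI label set is an assumption, and the head is randomly initialized until fine-tuned on annotated clinical text, so this is a starting point rather than the cited work's actual configuration.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed PHI label set (BIO scheme); the cited studies may use different categories.
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-ID", "I-ID"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# M-BERT's shared WordPiece vocabulary tokenizes Chinese and English in one
# pass, which is what makes it attractive for code-mixed clinical text.
encoding = tokenizer("病人王小明 on 2021/03/15 接受 CABG 手術。", return_tensors="pt")
logits = model(**encoding).logits      # shape: (1, seq_len, len(labels))
predictions = logits.argmax(dim=-1)    # per-token label ids (untrained head here)
```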
“…The test set comprised 200 discharge summaries along with 60,632 sentences. Finally, the principle-based resynthesis method proposed in our previous work [38] was used to generate surrogates, and the entire corpus was rechecked by one of the senior annotators (PTC) to ensure a high level of data consistency and correctness.…”
Section: Data Sources and Corpus Construction
confidence: 99%
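The surrogate-generation step mentioned here can be sketched as replacing each annotated PHI span with a same-category stand-in, for example shifting dates by a per-record offset and drawing names from a pool; the pools, offset, and span format below are hypothetical and are not the published resynthesis rules.

```python
import random
from datetime import datetime, timedelta

# Hypothetical surrogate pool; a principle-based system would draw surrogates
# from category-specific rules that preserve format and plausibility.
NAME_POOL = ["陳大文", "林美玲", "張志豪"]

def resynthesize(text, spans, date_shift_days=17):
    """Replace annotated PHI spans (start, end, label) with surrogates.

    Spans are processed right-to-left so earlier character offsets stay valid.
    """
    for start, end, label in sorted(spans, reverse=True):
        original = text[start:end]
        if label == "DATE":
            shifted = datetime.strptime(original, "%Y/%m/%d") + timedelta(days=date_shift_days)
            surrogate = shifted.strftime("%Y/%m/%d")
        elif label == "NAME":
            surrogate = random.choice(NAME_POOL)
        else:  # fall back to a category placeholder
            surrogate = f"<{label}>"
        text = text[:start] + surrogate + text[end:]
    return text

note = "病人王小明 on 2021/03/15 接受手術。"
spans = [(2, 5, "NAME"), (9, 19, "DATE")]  # character offsets into the note
print(resynthesize(note, spans))
# e.g. -> "病人陳大文 on 2021/04/01 接受手術。"
```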