2022
DOI: 10.1109/access.2022.3148396

Principle-Based Approach for the De-Identification of Code-Mixed Electronic Health Records

Abstract: Code-mixing is a phenomenon in which at least two languages are combined in a hybrid way within a single conversation. The use of mixed language is widespread in multilingual and multicultural countries and poses significant challenges for the development of automated language-processing tools. In Taiwan's electronic health record (EHR) systems, unstructured EHR texts are usually written in a mix of English and Chinese, resulting in difficulty for de-identification and synthetizati…
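To make the de-identification challenge concrete, here is a minimal rule-based sketch over an invented code-mixed note; the regular-expression patterns, PHI categories, and example sentence are illustrative assumptions, not the paper's actual principle set.

```python
import re

# Hypothetical PHI patterns for a Chinese-English code-mixed note;
# a real system needs far richer rules or a trained sequence labeler.
PHI_PATTERNS = {
    "DATE": r"\d{4}[-/]\d{1,2}[-/]\d{1,2}",            # e.g. 2021/03/15
    "ID":   r"\b[A-Z]{2}\d{6,}\b",                     # e.g. chart number MR123456
    "NAME": r"(?:病人|患者)[\u4e00-\u9fff]{2,3}",       # Chinese name after "patient"
}

def deidentify(text: str) -> str:
    """Replace each matched PHI span with its category tag."""
    for label, pattern in PHI_PATTERNS.items():
        text = re.sub(pattern, f"<{label}>", text)
    return text

# A code-mixed example: English clinical terms embedded in a Chinese sentence.
note = "病人王小明 was admitted on 2021/03/15, chart no. MR123456, for CABG."
print(deidentify(note))
# -> "<NAME> was admitted on <DATE>, chart no. <ID>, for CABG."
```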

Citations: cited by 5 publications (2 citation statements)
References: 16 publications
“…For example, Tang et al [15] fine-tuned M-BERT on their English-simplified Chinese social media data set for multi-label sentiment analysis, and obtained an F-score of 0.69, which was 15% higher than that for the model without M-BERT. While limited research exists on using BERT-based models or other LLMs for code-mixing deidentification in clinical data sets, our previous work [38] suggested the potential benefits of incorporating M-BERT to disambiguate PHI categories in code-mixed sentences. In this study, we focused on comprehending how BERT-family PLMs handle Chinese-English mixed issues that arise in actual clinical text and assessed the feasibility of using state-of-the-art LLMs to recognize PHI from sampled clinical text.…”
Section: Deidentification Methods and Approaches For Tackling Code-mi...
confidence: 99%
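As a rough illustration of the M-BERT setup the citing authors describe, the sketch below loads the public bert-base-multilingual-cased checkpoint with a token-classification head; the PHI label set is an assumption, and the head is randomly initialized until fine-tuned on annotated clinical text, so this is a starting point rather than the cited work's actual configuration.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed PHI label set (BIO scheme); the cited studies may use different categories.
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-ID", "I-ID"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# M-BERT's shared WordPiece vocabulary tokenizes Chinese and English in one
# pass, which is what makes it attractive for code-mixed clinical text.
encoding = tokenizer("病人王小明 on 2021/03/15 接受 CABG 手術。", return_tensors="pt")
logits = model(**encoding).logits      # shape: (1, seq_len, len(labels))
predictions = logits.argmax(dim=-1)    # per-token label ids (untrained head here)
```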
“…The test set comprised 200 discharge summaries along with 60,632 sentences. Finally, the principle-based resynthesis method proposed in our previous work [38] was used to generate surrogates, and the entire corpus was rechecked by one of the senior annotators (PTC) to ensure a high level of data consistency and correctness.…”
Section: Data Sources and Corpus Construction
confidence: 99%
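The surrogate-generation step mentioned here can be sketched as replacing each annotated PHI span with a same-category stand-in, for example shifting dates by a per-record offset and drawing names from a pool; the pools, offset, and span format below are hypothetical and are not the published resynthesis rules.

```python
import random
from datetime import datetime, timedelta

# Hypothetical surrogate pool; a principle-based system would draw surrogates
# from category-specific rules that preserve format and plausibility.
NAME_POOL = ["陳大文", "林美玲", "張志豪"]

def resynthesize(text, spans, date_shift_days=17):
    """Replace annotated PHI spans (start, end, label) with surrogates.

    Spans are processed right-to-left so earlier character offsets stay valid.
    """
    for start, end, label in sorted(spans, reverse=True):
        original = text[start:end]
        if label == "DATE":
            shifted = datetime.strptime(original, "%Y/%m/%d") + timedelta(days=date_shift_days)
            surrogate = shifted.strftime("%Y/%m/%d")
        elif label == "NAME":
            surrogate = random.choice(NAME_POOL)
        else:  # fall back to a category placeholder
            surrogate = f"<{label}>"
        text = text[:start] + surrogate + text[end:]
    return text

note = "病人王小明 on 2021/03/15 接受手術。"
spans = [(2, 5, "NAME"), (9, 19, "DATE")]  # character offsets into the note
print(resynthesize(note, spans))
# e.g. -> "病人陳大文 on 2021/04/01 接受手術。"
```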