2022
DOI: 10.2196/32903

Strategies to Address the Lack of Labeled Data for Supervised Machine Learning Training With Electronic Health Records: Case Study for the Extraction of Symptoms From Clinical Notes

Abstract: Background: Automated extraction of symptoms from clinical notes is a challenging task owing to the multidimensional nature of symptom description. The availability of labeled training data is extremely limited owing to the nature of the data containing protected health information. Natural language processing and machine learning to process clinical text for such a task have great potential. However, supervised machine learning requires a great amount of labeled data to train a model, which is at t…

Citations: cited by 15 publications (4 citation statements)
References: 49 publications
“…Prior studies have explored methods for addressing the challenge of obtaining sufficient data for training [16]. To acquire clinical notes for labeling that are more likely to exhibit a minority risk factor, we used unsupervised semantic textual similarity (STS).…”
Section: Methods (mentioning)
confidence: 99%
“…As was highlighted in the introduction, the scarcity of adequately labelled data for model training is recognized as a prevailing challenge in the medical field 37. This challenge is notably apparent in the visual modality, even though numerous dependable medical language models recently have been introduced 38,39.…”
Section: Methods (mentioning)
confidence: 99%
“…Fries et al 9 utilized data programming with BioBERT to classify medical entities and demonstrated comparable results to fully supervised models on multiple benchmark datasets. Very recently, Humbert-Droz et al 27 developed a data programming-based weak supervision pipeline using Snorkel to generate weak labels for identifying the presence or absence of symptoms. Moreover, in the biomedical domain, multiple studies have used the Snorkel framework for extracting chemical reaction relationships from biomedical abstracts, 28 biomedical relation extraction, 29 and filtering biomedical research articles as relevant or nonrelevant for drug repurposing in cancer.…”
Section: Related Work (mentioning)
confidence: 99%