BackgroundAs artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI’s ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability.MethodsWe applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information.ResultsWe developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manually abstracted data. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates.ConclusionsNLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.
Objective: Machine learning has the potential to increase the scale of real-world data curated from electronic health records, but maintaining a high standard of data quality is important to avoid biasing downstream analyses. To increase scale without compromising quality, we propose a hybrid data curation methodology that employs both manual abstraction by clinical experts and automated extraction by machine learning models. Materials and Methods: Our methodology makes the determination about when to employ manual abstraction using a confidence score associated with each model output. We describe a process for selecting confidence thresholds based on simulations validated against a reference-standard labeled dataset. To establish the fitness of our methodology for retrospective research, we apply it to a multi-variable cohort selection task on a large real-world oncology database. Results: Only small amounts of manual abstraction are required for hybrid curation to achieve expert-level error rates. In fact, the hybrid methodology can even reduce error rates relative to manual abstraction in some cases. We further demonstrate that demographic characteristics of a research cohort defined using hybrid variables are comparable to one curated with conventional methods. Discussion Our methodology is general and makes few assumptions about the clinical variable or machine learning model. A key requirement is the availability of reference standard labels for calibrating the tradeoff between abstraction effort and data quality. Conclusion: Incorporating machine learning into real-world data curation using hybrid methodology holds the promise to scale practicable cohort sizes while maintaining fit for research purposes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.