2022
DOI: 10.3390/cancers14133063

Considerations for the Use of Machine Learning Extracted Real-World Data to Support Evidence Generation: A Research-Centric Evaluation Framework

Abstract: A vast amount of real-world data, such as pathology reports and clinical notes, is captured as unstructured text in electronic health records (EHRs). However, this information is both difficult and costly to extract through human abstraction, especially when scaling to large datasets is needed. Fortunately, Natural Language Processing (NLP) and Machine Learning (ML) techniques provide promising solutions for a variety of information extraction tasks such as identifying a group of patients who have a specific …

Cited by 12 publications (18 citation statements)
References 36 publications
“…Additionally, there is a need for model transparency and explainability such that model predictions can be trusted by stakeholders and therefore be more readily accepted [ 36 ]. Finally, proper model evaluation is needed to ensure that models are fair and generalizable, which requires an adequate volume of high-quality labeled test data that is not used during model training and validation [ 6 , 37 ].…”
Section: Discussion
confidence: 99%
“…These sentences are then transformed into a mathematical representation that the model can interpret. Individual models used in this study were evaluated with the research-centric evaluation framework developed by Estevez et al [ 6 ]. Each model’s performance was evaluated using a test set of over 3000 unique lung cancer patients.…”
Section: Methods
confidence: 99%
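The quoted passage describes transforming sentences into a mathematical representation a model can interpret. As an illustration only, a minimal bag-of-words count vector can play that role; the vocabulary-building scheme and example sentences below are hypothetical, not the study's actual pipeline.

```python
# Minimal sketch: turn sentences into fixed-length count vectors.
# The tokenization (lowercase whitespace split) and example data are
# illustrative assumptions, not the pipeline used in the cited study.

def build_vocab(sentences):
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(sentence, vocab):
    """Map a sentence to a count vector over the vocabulary;
    out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for token in sentence.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

vocab = build_vocab(["stage iv lung cancer", "no evidence of cancer"])
print(vectorize("lung cancer cancer", vocab))  # counts for each vocab token
```

Real systems typically use learned embeddings rather than raw counts, but the shape of the step is the same: text in, numeric vector out.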
“…Measuring performance is a complex challenge because even a model with good overall performance might systematically underperform on a particular subcohort of interest, and because while conventional metrics apply to individual models, dozens of ML-extracted variables may be combined to answer a specific research question. We use a research-centric evaluation framework 34 to assess the quality of variables curated with ML. Evaluations include one or more of the following strategies: (1) overall performance assessment, (2) stratified performance assessment, (3) quantitative error analysis, and (4) replication analysis.…”
Section: Model Evaluation and Performance Assessment
confidence: 99%
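The first two strategies in the quoted passage — overall and stratified performance assessment — can be sketched as computing the same metrics once globally and once per subcohort, so that a subgroup with systematically worse performance is surfaced rather than averaged away. The metric choice (precision/recall) and all data below are illustrative assumptions, not the framework's actual specification.

```python
# Illustrative sketch of overall vs. stratified performance assessment.
# Labels, predictions, and subcohort names are hypothetical example data.

from collections import defaultdict

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def stratified_report(y_true, y_pred, groups):
    """Overall metrics plus the same metrics per subcohort, so that
    systematic underperformance on one subgroup is visible."""
    report = {"overall": precision_recall(y_true, y_pred)}
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    for g, (t, p) in buckets.items():
        report[g] = precision_recall(t, p)
    return report

report = stratified_report(
    y_true=[1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 0, 1, 1, 0],
    groups=["A", "A", "A", "B", "B", "B"],
)
print(report)  # overall metrics plus per-subcohort metrics
```

A model can look acceptable on the "overall" row while one subcohort's row is clearly worse, which is exactly the failure mode the quoted passage warns about.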