PPM8: A Machine Learning Model for Cancer Biomarker Identification in Electronic Health Records

Ambwani, G.; Cohen, Aaron B.; Estévez, Melissa; Singh, Nitika; Adamson, Blythe; Nussbaum, Nathan; Birnbaum, Ben

doi:10.1016/j.jval.2019.04.1631

Cited by 5 publications

(4 citation statements)

References 0 publications


“…A separate part of the model is able to extract from the document the date a result was returned and the biomarker result. Early efforts with a regularized logistic regression model were presented previously 40 and more sophisticated models have been developed since.…”

Section: Results (mentioning)

confidence: 99%

Approach to Machine Learning for Extraction of Real-World Data Variables from Electronic Health Records

Adamson, Waskom, Blarre et al. 2023

Preprint


Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI’s ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information. Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manually abstracted data. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. Conclusions: NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.


“…A separate part of the model is able to extract from the document the date a result was returned and the biomarker result. Early efforts with a regularized logistic regression model were presented previously (Ambwani et al, 2019) and more sophisticated models have been developed since.…”

Section: Results (mentioning)

confidence: 99%

Approach to machine learning for extraction of real-world data variables from electronic health records

Adamson, Waskom, Blarre et al. 2023

Front. Pharmacol.


Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI’s ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (e.g., clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (i.e., not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information. Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manually abstracted data. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. Conclusion: NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.


“…(3) Real-world clinico-genomic data can also be used to train and validate machine learning algorithms, identifying new complex molecular signatures that may inform clinical decision making. (4)(5)(6) Whether the data are reflective of the target population defined in a specific application is termed the representativeness of a dataset (i.e., the closeness with which sampled patients from a setting of interest align with the patient population at large in terms of relevant demographic and clinical characteristics). (7) In cases of imperfect representativeness, individual analytic conclusions can only be appraised if the sources and degree of non-representativeness are understood, documented, and clearly communicated.…”

Section: Introduction (mentioning)

confidence: 99%


Comparison of Population Characteristics in Real-World Clinical Oncology Databases in the US: Flatiron Health-Foundation Medicine Clinico-Genomic Databases, Flatiron Health Research Databases, and the National Cancer Institute SEER Population-Based Cancer Registry

Snow, Snider, Comment et al. 2023

Preprint


Background: The Flatiron Health-Foundation Medicine Clinico-Genomic Databases (CGDBs) are de-identified, real-world data sources that link comprehensive genomic profiling (CGP) data with clinical data derived from electronic health records (EHRs) for patients with cancer. Comparing the CGDBs to the US population of patients with cancer allows researchers to understand the representativeness of a cohort when designing, conducting, and interpreting their analyses. The objective of this study was to compare the demographic and clinical characteristics of patients in the CGDBs with the Flatiron Health Research Databases (FHRDs) and The National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) population-based cancer registry. Methods: We compared disease-specific CGDBs that had corresponding disease-specific FHRDs with relevant SEER patients using demographic and clinical characteristics of patients with cancer who had documented care from January 1, 2011 to March 31, 2021. For CGDBs where a corresponding disease-specific FHRD does not exist, comparisons were only done against SEER. The SEER Incidence Data 1975-2018 Research Database was used for this analysis, of which patients with a relevant cancer diagnosis from January 1, 2011 to December 31, 2018 were included. Subgroup analyses were performed to address potential biases related to temporal drifts and allow for a more direct comparison of the datasets as well as to examine biases that may be due to data missingness. The impact of the determination to reimburse for next generation sequencing (NGS) testing was not feasible to analyze given the most recent SEER data was available only through the end of 2018 at the time this study was conducted. Results: The overall distribution of cancer types was similar between the 22 CGDB databases and SEER. The overall distributions of gender and diagnosis year were similar across all databases. The CGDB has a lower proportion of patients who were aged 80 years or older at initial diagnosis compared to FHRD and SEER cohorts. However, narrower differences were observed in diseases where targeted therapies are approved and comprehensive genomic profiling is indicated (e.g., Melanoma, NSCLC). The proportion of incomplete records for race in the CGDB and FHRD was greater than in SEER. Completeness of stage varied by disease across all 3 cohorts, but was generally lower in CGDB and FHRD for clinical and data model design reasons. Overall the stage distributions for solid tumor cohorts were similar across CGDB and FHRD with SEER tending to have more earlier stage patients, which is expected given differences in data collection methods for the sources. Conclusion: This comparative analysis of real-world, US-based oncology databases provides crucial insights into the similarities and differences in patient characteristics across these three types of data sources. Observed variances could be due to several factors, including differences in CGP testing dynamics and data collection approaches used to create each of the databases. Ongoing monitoring and evaluation of the representativeness of these databases will be critical to help researchers and regulators contextualize evidence from the CGDBs, particularly as the CGDBs are expected to change over time due to increased adoption of CGP as part of routine clinical practice for a growing number of cancers.


PPM8: A Machine Learning Model for Cancer Biomarker Identification in Electronic Health Records

Abstract: to make reproductive and healthcare decisions. Screening for breast/ovarian cancer in older women may offer lower value in isolation, but its cost-effectiveness should be assessed within the context of a broader screening panel for other diseases.
