Background
The Flatiron Health-Foundation Medicine Clinico-Genomic Databases (CGDBs) are de-identified, real-world data sources that link comprehensive genomic profiling (CGP) data with clinical data derived from electronic health records (EHRs) for patients with cancer. Comparing the CGDBs to the US population of patients with cancer allows researchers to understand the representativeness of a cohort when designing, conducting, and interpreting their analyses. The objective of this study was to compare the demographic and clinical characteristics of patients in the CGDBs with those of patients in the Flatiron Health Research Databases (FHRDs) and the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) population-based cancer registry.

Methods
We compared disease-specific CGDBs that had corresponding disease-specific FHRDs with relevant SEER patients, using demographic and clinical characteristics of patients with cancer who had documented care from January 1, 2011 to March 31, 2021. For CGDBs without a corresponding disease-specific FHRD, comparisons were made against SEER only. The SEER Incidence Data 1975-2018 Research Database was used for this analysis; patients with a relevant cancer diagnosis from January 1, 2011 to December 31, 2018 were included. Subgroup analyses were performed to address potential biases related to temporal drift, to allow a more direct comparison of the datasets, and to examine biases that may be due to missing data. It was not feasible to analyze the impact of the determination to reimburse for next-generation sequencing (NGS) testing, because the most recent SEER data available at the time of this study extended only through the end of 2018.

Results
The overall distribution of cancer types was similar between the 22 CGDBs and SEER. The overall distributions of gender and diagnosis year were similar across all databases.
The CGDBs had a lower proportion of patients aged 80 years or older at initial diagnosis than the FHRD and SEER cohorts. However, the differences were narrower in diseases where targeted therapies are approved and comprehensive genomic profiling is indicated (e.g., melanoma, NSCLC). The proportion of incomplete records for race was greater in the CGDBs and FHRDs than in SEER. Completeness of stage varied by disease across all 3 cohorts, but was generally lower in the CGDBs and FHRDs for clinical and data model design reasons. Overall, the stage distributions for solid tumor cohorts were similar across the CGDBs and FHRDs, with SEER tending to have more early-stage patients, which is expected given the differences in data collection methods among the sources.

Conclusion
This comparative analysis of real-world, US-based oncology databases provides crucial insights into the similarities and differences in patient characteristics across these three types of data sources. The observed differences could be due to several factors, including differences in CGP testing dynamics and in the data collection approaches used to create each database. Ongoing monitoring and evaluation of the representativeness of these databases will be critical to help researchers and regulators contextualize evidence from the CGDBs, particularly as the CGDBs are expected to change over time with increased adoption of CGP as part of routine clinical practice for a growing number of cancers.