In this paper, we demonstrate the use of a "Challenge Dataset": a small, site-specific, manually curated dataset - enriched with uncommon, risk-exposing, and clinically important edge cases - that can facilitate pre-deployment evaluation and identification of clinically relevant AI performance deficits. The five major steps of the Challenge Dataset process are described in detail: defining use cases, edge case selection, dataset size determination, dataset compilation, and model evaluation. Evaluating the performance of four chest X-ray classifiers (one third-party developer model and three models trained on open-source datasets) on a small, manually curated dataset (410 images), we observe a generalization gap of 20.7% (13.5% - 29.1%) for sensitivity and 10.5% (4.3% - 18.3%) for specificity compared to developer-reported values. Performance decreases further when evaluated against edge cases (critical findings: 43.4% [27.4% - 59.8%]; unusual findings: 45.9% [23.1% - 68.7%]; solitary findings: 45.9% [23.1% - 68.7%]). Expert manual audit revealed examples of critical model failure (e.g., missed pneumomediastinum) with potential for patient harm. As a measure of effort, we find that the minimum required number of Challenge Dataset cases is about 1% of the annual total for our site (approximately 400 of 40,000). Overall, we find that the Challenge Dataset process provides a method for local pre-deployment evaluation of medical imaging AI models, allowing imaging providers to identify both deficits in model generalizability and specific points of failure prior to clinical deployment.
e15581 Background: The randomized phase II CCTG CO.26 clinical trial investigated the use of combined durvalumab and tremelimumab vs. best supportive care (BSC) for patients with metastatic colorectal cancer (mCRC) and suggested an increase in overall survival (OS). The largest benefit was seen in patients who were microsatellite stable (MSS) with a plasma tumour mutation burden (pTMB) ≥ 28 variants per megabase. Given the significantly higher adverse event rates and costs associated with durvalumab and tremelimumab, it is important to evaluate the cost-effectiveness of this combination. Accordingly, we performed a cost-utility analysis of durvalumab and tremelimumab compared to BSC in the intention-to-treat (ITT) and biomarker-enriched populations using CO.26 trial data. Methods: We developed a 4-state microsimulation model to evaluate the expected health outcomes in life-years (LYs), quality-adjusted life-years (QALYs) and costs of the treatment group compared to BSC over a lifetime horizon (5 years). The incremental cost-utility ratio (ICUR) was used to compare treatment strategies. Trial data from CO.26 were used directly to inform model inputs, including OS curves, progression-free survival (PFS) curves, and adverse event rates. As health state utilities were not collected in CO.26, values from the CORRECT trial, a multi-centre randomized placebo-controlled phase III study of regorafenib in mCRC, were used. Costs of therapy, hospitalization due to adverse events, end-of-life care, and physician services were derived from the literature and publicly available sources (in 2020 Canadian dollars). Since the monthly price of tremelimumab was unavailable, it was approximated with the price of another CTLA-4 inhibitor, ipilimumab. The base-case analysis evaluated these treatment strategies in the ITT population. Scenario analyses evaluated cost-effectiveness in biomarker-enriched populations. Costs and effects were discounted at 1.5% as per Canadian guidelines.
Results: In the base case, expected LYs for combined durvalumab and tremelimumab and for BSC were 0.75 and 0.51, respectively (incremental (Δ) 0.24). Expected QALYs were 0.47 and 0.33 (Δ 0.14). Expected lifetime costs were $60 500 and $15 500 (Δ $45 000), for an ICUR of $320 000/QALY. In the biomarker-enriched subgroup, the expected LYs were 0.67 and 0.33 (Δ 0.34), expected QALYs were 0.43 and 0.22 (Δ 0.21), and expected lifetime costs were $62 000 and $15 200 (Δ $47 000). Relative to the base case, this represents a 50% increase in incremental QALYs and a 5% increase in incremental costs, giving an ICUR of $220 000/QALY, 30% lower than in the base case. Conclusions: Combined durvalumab and tremelimumab is not considered cost-effective in refractory mCRC under conventional willingness-to-pay thresholds. Cost-effectiveness is improved with biomarker enrichment for high pTMB, driven by the greater derived health outcomes in this subgroup.
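The reported ratios can be checked from the rounded figures given in the Results. The following is a minimal sketch of that arithmetic only; the function name `icur` is illustrative and is not taken from the study's microsimulation model, and because the inputs are the rounded abstract values, the results agree with the reported ICURs only to the nearest $10 000.

```python
def icur(cost_new: float, cost_ref: float, qaly_new: float, qaly_ref: float) -> float:
    """Incremental cost-utility ratio: incremental cost per incremental QALY."""
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly == 0:
        raise ValueError("ICUR is undefined when incremental QALYs are zero")
    return (cost_new - cost_ref) / delta_qaly


# Rounded values as reported in the abstract (2020 Canadian dollars).
base_case = icur(60_500, 15_500, 0.47, 0.33)         # ITT population
enriched = icur(62_000, 15_200, 0.43, 0.22)          # biomarker-enriched subgroup

# Relative reduction in ICUR with biomarker enrichment.
relative_reduction = 1 - enriched / base_case

print(f"Base case ICUR: ${base_case:,.0f}/QALY")
print(f"Enriched ICUR:  ${enriched:,.0f}/QALY")
print(f"Relative reduction: {relative_reduction:.0%}")
```

Rounded to the nearest $10 000, both values match the reported $320 000/QALY and $220 000/QALY, and the relative reduction is close to the stated 30%.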