Diagnostic AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings1,2. However, such systems are not always reliable and can fail in cases diagnosed accurately by clinicians, and vice versa3. Mechanisms for leveraging this complementarity by learning to select optimally between the discordant decisions of AIs and clinicians have remained largely unexplored in healthcare4, yet have the potential to achieve levels of performance that exceed what is possible with either the AI or the clinician alone4. We develop a Complementarity-driven Deferral-to-Clinical Workflow (CoDoC) system that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinician or their workflow. We show that our system is compatible with diagnostic AI models from multiple manufacturers, obtaining enhanced accuracy (sensitivity and/or specificity) relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis. For breast cancer, we demonstrate the first system that exceeds the accuracy of double-reading with arbitration (the “gold standard” of care) in a large representative UK screening program, with a 25% reduction in false positives despite equivalent true-positive detection, while achieving a 66% reduction in clinical workload. In two separate US datasets, CoDoC exceeds the accuracy of single reading by board-certified radiologists and of two different standalone state-of-the-art AI systems, with this finding generalising across diagnostic AI manufacturers. For TB screening with chest X-rays, CoDoC improved specificity (while maintaining sensitivity) compared with standalone AI or clinicians for 3 of 5 commercially available diagnostic AI systems (a 5–15% reduction in false positives).
Further, we show the limits of confidence-score-based deferral systems for medical AI by demonstrating that no deferral strategy could have achieved a significant improvement with the remaining two diagnostic AI systems. Our comprehensive assessment demonstrates that the superiority of CoDoC is sustained in multiple realistic stress tests for the generalisation of medical AI tools along four axes: variation in the medical imaging modality; variation in clinical settings and human experts; different clinical deferral pathways within a given modality; and different AI software. Further, given the simplicity of CoDoC, we believe that practitioners can easily adapt it, and we provide an open-source implementation to encourage widespread further research and application.
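The core idea behind confidence-score-based deferral can be sketched in a few lines: act on the AI's score when it is confidently low or high, and defer to the clinician in the uncertain band between two thresholds tuned on held-out data. The function names, thresholds, and toy data below are illustrative assumptions, not CoDoC's actual method or values:

```python
# Toy sketch of two-threshold deferral: trust the AI at the extremes of its
# confidence score, defer to the clinician otherwise. Illustrative only.

def decide(ai_score, clinician_opinion, t_low, t_high):
    """Return (decision, deferred) for one case."""
    if ai_score <= t_low:
        return 0, False                   # AI confidently negative
    if ai_score >= t_high:
        return 1, False                   # AI confidently positive
    return clinician_opinion, True        # uncertain band: defer

def fit_thresholds(scores, clinician_opinions, labels, grid):
    """Grid-search the threshold pair maximising accuracy on a tuning set."""
    best, best_acc = (0.0, 1.0), -1.0
    for t_low in grid:
        for t_high in grid:
            if t_low > t_high:
                continue
            preds = [decide(s, c, t_low, t_high)[0]
                     for s, c in zip(scores, clinician_opinions)]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best_acc:
                best_acc, best = acc, (t_low, t_high)
    return best

# Tiny synthetic tuning set: the AI is reliable at the extremes, while the
# clinician is more accurate in the ambiguous middle band.
scores     = [0.05, 0.10, 0.45, 0.55, 0.90, 0.95]
clinicians = [0,    1,    1,    0,    1,    1]
labels     = [0,    0,    1,    0,    1,    1]
grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
t_low, t_high = fit_thresholds(scores, clinicians, labels, grid)
```

On this toy set the fitted band defers exactly the ambiguous middle cases to the clinician, recovering all labels; the actual system additionally has to control sensitivity/specificity trade-offs, not just accuracy.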
Motivation: Computational models that accurately identify high-affinity protein-compound pairs can accelerate drug discovery pipelines. These models aim to learn binding mechanics through drug-target interaction datasets and use the learned knowledge to predict the affinity of an input protein-compound pair. However, the datasets they rely on bear misleading patterns that bias models towards memorizing dataset-specific biomolecule properties, instead of learning binding mechanics. This results in models that struggle to predict drug-target affinities (DTA), especially between de novo biomolecules. Here we present DebiasedDTA, the first DTA model debiasing approach that avoids dataset biases in order to boost affinity prediction for novel biomolecules. DebiasedDTA uses ensemble learning and sample weight adaptation for bias identification and avoidance, and is applicable to almost all existing DTA prediction models. Results: The results show that DebiasedDTA can boost models while predicting the interactions between novel biomolecules. Known biomolecules also benefit from the performance improvement, especially when the test biomolecules are dissimilar to the training set. The experiments also show that DebiasedDTA can augment DTA prediction models of different input and model structures and is able to avoid biases of different sources. Availability and Implementation: The source code, the models, and the datasets are freely available for download at https://github.com/boun-tabi/debiaseddta-reproduce, implemented in Python 3 and supported on Linux, macOS and MS Windows.
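The bias-identification-and-avoidance idea in the abstract can be sketched as follows: fit a deliberately weak "guide" model that can only exploit a dataset shortcut, then upweight the samples it fails on, so the main model's training is dominated by samples that require actual binding mechanics. The guide, data, and weighting scheme below are illustrative assumptions (the paper's guides and training protocol differ, e.g. it uses cross-validated guide predictions):

```python
# Toy sketch of bias identification with a weak "guide" model.
# The guide ignores binding mechanics and predicts affinity from a
# dataset-specific shortcut: the mean affinity of each compound in the
# training set. Samples the guide predicts well are explainable by the
# shortcut and get down-weighted; hard samples get up-weighted.
from collections import defaultdict

train = [  # (compound_id, protein_id, affinity) — illustrative data
    ("c1", "p1", 5.0), ("c1", "p2", 5.1),   # c1: shortcut works well
    ("c2", "p1", 2.0), ("c2", "p3", 8.0),   # c2: shortcut fails badly
]

# Guide: per-compound mean affinity (memorises a biomolecule property).
by_compound = defaultdict(list)
for compound, _, affinity in train:
    by_compound[compound].append(affinity)
guide = {c: sum(ys) / len(ys) for c, ys in by_compound.items()}

# Weight each sample by the guide's squared error, normalised to sum to 1.
errors = [(guide[c] - y) ** 2 for c, _, y in train]
total = sum(errors)
weights = [e / total for e in errors]
# The main DTA model would then be trained with these sample weights.
```

Here the two c2 samples, which the compound-mean shortcut cannot explain, receive almost all of the training weight.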
Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared loss x → x^2, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss x → |x|^p for some p < 2. (ii) Depending on the variance of the data, there exists a 'threshold of heavy-tailedness' such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
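The intuition behind measuring stability with a surrogate loss |x|^p rather than x^2 is a moment-existence argument: for a heavy-tailed random variable with tail index between p and 2, the p-th absolute moment is finite while the second moment is not. A small stdlib-only sketch, with Pareto samples standing in for heavy-tailed SGD iterates (an illustration of the moment phenomenon, not the paper's actual construction):

```python
# For a Pareto-tailed X with tail index alpha, E|X|^p is finite iff p < alpha.
# With alpha = 1.5, the surrogate moment E|X|^1.2 exists, while E[X^2] is
# infinite: empirical second moments keep growing erratically with the
# sample size, whereas the p-th moment settles down.
import random

random.seed(0)
alpha, p = 1.5, 1.2                      # tail index sits between p and 2
xs = [random.paretovariate(alpha) for _ in range(100_000)]

moment_p = sum(abs(x) ** p for x in xs) / len(xs)   # converges (p < alpha)
moment_2 = sum(x ** 2 for x in xs) / len(xs)        # diverges as n grows

# Pareto samples are >= 1, so x**1.2 <= x**2 pointwise: the surrogate
# moment is the smaller of the two, and unlike the second moment it is
# stable under resampling.
```

In the paper's setting the same dichotomy makes uniform stability provable under |x|^p but impossible under x^2 once the iterates are sufficiently heavy-tailed.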