A marker strongly associated with outcome (or disease) is often assumed to be effective for classifying persons according to their current or future outcome. However, for this assumption to be true, the associated odds ratio must be of a magnitude rarely seen in epidemiologic studies. In this paper, an illustration of the relation between odds ratios and receiver operating characteristic curves shows, for example, that a marker with an odds ratio of as high as 3 is in fact a very poor classification tool. If a marker identifies 10% of controls as positive (false positives) and has an odds ratio of 3, then it will correctly identify only 25% of cases as positive (true positives). The authors illustrate that a single measure of association such as an odds ratio does not meaningfully describe a marker's ability to classify subjects. Appropriate statistical methods for assessing and reporting the classification power of a marker are described. In addition, the serious pitfalls of using more traditional methods based on parameters in logistic regression models are illustrated.
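The abstract's numerical example follows directly from the identity relating a binary marker's odds ratio to its true- and false-positive rates, OR = [TPR/(1 − TPR)] / [FPR/(1 − FPR)]. A minimal sketch (not the authors' code; the function name is ours) solving this identity for the TPR:

```python
def tpr_from_or(odds_ratio: float, fpr: float) -> float:
    """True-positive rate implied by a given odds ratio at a fixed
    false-positive rate, using OR = [TPR/(1-TPR)] / [FPR/(1-FPR)]."""
    case_odds = odds_ratio * fpr / (1.0 - fpr)  # odds of positivity among cases
    return case_odds / (1.0 + case_odds)        # convert odds back to a rate

# The abstract's example: OR = 3 with 10% of controls testing positive
print(round(tpr_from_or(3.0, 0.10), 4))  # prints 0.25: only 25% of cases flagged
```

At any fixed FPR, an OR of 3 buys very little separation between cases and controls, which is the abstract's point about odds ratios overstating classification value.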
IMPORTANCE A breast pathology diagnosis provides the basis for clinical treatment and management decisions; however, its accuracy is inadequately understood. OBJECTIVES To quantify the magnitude of diagnostic disagreement among pathologists compared with a consensus panel reference diagnosis and to evaluate associated patient and pathologist characteristics. DESIGN, SETTING, AND PARTICIPANTS Study of pathologists who interpret breast biopsies in clinical practices in 8 US states. EXPOSURES Participants independently interpreted slides between November 2011 and May 2014 from test sets of 60 breast biopsies (240 total cases, 1 slide per case), including 23 cases of invasive breast cancer, 73 ductal carcinoma in situ (DCIS), 72 with atypical hyperplasia (atypia), and 72 benign cases without atypia. Participants were blinded to the interpretations of other study pathologists and consensus panel members. Among the 3 consensus panel members, unanimous agreement of their independent diagnoses was 75%, and concordance with the consensus-derived reference diagnoses was 90.3%. MAIN OUTCOMES AND MEASURES The proportions of diagnoses overinterpreted and underinterpreted relative to the consensus-derived reference diagnoses were assessed. RESULTS Sixty-five percent of invited, responding pathologists were eligible and consented to participate. Of these, 91% (N = 115) completed the study, providing 6900 individual case diagnoses. Compared with the consensus-derived reference diagnosis, the overall concordance rate of diagnostic interpretations of participating pathologists was 75.3% (95% CI, 73.4%–77.0%; 5194 of 6900 interpretations). By reference diagnosis, the rates of pathologist interpretation vs the consensus-derived reference diagnosis, % (95% CI), were: benign without atypia (2070 interpretations), 87% (85%–89%) concordant and 13% (11%–15%) overinterpreted; atypia (2070 interpretations), 48% (44%–52%) concordant, 17% (15%–21%) overinterpreted, and 35% (31%–39%) underinterpreted; DCIS (2097 interpretations), 84% (82%–86%) concordant, 3% (2%–4%) overinterpreted, and 13% (12%–15%) underinterpreted; and invasive carcinoma (663 interpretations), 96% (94%–97%) concordant and 4% (3%–6%) underinterpreted. Disagreement with the reference diagnosis was statistically significantly higher among biopsies from women with higher (n = 122) vs lower (n = 118) breast density on prior mammograms (overall concordance rate, 73% [95% CI, 71%–75%] for higher vs 77% [95% CI, 75%–80%] for lower, P < .001), and among pathologists who interpreted lower weekly case volumes (P < .001) or worked in smaller practices (P = .034) or nonacademic settings (P = .007). CONCLUSIONS AND RELEVANCE In this study of pathologists, in which diagnostic interpretation was based on a single breast biopsy slide, overall agreement between the individual pathologists' interpretations and the expert consensus-derived reference diagnoses was 75.3%, with the highest level of concordance for invasive carcinoma and lower levels of concordance for DCIS and atypia. Further research is needed to understand the relationship of these findings with patient management.
A systematic baseline endoscopic biopsy protocol using histology and flow cytometry identifies subsets of patients with Barrett's esophagus at low and high risk for progression to cancer. Patients whose baseline biopsies are negative, indefinite, or low-grade dysplasia without increased 4N or aneuploidy may have surveillance deferred for up to 5 years. Patients with cytometric abnormalities merit more frequent surveillance, and management of high-grade dysplasia can be individualized.
Objective To quantify the accuracy and reproducibility of pathologists’ diagnoses of melanocytic skin lesions. Design Observer accuracy and reproducibility study. Setting 10 US states. Participants Skin biopsy cases (n=240), grouped into sets of 36 or 48. Pathologists from 10 US states were randomized to independently interpret the same set on two occasions (phases 1 and 2), at least eight months apart. Main outcome measures Pathologists’ interpretations were condensed into five classes: I (eg, nevus or mild atypia); II (eg, moderate atypia); III (eg, severe atypia or melanoma in situ); IV (eg, pathologic stage T1a (pT1a) early invasive melanoma); and V (eg, ≥pT1b invasive melanoma). Reproducibility was assessed by intraobserver and interobserver concordance rates, and accuracy by concordance with three reference diagnoses. Results In phase 1, 187 pathologists completed 8976 independent case interpretations resulting in an average of 10 (SD 4) different diagnostic terms applied to each case. Among pathologists interpreting the same cases in both phases, when pathologists diagnosed a case as class I or class V during phase 1, they gave the same diagnosis in phase 2 for the majority of cases (class I 76.7%; class V 82.6%). However, the intraobserver reproducibility was lower for cases interpreted as class II (35.2%), class III (59.5%), and class IV (63.2%). Average interobserver concordance rates were lower, but with similar trends. Accuracy using a consensus diagnosis of experienced pathologists as reference varied by class: I, 92% (95% confidence interval 90% to 94%); II, 25% (22% to 28%); III, 40% (37% to 44%); IV, 43% (39% to 46%); and V, 72% (69% to 75%). 
It is estimated that at a population level, 82.8% (81.0% to 84.5%) of melanocytic skin biopsy diagnoses would have their diagnosis verified if reviewed by a consensus reference panel of experienced pathologists, with 8.0% (6.2% to 9.9%) of cases overinterpreted by the initial pathologist and 9.2% (8.8% to 9.6%) underinterpreted. Conclusion Diagnoses spanning moderately dysplastic nevi to early stage invasive melanoma were neither reproducible nor accurate in this large study of pathologists in the USA. Efforts to improve clinical practice should include using a standardized classification system, acknowledging uncertainty in pathology reports, and developing tools such as molecular markers to support pathologists’ visual assessments.
Purpose: Early detection would significantly decrease the mortality rate of ovarian cancer. In this study, we characterize and validate the combination of six serum biomarkers that discriminate between disease-free and ovarian cancer patients with high efficiency. Experimental Design: We analyzed 362 healthy controls and 156 newly diagnosed ovarian cancer patients. Concentrations of leptin, prolactin, osteopontin, insulin-like growth factor II, macrophage inhibitory factor, and CA-125 were determined using a multiplex, bead-based, immunoassay system. All six markers were evaluated in a training set (181 samples from the control group and 113 samples from ovarian cancer patients) and a test set (181 samples from the control group and 43 from ovarian cancer patients). Results: Multiplex and ELISA exhibited the same pattern of expression for all the biomarkers. No single biomarker on its own was sufficient to differentiate healthy controls from cancer patients. However, the combination of the six markers provided a better differentiation than CA-125. Four models with <2% classification error in training sets all had significant improvement (sensitivity 84%–98% at specificity 95%) over CA-125 (sensitivity 72% at specificity 95%) in the test set. The chosen model correctly classified 221 of 224 specimens in the test set, with a classification accuracy of 98.7%. Conclusions: We describe the first blood biomarker test with a sensitivity of 95.3% and a specificity of 99.4% for the detection of ovarian cancer. Six markers provided a significant improvement over CA-125 alone for ovarian cancer detection. Validation was performed with a blinded cohort. This novel multiplex platform has the potential for efficient screening in patients who are at high risk for ovarian cancer.
There are two popular statistical approaches to biomarker evaluation. One models the risk of disease (or disease outcome) with, for example, logistic regression. A marker is considered useful if it has a strong effect on risk. The second evaluates classification performance by use of measures such as sensitivity, specificity, predictive values, and receiver operating characteristic curves. There is controversy about which approach is more appropriate. Moreover, the two approaches can give contradictory results on the same data. The authors present a new graphic, the predictiveness curve, which complements the risk modeling approach. It assesses the usefulness of a risk model when applied to the population. Although the predictiveness curve relates to classification performance measures, it also displays essential information about risk that is not displayed by the receiver operating characteristic curve. The authors propose that the predictiveness and classification performance of a marker, displayed together in an integrated plot, provide a comprehensive and cohesive assessment of a risk marker or model. The methods are demonstrated with data on prostate-specific antigen and risk factors from the Prostate Cancer Prevention Trial, 1993-2003.
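The empirical version of the curve the abstract describes is simple to construct: plot R(v), the v-th quantile of model-predicted risk, against the risk percentile v in the population. A minimal sketch under that reading (illustrative only; the function name and toy distributions are ours, not the authors'):

```python
import numpy as np

def predictiveness_curve(risks, grid_points=101):
    """Empirical predictiveness curve: for each percentile v in [0, 1],
    return R(v), the v-th quantile of the predicted risks."""
    risks = np.asarray(risks, dtype=float)
    v = np.linspace(0.0, 1.0, grid_points)
    return v, np.quantile(risks, v)

# Toy comparison: a curve that is flat at the disease prevalence means the
# model stratifies risk poorly; a steep curve means strong stratification.
rng = np.random.default_rng(0)
flat = np.full(1000, 0.1)                 # uninformative model: everyone near prevalence
steep = rng.beta(0.5, 4.5, size=1000)     # informative model: risks spread out (mean 0.1)
v, r_flat = predictiveness_curve(flat)
_, r_steep = predictiveness_curve(steep)
```

Plotting (v, R(v)) for both models shows at a glance how much of the population a risk model moves away from the average risk, which is the information an ROC curve does not display.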
No single biomarker for cancer is considered adequately sensitive and specific for cancer screening. It is expected that the results of multiple markers will need to be combined in order to yield adequately accurate classification. Typically, the objective function that is optimized for combining markers is the likelihood function. In this article, we consider an alternative objective function: the area under the empirical receiver operating characteristic curve (AUC). We note that it yields consistent estimates of parameters in a generalized linear model for the risk score but does not require specifying the link function. Like logistic regression, it yields consistent estimation with case-control or cohort data. Simulation studies suggest that AUC-based classification scores have performance comparable with logistic likelihood-based scores when the logistic regression model holds. Analysis of data from a proteomics biomarker study shows that performance can be far superior to logistic regression derived scores when the logistic regression model does not hold. Model fitting by maximizing the AUC rather than the likelihood should be considered when the goal is to derive a marker combination score for classification or prediction.
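A toy sketch of the idea (not the paper's estimator; the grid-search approach and all names are ours): score two markers with a linear combination and pick the direction that maximizes the empirical AUC, i.e. the Mann-Whitney statistic, rather than a logistic likelihood.

```python
import numpy as np

def empirical_auc(scores_cases, scores_controls):
    """Mann-Whitney estimate of the area under the empirical ROC curve."""
    diff = scores_cases[:, None] - scores_controls[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

def best_linear_combination(cases, controls, n_grid=181):
    """Grid-search the direction (cos t, sin t) maximizing empirical AUC
    for a two-marker linear score; only the direction matters, since
    rescaling a score leaves its ROC curve unchanged."""
    best_auc, best_w = -np.inf, None
    for theta in np.linspace(0.0, np.pi, n_grid):
        w = np.array([np.cos(theta), np.sin(theta)])
        auc = empirical_auc(cases @ w, controls @ w)
        auc = max(auc, 1.0 - auc)  # flipping the score's sign flips the ROC
        if auc > best_auc:
            best_auc, best_w = auc, w
    return best_auc, best_w

# Toy data: two weakly informative markers that combine well
rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=(200, 2))
cases = rng.normal([1.0, 1.0], 1.0, size=(200, 2))
auc, w = best_linear_combination(cases, controls)
```

Because the grid includes each single marker as a special case, the combined score's empirical AUC can never fall below the better single marker's, illustrating why combining markers is attractive for screening.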
New methodology has been proposed in recent years for evaluating the improvement in prediction performance gained by adding a new predictor, Y, to a risk model containing a set of baseline predictors, X, for a binary outcome D. We prove theoretically that null hypotheses concerning no improvement in performance are equivalent to the simple null hypothesis that Y is not a risk factor when controlling for X, H0: P(D = 1 | X, Y) = P(D = 1 | X). Therefore, testing for improvement in prediction performance is redundant if Y has already been shown to be a risk factor. We also investigate properties of tests through simulation studies, focusing on the change in the area under the ROC curve (AUC). An unexpected finding is that standard testing procedures that do not adjust for variability in estimated regression coefficients are extremely conservative. This may explain why the AUC is widely considered insensitive to improvements in prediction performance and suggests that the problem of insensitivity has to do with use of invalid procedures for inference rather than with the measure itself. To avoid redundant testing and use of potentially problematic methods for inference, we recommend that hypothesis testing for no improvement be limited to evaluation of Y as a risk factor, for which methods are well developed and widely available. Analyses of measures of prediction performance should focus on estimation rather than on testing for no improvement in performance.