Characterizing the performance of image segmentation approaches has been a persistent challenge. Performance analysis is important since segmentation algorithms often have limited accuracy and precision. Interactive drawing of the desired segmentation by human raters has often been the only acceptable approach, yet it suffers from intra-rater and inter-rater variability. Automated algorithms have been sought in order to remove the variability introduced by raters, but such algorithms must be assessed to ensure they are suitable for the task.

The performance of raters (human or algorithmic) generating segmentations of medical images has been difficult to quantify because of the difficulty of obtaining or estimating a known true segmentation for clinical data. Although physical and digital phantoms can be constructed for which ground truth is known or readily estimated, such phantoms do not fully reflect clinical images, due to the difficulty of constructing phantoms that reproduce the full range of imaging characteristics and of normal and pathological anatomical variability observed in clinical data.

Comparison to a collection of segmentations by raters is an attractive alternative since it can be carried out directly on the relevant clinical imaging data. However, the most appropriate measure or set of measures with which to compare such segmentations has not been clarified, and several measures are used in practice.

We present here an expectation-maximization algorithm for simultaneous truth and performance level estimation (STAPLE). The algorithm considers a collection of segmentations and computes a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. The probabilistic estimate of the true segmentation is formed by estimating an optimal combination of the segmentations, weighting each segmentation depending upon the estimated performance level, and incorporating a prior model for the spatial distribution of structures being segmented as well as spatial homogeneity constraints. STAPLE is straightforward to apply to clinical imaging data, readily enables assessment of the performance of an automated image segmentation algorithm, and enables direct comparison of human rater and algorithm performance.

Correspondence to: Simon K. Warfield. This work was supported in part by the Whitaker Foundation, in part by the National Institutes of Health (NIH) under Grants R21 MH67054, R01 LM007861, P41 RR13218, P01 CA67165, R01 AG19513, R01 CA86879, R01 NS35142, R33 CA99015, and R21 CA89449, and in part by an award from the Center for Integration of Medicine and Innovative Technology. The Associate Editor responsible for coordinating the review of this paper and recommending its publication was M. A. Viergever.
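As a rough illustration of the approach in the abstract above, the following is a minimal EM sketch for combining binary segmentations in plain Python. It is not the published STAPLE implementation: it omits the spatial prior model and homogeneity constraints the abstract mentions, uses a flat scalar prior, and the initialization and iteration count are arbitrary choices for illustration.

```python
# Illustrative STAPLE-style EM for binary segmentations (pure Python).
# D[j][i] is rater j's label (0 or 1) for voxel i.

def staple(D, prior=0.5, iters=25):
    R, N = len(D), len(D[0])
    # Initialize the true-segmentation probabilities with the voxel-wise mean.
    W = [sum(D[j][i] for j in range(R)) / R for i in range(N)]
    p = [0.9] * R  # per-rater sensitivity estimates
    q = [0.9] * R  # per-rater specificity estimates
    for _ in range(iters):
        # M-step: re-estimate each rater's sensitivity and specificity
        # against the current soft truth estimate W.
        den_fg = sum(W)
        den_bg = N - den_fg
        for j in range(R):
            num_p = sum(W[i] for i in range(N) if D[j][i] == 1)
            num_q = sum(1 - W[i] for i in range(N) if D[j][i] == 0)
            p[j] = num_p / den_fg if den_fg else 0.5
            q[j] = num_q / den_bg if den_bg else 0.5
        # E-step: posterior probability that each voxel is foreground,
        # weighting each rater's vote by its estimated performance.
        for i in range(N):
            a, b = prior, 1 - prior
            for j in range(R):
                a *= p[j] if D[j][i] == 1 else 1 - p[j]
                b *= 1 - q[j] if D[j][i] == 1 else q[j]
            W[i] = a / (a + b) if a + b else 0.5
    return W, p, q
```

On toy data where three raters agree on most voxels, the estimate converges to near-certain labels at the consensus voxels while scoring each rater's sensitivity and specificity.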
The DSC value is a simple and useful summary measure of spatial overlap, which can be applied to studies of reproducibility and accuracy in image segmentation. We observed generally satisfactory but variable validation results in two clinical applications. This metric may be adapted for similar validation tasks.
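For reference, the DSC summarized above is twice the overlap of two binary masks divided by the sum of their sizes. A minimal sketch (the example masks are made up):

```python
def dice(a, b):
    """Dice similarity coefficient for two binary masks (flat 0/1 sequences)."""
    inter = sum(x and y for x, y in zip(a, b))  # voxels labeled 1 in both masks
    size = sum(a) + sum(b)
    return 2 * inter / size if size else 1.0  # two empty masks agree perfectly

# Two masks of size 3 sharing 2 voxels: DSC = 2*2 / (3+3) = 0.667
print(dice([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

A DSC of 1 indicates identical masks and 0 indicates no overlap, which is what makes it convenient for both reproducibility (rater vs. rater) and accuracy (rater vs. reference) comparisons.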
Receiver operating characteristic (ROC) analysis was originally developed during World War II to analyze classification accuracy in differentiating signal from noise in radar detection.1 Recently, the methodology has been adapted to several clinical areas heavily dependent on screening and diagnostic tests,2-4 in particular laboratory testing,5 epidemiology,6 radiology,7-9 and bioinformatics.10 ROC analysis is a useful tool for evaluating the performance of diagnostic tests and, more generally, for evaluating the accuracy of a statistical model (eg, logistic regression, linear discriminant analysis) that classifies subjects into 1 of 2 categories, diseased or nondiseased. Its function as a simple graphical tool for displaying the accuracy of a medical diagnostic test is one of the most well-known applications of ROC curve analysis. In Circulation from January 1, 1995, through December 5, 2005, 309 articles were published with the key phrase "receiver operating characteristic." In cardiology, diagnostic testing plays a fundamental role in clinical practice (eg, serum markers of myocardial necrosis, cardiac imaging tests). Predictive modeling to estimate expected outcomes such as mortality or adverse cardiac events based on patient risk characteristics also is common in cardiovascular research. ROC analysis is a useful tool in both of these situations.

In this article, we begin by reviewing the measures of accuracy (sensitivity, specificity, and area under the curve [AUC]) that use the ROC curve. We also illustrate how these measures can be applied using the evaluation of a hypothetical new diagnostic test as an example.

Diagnostic Test and Predictive Model

A diagnostic classification test typically yields binary, ordinal, or continuous outcomes. The simplest type, binary outcomes, arises from a screening test indicating whether the patient is nondiseased (Dx=0) or diseased (Dx=1); the test result indicates whether the patient is likely to be diseased or not.
When >2 categories are used, the test data can be on an ordinal rating scale; eg, echocardiographic grading of mitral regurgitation uses a 5-point ordinal (0, 1+, 2+, 3+, 4+) scale for disease severity. When a particular cutoff level or threshold is of particular interest, an ordinal scale may be dichotomized (eg, mitral regurgitation ≤2+ and >2+), in which case methods for binary outcomes can be used.7 Test data such as serum markers (brain natriuretic peptide11) or physiological markers (coronary lumen diameter,12 peak oxygen consumption13) also may be acquired on a continuous scale.
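Dichotomizing an ordinal scale at a cutoff, as described above, reduces the analysis to binary sensitivity and specificity. A small sketch with hypothetical severity grades and disease labels (the data and threshold are illustrative, not taken from the article):

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when the test is positive for score > threshold.

    scores: ordinal or continuous test results; labels: 1 = diseased, 0 = not.
    """
    tp = sum(1 for s, d in zip(scores, labels) if s > threshold and d)
    fn = sum(1 for s, d in zip(scores, labels) if s <= threshold and d)
    tn = sum(1 for s, d in zip(scores, labels) if s <= threshold and not d)
    fp = sum(1 for s, d in zip(scores, labels) if s > threshold and not d)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 0-4 severity grades dichotomized at ≤2 vs >2.
grades  = [0, 1, 2, 3, 4, 2, 3, 1]
disease = [0, 0, 0, 1, 1, 1, 1, 0]
se, sp = sens_spec(grades, disease, 2)
```

Moving the threshold trades sensitivity against specificity, which is exactly the trade-off an ROC curve traces out.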
Receiver operating characteristic (ROC) curves are frequently used in biomedical informatics research to evaluate classification and prediction models for decision support, diagnosis, and prognosis. ROC analysis investigates the accuracy of a model's ability to separate positive from negative cases (such as predicting the presence or absence of disease), and the results are independent of the prevalence of positive cases in the study population. It is especially useful in evaluating predictive models or other tests that produce output values over a continuous range, since it captures the trade-off between sensitivity and specificity over that range. There are many ways to conduct an ROC analysis. The best approach depends on the experiment; an inappropriate approach can easily lead to incorrect conclusions. In this article, we review the basic concepts of ROC analysis, illustrate their use with sample calculations, make recommendations drawn from the literature, and list readily available software.
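One standard way to compute the AUC for a continuous-output model like those discussed above is the rank-based (Mann-Whitney) formulation, which also makes the prevalence-independence visible: only pairwise comparisons between positive and negative cases enter, not their relative counts. A minimal sketch with made-up scores:

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative one (ties count 1/2)."""
    pos = [s for s, d in zip(scores, labels) if d]
    neg = [s for s, d in zip(scores, labels) if not d]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up model outputs: 8 of the 9 positive/negative pairs are ranked
# correctly, so the AUC is 8/9.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two classes.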
In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman rho, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography-guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies.
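The two coefficients and the regression fit reviewed above can be sketched in a few lines of plain Python. This is a simplified illustration (the ranking step assumes no tied values, and the example data are invented, not from the article's CT-guided data set):

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the *linear* relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    return pearson(ranks(x), ranks(y))

def linreg(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1
```

The difference between the two coefficients shows up on monotone but nonlinear data such as y = x²: Spearman rho is exactly 1 (the ranks agree perfectly) while Pearson r falls below 1.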
A multichannel statistical classifier for detecting prostate cancer was developed and validated by combining information from three different magnetic resonance (MR) methodologies: T2-weighted, T2-mapping, and line scan diffusion imaging (LSDI). From these MR sequences, four different sets of image intensities were obtained: T2-weighted (T2W) from T2-weighted imaging, apparent diffusion coefficient (ADC) from LSDI, and proton density (PD) and T2 (T2 Map) from T2-mapping imaging. Manually segmented tumor labels from a radiologist, validated by biopsy results, served as tumor "ground truth." Textural features were extracted from the images using the co-occurrence matrix (CM) and the discrete cosine transform (DCT). The anatomical location of voxels was described by a cylindrical coordinate system. A statistical jackknife approach was used to evaluate our classifiers. Single-channel maximum likelihood (ML) classifiers were based on one of the four basic image intensities. Our multichannel classifiers, a support vector machine (SVM) and a Fisher linear discriminant (FLD), utilized five different sets of derived features. Each classifier generated a summary statistical map that indicated tumor likelihood in the peripheral zone (PZ) of the prostate gland. To assess classifier accuracy, the average areas under the receiver operating characteristic (ROC) curves over all subjects were compared. Our best FLD classifier achieved an average ROC area of 0.839 (±0.064), and our best SVM classifier achieved an average ROC area of 0.761 (±0.043). The T2W ML classifier, our best single-channel classifier, achieved an average ROC area of only 0.599 (±0.146). Compared with the best single-channel ML classifier, our best multichannel FLD and SVM classifiers had statistically superior ROC performance (P=0.0003 and P=0.0017, respectively) by pairwise two-sided t-tests.
By integrating the information from multiple images and capturing the textural and anatomical features in tumor areas, summary statistical maps can potentially aid in image-guided prostate biopsy and assist in guiding and controlling delivery of localized therapy under image guidance.
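The pairwise two-sided t-test used above to compare per-subject ROC areas between classifiers reduces to a one-sample t statistic on the paired differences. A minimal sketch with invented per-subject areas (the numbers are hypothetical, and turning the statistic into a p-value additionally requires the t distribution, eg via scipy.stats.ttest_rel):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for matched per-subject scores.

    Equivalent to a one-sample t-test on the differences a - b;
    large |t| indicates a systematic difference between the methods.
    """
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical per-subject ROC areas for two classifiers on the same subjects.
fld = [0.84, 0.80, 0.86, 0.79, 0.83]
ml  = [0.60, 0.58, 0.65, 0.55, 0.62]
t = paired_t(fld, ml)
```

Pairing the areas by subject removes between-subject variability from the comparison, which is why the test is sensitive even with few subjects.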
CT and MR imaging are equally accurate, and either modality can be used to stage advanced ovarian cancer.
Evidence from medication use in the real-world setting can help to extrapolate and/or augment data obtained in randomized controlled trials and to establish a broad picture of a medication's place in everyday clinical practice. By supplementing and complementing safety and efficacy data obtained in a narrowly defined (and often optimized) patient population in the clinical trial setting, real-world evidence (RWE) may provide stakeholders with valuable information about the safety and effectiveness of a medication in large, heterogeneous populations. RWE is emerging as a credible information source; however, there is scope to enhance real-world data (RWD) sources by understanding their complexities and applying the most appropriate analytical tools to extract relevant information. In addition to informing clinicians, RWE has the potential to meet the burden of evidence for regulatory considerations and may be used in the approval of new indications for medications. Further understanding of RWD collection and analysis is needed if RWE is to achieve its full potential.