The purpose of this work was to characterize expert variation in the segmentation of intracranial structures pertinent to radiation therapy, and to assess a registration-driven atlas-based segmentation algorithm in that context. Eight experts were recruited to segment the brainstem, optic chiasm, optic nerves, and eyes of 20 patients who underwent therapy for large space-occupying tumors. Performance variability was assessed through three geometric measures: volume, Dice similarity coefficient (DSC), and Euclidean distance. In addition, two simulated ground-truth segmentations were calculated via the simultaneous truth and performance level estimation (STAPLE) algorithm and a novel application of probability maps. The experts and the automatic system were found to generate structures of similar volume, though the experts exhibited higher variation with respect to tubular structures. No difference was found between the mean DSC of the automatic and expert delineations as a group at a 5% significance level over all cases and organs. The larger structures of the brainstem and eyes exhibited a mean DSC of approximately 0.8–0.9, whereas the tubular chiasm and nerves were lower, at approximately 0.4–0.5. Similarly low DSC values have been reported previously, but without the context of multiple experts and patient volumes; this study, however, provides evidence that experts are similarly challenged. The average maximum distances (maximum inside, maximum outside) from a simulated ground truth ranged from (−4.3, +5.4) mm for the automatic system to (−3.9, +7.5) mm for the experts considered as a group. In a ranking of true positive rates at a 2 mm threshold from the simulated ground truth over all structures, the automatic system ranked second of the nine raters. This work underscores the need for large-scale studies utilizing statistically robust numbers of patients and experts when evaluating the quality of automatic algorithms.
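The Dice similarity coefficient used throughout the abstract above is defined as DSC = 2|A ∩ B| / (|A| + |B|) for two segmentations A and B. As a minimal illustrative sketch (not the study's implementation, which operates on full 3D image volumes), the metric can be computed on segmentations represented as sets of voxel indices:

```python
def dice_coefficient(a, b):
    """Dice similarity coefficient between two voxel sets.

    DSC = 2|A ∩ B| / (|A| + |B|); 1.0 means perfect overlap, 0.0 none.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty segmentations agree trivially
    return 2.0 * len(a & b) / (len(a) + len(b))


# Two toy "segmentations" as sets of 2D voxel indices (hypothetical data)
expert = {(0, 0), (0, 1), (1, 0), (1, 1)}
auto = {(0, 1), (1, 0), (1, 1), (2, 1)}
print(dice_coefficient(expert, auto))  # 0.75
```

Small, thin structures such as the optic chiasm and nerves have little volume relative to their surface area, which is one reason DSC values for tubular structures tend to be much lower than for larger structures like the brainstem, even for careful raters.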
Identification of error in non-rigid registration is a critical problem in the medical image processing community. We recently proposed an algorithm that we call “Assessing Quality Using Image Registration Circuits” (AQUIRC) to identify non-rigid registration errors and have tested its performance using simulated cases. In this article, we extend our previous work to assess AQUIRC’s ability to detect local non-rigid registration errors and validate it quantitatively at specific clinical landmarks, namely the Anterior Commissure (AC) and the Posterior Commissure (PC). To test our approach on a representative range of error, we utilize five different registration methods, 100 target images, and 9 atlas images. Our results show that AQUIRC’s measure of registration quality correlates with the true target registration error (TRE) at these selected landmarks with an R² = 0.542. To compare our method to a more conventional approach, we compute the Local Normalized Correlation Coefficient (LNCC) and show that AQUIRC performs similarly. However, a multi-linear regression performed with both AQUIRC’s measure and LNCC shows a higher correlation with TRE than correlations obtained with either measure alone, demonstrating the complementarity of these quality measures. We conclude the article by showing that the AQUIRC algorithm can be used to reduce registration errors for all five registration methods.
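The LNCC measure referenced above evaluates the normalized correlation coefficient within a local window around each voxel of the registered image pair. As a hedged sketch of the underlying quantity (window extraction and the paper's specific parameters are omitted), the correlation on a single pair of equal-size patches can be computed as:

```python
import math


def ncc(patch_a, patch_b):
    """Normalized correlation coefficient between two equal-size patches.

    Returns a value in [-1, 1]; 1.0 indicates a perfect positive linear
    relationship between the intensities, suggesting good local alignment.
    """
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(patch_a, patch_b))
    var_a = sum((x - mean_a) ** 2 for x in patch_a)
    var_b = sum((y - mean_b) ** 2 for y in patch_b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical flattened intensity patches: affinely related, so NCC = 1.0
print(ncc([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0
```

Because NCC is invariant to affine intensity changes, it is robust to local brightness and contrast differences between the target and the deformed atlas, which is why it is a common baseline for local registration quality.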
Image segmentation has become a vital and often rate-limiting step in modern radiotherapy treatment planning. In recent years, the pace and scope of algorithm development, and even introduction into the clinic, have far exceeded evaluative studies. In this work we build upon our previous evaluation of a registration-driven segmentation algorithm in the context of 8 expert raters and 20 patients who underwent radiotherapy for large space-occupying tumors in the brain. We tested four hypotheses concerning the impact of manual segmentation editing in a randomized, single-blinded study. We tested these hypotheses on the normal structures of the brainstem, optic chiasm, eyes, and optic nerves, using the Dice similarity coefficient, volume, and signed Euclidean distance error to evaluate the impact of editing on inter-rater variance and accuracy. Accuracy analyses relied on two simulated ground-truth estimation methods: STAPLE and a novel implementation of probability maps. The experts were presented with automatic segmentations, their own, and their peers’ segmentations from our previous study to edit. We found that, independent of source, editing reduced inter-rater variance while maintaining or improving accuracy, and improved efficiency, with at least a 60% reduction in contouring time. In areas where raters performed poorly when contouring from scratch, editing of the automatic segmentations reduced the prevalence of total anatomical miss from approximately 16% to 8% of the total slices contained within the ground-truth estimations. These findings suggest that contour editing could be useful for consensus building, such as in developing delineation standards, and that automated methods, and perhaps even less sophisticated atlases, could improve efficiency, inter-rater variance, and accuracy.
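The probability-map ground-truth estimation mentioned above is not specified in the abstract; a simplified, assumption-laden cousin of such consensus methods is voxel-wise averaging of the raters' binary masks followed by thresholding (a majority vote at threshold 0.5). A minimal sketch:

```python
def probability_map(masks):
    """Voxel-wise rater agreement: fraction of raters labeling each voxel.

    `masks` is a list of equal-length flattened binary masks (0/1 per voxel).
    """
    n = len(masks)
    return [sum(column) / n for column in zip(*masks)]


def consensus(masks, threshold=0.5):
    """Binary consensus mask: voxels where more than `threshold` of raters agree."""
    return [p > threshold for p in probability_map(masks)]


# Three hypothetical raters' flattened binary masks over 5 voxels
masks = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
print(consensus(masks))  # [True, True, True, False, False]
```

Unlike this simple vote, STAPLE iteratively estimates each rater's sensitivity and specificity and weights their contributions accordingly, which is why the study reports both ground-truth estimates rather than relying on either alone.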
Deep brain stimulation, which is used to treat various neurological disorders, involves implanting a permanent electrode into precise targets deep in the brain. Reaching these targets safely is difficult because surgeons have to plan trajectories that avoid critical structures and reach targets within specific angles. A number of systems have been proposed to assist surgeons in this task. These typically involve formulating constraints as cost terms, weighting them by surgical importance, and searching for optimal trajectories, where the constraints and their weights reflect local practice. Assessing the performance of such systems is challenging because of the lack of ground truth and the lack of clear consensus on an optimal approach among surgeons. Due to the difficulty of coordinating inter-institution evaluation studies, these have so far been performed only at the sites at which the systems were developed. Whether a scheme developed at one site can also be used at another is thus unknown. In this article, we conduct a study involving four surgeons at three institutions to determine whether constraints and their associated weights can be used across institutions. Through a series of experiments, we show that a single set of weights performs well for all surgeons in our group. Out of 60 trajectories, 95% were accepted by a majority of the neurosurgeons, and the average acceptance rate was 90%. This study suggests, albeit with a limited number of surgeons, that the same system can be used to provide assistance across multiple sites and surgeons.
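The weighted-cost trajectory search described above can be sketched as follows. This is an illustrative toy, not the study's planner: the candidate representation, the two constraint terms (`angle_penalty`, `vessel_penalty`), and the weights are all hypothetical stand-ins for the site-specific cost terms the abstract refers to.

```python
def plan_trajectory(candidates, cost_terms, weights):
    """Return the candidate with the lowest weighted sum of constraint costs.

    Each cost term maps a candidate trajectory to a nonnegative penalty;
    weights encode the relative surgical importance of each constraint.
    """
    def total_cost(trajectory):
        return sum(w * f(trajectory) for f, w in zip(cost_terms, weights))
    return min(candidates, key=total_cost)


# Hypothetical candidates as (entry_angle_deg, vessel_distance_mm) tuples
angle_penalty = lambda t: abs(t[0] - 30) / 30           # prefer ~30-degree entry
vessel_penalty = lambda t: max(0.0, 3.0 - t[1]) / 3.0   # penalize < 3 mm clearance

candidates = [(25, 4.0), (30, 1.0), (45, 5.0)]
best = plan_trajectory(candidates, [angle_penalty, vessel_penalty], [1.0, 2.0])
print(best)  # (25, 4.0)
```

Note how the weight of 2.0 on vessel clearance rules out the candidate with a perfect entry angle but only 1 mm of clearance; the cross-institution question studied above is precisely whether one such set of weights transfers between surgeons and sites.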