Purpose
Contouring organs at risk remains a largely manual task, which is time-consuming and prone to variation. Deep learning-based delineation (DLD) shows promise in terms of both quality and speed, but it does not yet perform perfectly. Manual checking of DLD is therefore still recommended. There are currently no commercial tools that focus attention on the areas of greatest uncertainty within a DLD contour. We therefore explore the use of spatial probability maps (SPMs) to improve the efficiency and reproducibility of DLD checking and correction, using the salivary glands as the paradigm.
Methods and Materials
A 3-dimensional fully convolutional network was trained on 315 parotid and 264 submandibular glands. Subsequently, SPMs were created using Monte Carlo dropout (MCD). The method was boosted by placing a Gaussian distribution (GD) over the model's parameters during sampling (MCD + GD). MCD and MCD + GD were compared quantitatively, and the SPMs were visually inspected.
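The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy per-voxel logistic classifier, the function names, the dropout rate, the noise scale `weight_sigma`, and the definition of the SPM as the fraction of samples that label a voxel as foreground are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stochastic_pass(features, weights, rng, drop_rate, weight_sigma):
    """One stochastic forward pass of a toy per-voxel classifier (hypothetical model)."""
    # Monte Carlo dropout: dropout stays active at inference time.
    keep = rng.random(features.shape) >= drop_rate
    dropped = features * keep / (1.0 - drop_rate)
    # Assumed form of the "+ GD" variant: Gaussian noise on the parameters.
    noisy_weights = weights + rng.normal(0.0, weight_sigma, size=weights.shape)
    return sigmoid(dropped @ noisy_weights)


def spatial_probability_map(features, weights, n_samples=50,
                            drop_rate=0.5, weight_sigma=0.0, seed=0):
    """Fraction of stochastic passes in which each voxel is labelled foreground."""
    rng = np.random.default_rng(seed)
    votes = [
        stochastic_pass(features, weights, rng, drop_rate, weight_sigma) > 0.5
        for _ in range(n_samples)
    ]
    return np.mean(votes, axis=0)


# Toy volume of 4 x 4 x 4 voxels with 3 features per voxel (synthetic data).
features = np.random.default_rng(1).normal(size=(4, 4, 4, 3))
weights = np.array([1.0, -0.5, 0.25])

spm_mcd = spatial_probability_map(features, weights)                       # MCD only
spm_mcd_gd = spatial_probability_map(features, weights, weight_sigma=0.1)  # MCD + GD
```

Voxels whose SPM value is close to 0 or 1 are labelled consistently across samples; values near 0.5 mark the regions a reviewer might inspect first.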
Results
The addition of the GD appears to increase the method's ability to detect uncertainty. In general, this technique demonstrated uncertainty in areas that (1) have lower contrast, (2) are less consistently contoured by clinicians, and (3) deviate from the anatomic norm.
Conclusions
We believe the integration of uncertainty information into contours made using DLD is an important step in highlighting where a contour may be less reliable. We have shown how SPMs are one way to achieve this and how they may be integrated into the online adaptive radiation therapy workflow.
Background
Deep learning-based delineation of organs-at-risk for radiotherapy purposes has been investigated to reduce the time-intensiveness and inter-/intra-observer variability associated with manual delineation. We systematically evaluated ways to improve the performance and reliability of deep learning for organ-at-risk segmentation, with the salivary glands as the paradigm. Improving deep learning performance is clinically relevant, with applications ranging from the initial contouring process to on-line adaptive radiotherapy.
Methods
Seven experiments were designed. The amount of training data was increased (1) with original images, (2) with traditional data augmentation, and (3) with domain-specific data augmentation. (4) The influence of data quality was tested by comparing training and testing on clinical versus curated contours. (5) The effect of several custom cost functions was explored. (6) Patient-specific Hounsfield unit windowing was applied during inference. Lastly, (7) the effect of model ensembles was analyzed. Model performance was measured with geometric parameters, and model reliability with the variance of those parameters.
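Two of the inference-time strategies listed above, Hounsfield unit (HU) windowing and model ensembling, can be sketched as follows. The abstract does not specify how the patient-specific window was chosen or how ensemble members were combined, so the median-based window center, the default width, the averaging rule, and all function names here are assumptions for illustration only.

```python
import numpy as np


def window_hu(ct, center, width):
    """Clip a CT array to an HU window and rescale it to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(ct, lo, hi) - lo) / (hi - lo)


def patient_specific_window(ct, roi_mask, width=350.0):
    """Assumed rule: center the window on the median HU inside a region of interest."""
    center = float(np.median(ct[roi_mask]))
    return window_hu(ct, center, width)


def ensemble_segmentation(prob_maps, threshold=0.5):
    """Assumed rule: average the members' probability maps, then threshold."""
    return np.mean(prob_maps, axis=0) > threshold


# Toy example with synthetic values.
ct = np.array([[0.0, 50.0], [100.0, 2000.0]])
roi = np.array([[True, True], [True, False]])
windowed = patient_specific_window(ct, roi, width=200.0)

member_probs = [np.array([0.9, 0.2]), np.array([0.6, 0.4]), np.array([0.3, 0.3])]
consensus = ensemble_segmentation(member_probs)
```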
Results
A positive effect was observed from (1) increasing the training set size, (2/3) data augmentation, (6) patient-specific Hounsfield unit windowing, and (7) model ensembles. The effects of the strategies on performance diminished when the base model performance was already ‘high’. Combining all beneficial strategies increased the average Sørensen–Dice coefficient by about 4% and 3% and decreased its standard deviation by about 1% and 1% for the submandibular and parotid gland, respectively.
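For reference, the Sørensen–Dice coefficient between two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it and summarizes a set of per-patient scores by their mean (performance) and standard deviation (reliability). The function names and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the study.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Sørensen–Dice coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / total


def summarize(scores):
    """Mean and standard deviation of per-patient scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()


# Toy masks: one overlapping voxel, sizes 2 and 1, so Dice = 2 * 1 / (2 + 1).
example = dice([1, 1, 0, 0], [1, 0, 0, 0])
mean_dice, std_dice = summarize([0.8, 0.9, 1.0])
```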
Conclusions
A subset of the strategies that were investigated provided a positive effect on model performance and reliability. The clinical impact of such strategies would be an expected reduction in post-segmentation editing, which facilitates the adoption of deep learning for autonomous automated salivary gland segmentation.
Background
Clinical data used to train deep learning models are often not clean data. They can contain imperfections in both the imaging data and the corresponding segmentations.
Purpose
This study investigates the influence of data imperfections on the performance of deep learning models for parotid gland segmentation. This was done in a controlled manner by using synthesized data. The insights this study provides may be used to make deep learning models better and more reliable.
Methods
The data were synthesized by using the clinical segmentations, creating a pseudo ground-truth in the process. Three kinds of imperfections were simulated: incorrect segmentations, low image contrast, and artifacts in the imaging data. The severity of each imperfection was varied in five levels. Models resulting from training sets from each of the five levels were cross-evaluated with test sets from each of the five levels.
Results
Using synthesized data led to almost perfect parotid gland segmentation when no error was added. Lowering the quality of the parotid gland segmentations used for training substantially lowered the model performance. Additionally, lowering the image quality of the training data by decreasing the contrast or introducing artifacts made the resulting models more robust to data containing those respective kinds of data imperfection.
Conclusion
This study demonstrated the importance of good-quality segmentations for deep learning training and it shows that using low-quality imaging data for training can enhance the robustness of the resulting models.
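Two of the simulated imperfections, low image contrast and incorrect segmentations, could be graded in severity along the lines of the sketch below. The abstract does not describe how the imperfections were generated, so the linear contrast reduction toward the image mean, the voxel shift used as a stand-in for a contouring error, and all names here are assumptions for illustration.

```python
import numpy as np

N_LEVELS = 5  # the study varied each imperfection over five severity levels


def reduce_contrast(image, level, n_levels=N_LEVELS):
    """Shrink intensities toward the image mean; level 0 leaves the image unchanged."""
    alpha = 1.0 - level / n_levels
    mean = image.mean()
    return mean + alpha * (image - mean)


def perturb_segmentation(mask, level):
    """Shift the mask by `level` voxels along the first axis (wraps at the border)."""
    return np.roll(mask, shift=level, axis=0)


# Synthetic image and mask, degraded at each of the five levels (0 to 4).
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True

degraded_images = [reduce_contrast(image, level) for level in range(N_LEVELS)]
degraded_masks = [perturb_segmentation(mask, level) for level in range(N_LEVELS)]
```

A cross-evaluation in the spirit of the study would then train one model per training-set level and score each of them on every test-set level, giving a five-by-five grid of results per imperfection type.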