Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis

Karimi, Davood; Dou, Haoran; Warfield, Simon K.; Gholipour, Ali

doi:10.1016/j.media.2020.101759

Cited by 446 publications

(282 citation statements)

References 110 publications


“…Therefore, it can be assumed, that there is a certain amount of noise in the training data, which might affect the accuracy of the models trained on it. Implementing a human-in-the loop approach for partially correcting the label noise could further improve performance of networks trained on the CheXpert dataset [21]. Our findings differ from applied techniques used in previous literature, where deeper network architectures, mainly a DenseNet-121, were used to classify the CheXpert data set [6,9,22].…”

Section: Discussion (mentioning)

confidence: 99%

Comparing different deep learning architectures for classification of chest radiographs

Bressem, Adams, Erxleben et al. 2020, Sci Rep

Chest radiographs are among the most frequently acquired images in radiology and are often the subject of computer vision research. However, most of the models used to classify chest radiographs are derived from openly available deep neural networks, trained on large image datasets. These datasets differ from chest radiographs in that they are mostly color images and have substantially more labels. Therefore, very deep convolutional neural networks (CNN) designed for ImageNet and often representing more complex relationships, might not be required for the comparably simpler task of classifying medical image data. Sixteen different architectures of CNN were compared regarding the classification performance on two openly available datasets, the CheXpert and COVID-19 Image Data Collection. Areas under the receiver operating characteristics curves (AUROC) between 0.83 and 0.89 could be achieved on the CheXpert dataset. On the COVID-19 Image Data Collection, all models showed an excellent ability to detect COVID-19 and non-COVID pneumonia with AUROC values between 0.983 and 0.998. It could be observed, that more shallow networks may achieve results comparable to their deeper and more complex counterparts with shorter training times, enabling classification performances on medical image data close to the state-of-the-art methods even when using limited hardware.

Chest radiographs are among the most frequently used imaging procedures in radiology. They have been widely employed in the field of computer vision, as chest radiographs are a standardized technique and, if compared to other radiological examinations such as computed tomography or magnetic resonance imaging, contain a smaller group of relevant pathologies. Although many artificial neural networks for the classification of chest radiographs have been developed, it is still subject to intensive research.

Only a few groups design their own networks from scratch, while most use already established architectures, such as ResNet-50 or DenseNet-121 (with 50 and 121 representing the number of layers within the respective neural network) [1-6]. These neural networks have often been trained on large, openly available datasets, such as ImageNet, and are therefore already able to recognize numerous image features. When training a model for a new task, such as the classification of chest radiographs, the use of pre-trained networks may improve the training speed and accuracy of the new model, since important image features that have already been learned can be transferred to the new task and do not have to be learned again. However, the feature space of freely available data sets such as ImageNet differs from chest radiographs as they contain color images and more categories. The ImageNet Challenge includes 1,000 possible categories per image, while CheXpert, a large freely available data set of chest radiographs, only distinguishes between 14 categories (or classes) [7], and the COVID-19 Image Data Collection only differentiates between three classes [8]. Although the ImageNet...

“…Labelled-image databases are usually used for the training and testing of deep neural networks and were also applied in our research. The research community has recognized the importance of the impact of the label errors (label noise) in training datasets on the model accuracy and have introduced works attempting to understand noisy training labels [25].…”

Section: Related Work (mentioning)

confidence: 99%

Visual-Based Person Detection for Search-and-Rescue with UAS: Humans vs. Machine Learning Algorithm

et al. 2020


Unmanned Aircraft Systems (UASs) have been recognized as an important resource in search-and-rescue (SAR) missions and, as such, have been used by the Croatian Mountain Search and Rescue (CMRS) service for over seven years. The UAS scans and photographs the terrain. The high-resolution images are afterwards analyzed by SAR members to detect missing persons or to find some usable trace. It is a drawn out, tiresome process prone to human error. To facilitate and speed up mission image processing and increase detection accuracy, we have developed several image-processing algorithms. The latest are convolutional neural network (CNN)-based. CNNs were trained on a specially developed image database, named HERIDAL. Although these algorithms achieve excellent recall, the efficiency of the algorithm in actual SAR missions and its comparison with expert detection must be investigated. A series of mission simulations are planned and recorded for this purpose. They are processed and labelled by a developed algorithm. A web application was developed by which experts analyzed raw and processed mission images. The algorithm achieved better recall compared to an expert, but the experts achieved better accuracy when they analyzed images that were already processed and labelled.


“…Dealing with label noise in the context of supervised machine learning is a well-known issue that challenges researchers since the early developments of classifiers, as detailed in [1]. A vast corpus of techniques has been developed, most of which are listed in a recent exhaustive survey [7], where even the latest methods related to general deep learning algorithms are indexed. Interestingly, in [7], authors propose to classify techniques for handling label noise into six (possibly overlapping) categories.…”

Section: Related Work (mentioning)

confidence: 99%

“…None of the reconstructions based on the feature vector of the "true class 0" matches the initial image. Then, the corresponding reconstructions from feature vector "5" (5,6,7). When ghost feature vectors are allowed, the network is able to reconstruct the initial image from the feature vector "5", which corresponds to a non-true class, while it is not the case when no ghost feature vector is allowed.…”

Section: B Proof Of Concept (mentioning)

confidence: 99%

Ghost Loss to Question the Reliability of Training Data

Deliège, Cioppa, Droogenbroeck 2020, IEEE Access


Supervised image classification problems rely on training data assumed to have been correctly annotated; this assumption underpins most works in the field of deep learning. In consequence, during its training, a network is forced to match the label provided by the annotator and is not given the flexibility to choose an alternative to inconsistencies that it might be able to detect. Therefore, erroneously labeled training images may end up "correctly" classified in classes which they do not actually belong to. This may reduce the performances of the network and thus incite to build more complex networks without even checking the quality of the training data. In this work, we question the reliability of the annotated datasets. For that purpose, we introduce the notion of ghost loss, which can be seen as a regular loss that is zeroed out for some predicted values in a deterministic way and that allows the network to choose an alternative to the given label without being penalized. After a proof of concept experiment, we use the ghost loss principle to detect confusing images and erroneously labeled images in well-known training datasets (MNIST, Fashion-MNIST, SVHN, CIFAR10) and we provide a new tool, called sanity matrix, for summarizing these confusions.
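
The idea of a loss that is "zeroed out for some predicted values in a deterministic way" can be illustrated with a small sketch. This is not the authors' formulation: it assumes a per-class squared-error loss and assumes the exempted "ghost" term is the highest-scoring class other than the annotated one. The function name `ghost_loss` and its arguments are hypothetical.

```python
def ghost_loss(scores, true_class):
    """Per-class squared-error loss with one 'ghost' term zeroed out.

    scores: list of per-class predictions in [0, 1] (at least two classes).
    true_class: index of the annotated label.
    Returns (loss, ghost_index).
    """
    n = len(scores)
    # Targets: 1 for the annotated class, 0 for every other class.
    targets = [1.0 if k == true_class else 0.0 for k in range(n)]
    # Assumption: the ghost is the highest-scoring non-annotated class.
    # max() returns the first maximum, so ties are resolved deterministically.
    ghost = max((k for k in range(n) if k != true_class),
                key=lambda k: scores[k])
    # Sum the squared errors, skipping the ghost class so that the network
    # is not penalized for favoring that alternative.
    loss = sum((scores[k] - targets[k]) ** 2 for k in range(n) if k != ghost)
    return loss, ghost


# Usage: class 2 is annotated, but the model prefers class 1.
loss, ghost = ghost_loss([0.1, 0.8, 0.3], true_class=2)
print(ghost, round(loss, 4))
```

Under these assumptions, an image whose ghost class repeatedly receives a high score during training would be a candidate for a confusing or mislabeled example.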
