2022
DOI: 10.48550/arxiv.2203.00585
Preprint
Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology

Abstract: Tissue phenotyping is a fundamental task in learning objective characterizations of histopathologic biomarkers within the tumor-immune microenvironment in cancer pathology. However, whole-slide imaging (WSI) is a complex computer vision domain in which: 1) WSIs have enormous image resolutions, which precludes large-scale pixel-level efforts in data curation, and 2) the diversity of morphological phenotypes results in inter- and intra-observer variability in tissue labeling. To address these limitations, current efforts have…

Cited by 13 publications (23 citation statements)
References 28 publications
“…We also evaluated the patch embedder performance of the ViT-S-16 and ViT-L-16 models when they are self-supervised by DINO 42 instead of pretrained on the ImageNet as used in Chen and Krishnan. 43 We further compared the performance of the above weakly supervised methods with non-weakly supervised methods by training a patch-level version of the patch embedders directly for classification using the slide-level labels as ground truth.…”
Section: Results
confidence: 99%
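The weakly supervised setup quoted above — frozen patch embedders feeding a slide-level classifier trained only on slide-level labels — is commonly realized with attention-based multiple-instance pooling. A minimal NumPy sketch of that aggregation step follows; the embedding dimension, weights, and single-head attention are illustrative assumptions, not the cited papers' actual architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(patch_embeddings, w_attn, w_cls):
    """Attention-based MIL pooling: softmax-weight the patch embeddings,
    then classify the pooled slide-level embedding (simplified, one head)."""
    scores = patch_embeddings @ w_attn            # (N,) one score per patch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over patches
    slide_embedding = weights @ patch_embeddings  # (D,) weighted average
    logits = slide_embedding @ w_cls              # (C,) slide-level logits
    return logits, weights

# Toy slide: 100 patches, 384-dim embeddings (the ViT-S-16 output size), 2 classes.
D, N, C = 384, 100, 2
patches = rng.normal(size=(N, D))
w_attn = rng.normal(size=D) * 0.01   # small init keeps the softmax well-behaved
w_cls = rng.normal(size=(D, C))
logits, weights = attention_pool(patches, w_attn, w_cls)
print(logits.shape, round(float(weights.sum()), 6))  # (2,) 1.0
```

In training, the attention weights are learned jointly with the classifier from the slide-level loss, so no pixel- or patch-level labels are needed.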
“… 10: 0.860∗ ± 0.023 / 0.745∗ ± 0.024
GNN (non-weakly-supervised), N/A, Jaume et al. 41: 0.832 ± 0.027 / 0.723 ± 0.025
ViT (ViT-S-16, DINO), ViT-WSI aggregator w/graph, Chen and Krishnan 43: 0.940 ± 0.004 / 0.870 ± 0.012
ViT (ViT-L-16, DINO), ViT-WSI aggregator w/graph: 0.942∗ ± 0.027 / 0.872 ± 0.031
Human Performance: Macro FPR = 0.0901 ± 0.078, Macro TPR = 0.9557 ± 0.042
11-class subtyping:
ResNet50 (non-weakly-supervised), N/A: 0.745 ± 0.022 / 0.414 ± 0.029
ResNet50, CLAM-MB, Lu et al. 15: 0.845 ± 0.021 / 0.536 ± 0.027
ResNet50, ViT-WSI aggregator w/graph: 0.873∗ ± 0.017 / 0.556∗ ± 0.031
ViT (ViT-L-16, non-weakly-supervised), N/A: 0.753 ± 0.027 / 0.425 ± 0.045
ViT (ViT-L-16), Max Pooling: 0.837 ± 0.019 / 0.480 ± 0.031
ViT (ViT-L-16), CLAM-MB: 0.860 ± 0.022 / 0.551 ± 0.041
ViT (ViT-L-16), ViT-WSI aggregator w/graph: 0.887∗ ± 0.024 / 0.563∗ ± 0.030
Inception v3 (non-weakly-supervised), N/A, Coudray et al.…”
Section: Results
confidence: 99%
“…Self-supervised learning (SSL) has proven very effective for label-efficient fine-tuning in natural image classification (Chen et al, 2020;He et al, 2020), video classification (Diba et al, 2021;Kuang et al, 2021), and now even medical image classification and segmentation tasks (Azizi et al, 2021;Taleb et al, 2020;Tang et al, 2021). However, most successful medical applications of SSL operate on 2D data such as histopathological images and radiographs (Chen & Krishnan, 2022;Wang et al, 2021). Some recent studies have developed SSL methods for 3D medical image data, though this has been applied to CT and MRI, where this third dimension is spatial, not temporal (Tang et al, 2021;Taleb et al, 2020).…”
Section: Related Work
confidence: 99%
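The contrastive SSL methods cited above (e.g. Chen et al., 2020) train encoders by pulling two augmented views of the same image together and pushing other images apart. A minimal NumPy sketch of the SimCLR-style NT-Xent loss illustrates the objective; the batch size, dimension, and perturbation-as-augmentation are illustrative assumptions, and real training would use an autograd framework:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss over B paired views (forward pass only)."""
    z = np.concatenate([z1, z2], axis=0)              # (2B, D) both views stacked
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize embeddings
    sim = z @ z.T / temperature                       # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    B = z1.shape[0]
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])  # each row's positive
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * B), pos]))

rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 32))
views_b = views_a + 0.01 * rng.normal(size=(8, 32))  # stand-in for augmented views
loss_aligned = nt_xent(views_a, views_b)
loss_random = nt_xent(views_a, rng.normal(size=(8, 32)))
print(loss_aligned < loss_random)  # aligned pairs yield the lower loss
```

Because the loss needs no labels, it scales to the large unannotated archives typical of histopathology, which is what makes SSL attractive for label-efficient fine-tuning.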
“…Moreover, vision transformers are more robust in multi-task learning [42]. For WSIs, vision transformers perform better in capturing fine-grained morphological features, such as cells and background tissue [43]. Visualization methods have filled a significant gap in neural network understanding in computer vision.…”
Section: Introduction
confidence: 99%