UNETR: Transformers for 3D Medical Image Segmentation

Hatamizadeh, Ali; Tang, Yucheng; Nath, Vishwesh; Yang, Dong; Myronenko, Andriy; Landman, Bennett A.; Roth, Holger R.; Xu, Daguang

doi:10.1109/wacv51458.2022.00181

Cited by 875 publications

(520 citation statements)

References 20 publications

Supporting

Mentioning

514

Contrasting

Unclassified


“…In TransUNet [1], convolutional layer was used as a feature extractor to obtain detailed information from raw images; it then generated feature maps which are put into Transformer layer to obtain global information. UNETR [49] proposed a 3D Transformer-combining architecture for medical images, which treated Transformer layer as encoder to extract features and convolutional layer as decoder. A great amount of such work focused on taking advantage of both Transformer's long-range dependency and CNN's inductive bias.…”

Section: Transformers For Segmentation Tasks (mentioning)

confidence: 99%
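The statement above describes UNETR as using a Transformer as the encoder. A Transformer operates on a sequence of tokens, so a 3D volume first has to be cut into patches and flattened. The following is a minimal sketch of that tokenization step only, under stated assumptions: the function name `patchify_3d`, the toy 4×4×4 volume, and the patch size of 2 are hypothetical choices for illustration and are not taken from the UNETR paper. The learned linear projection, position embeddings, and attention layers are omitted.

```python
from itertools import product


def patchify_3d(volume, patch):
    """Split a D x H x W nested-list volume into flattened,
    non-overlapping cubic patches (one flat list per patch)."""
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    if d % patch or h % patch or w % patch:
        raise ValueError("each dimension must be divisible by the patch size")
    tokens = []
    # Walk over the corner of every patch in depth, height, width order.
    for z0, y0, x0 in product(range(0, d, patch),
                              range(0, h, patch),
                              range(0, w, patch)):
        tokens.append([
            volume[z][y][x]
            for z in range(z0, z0 + patch)
            for y in range(y0, y0 + patch)
            for x in range(x0, x0 + patch)
        ])
    return tokens


# Toy volume (assumption): each voxel value encodes its own position.
volume = [[[z * 16 + y * 4 + x for x in range(4)]
           for y in range(4)]
          for z in range(4)]

tokens = patchify_3d(volume, patch=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

Each flat list would then be projected to an embedding vector before entering the encoder; in a real pipeline this is typically done with tensor reshapes rather than Python loops.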

“…In Table 3, we compare the numbers of parameters and floating point operations (FLOPs) of our proposed D-Former with those of different 3D medical image segmentation models, including UNETR [49], CoTr [50], TransBTS [27], and nnFormer [42]. The number of FLOPs is calculated based on the input image size of 64×128×128 for fair comparison.…”

Section: Comparison Of Model Complexity (mentioning)

confidence: 99%

“…Recently, Transformers have achieved excellent outcomes on a variety of vision tasks [13-15, 42, 45, 46], including image recognition [15, 45-48], semantic segmentation [42], and object detection [13, 14]. On semantic medical image segmentation, Transformer-combined architectures can be divided into two categories: the main one adopts self-attention like operations to complement CNNs [1, 49-51]; the other uses pure Transformers to constitute encoder-decoder architectures so as to capture deep representations and predict the class of each image pixel [42-44, 53].…”

Section: Introduction (mentioning)

confidence: 99%


D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation

Wu, Liao, Chen et al. 2022. Preprint.

Computer-aided medical image segmentation has been applied widely in diagnosis and treatment to obtain clinically useful information of shapes and volumes of target organs and tissues. In the past several years, convolutional neural network (CNN) based methods (e.g., U-Net) have dominated this area, but still suffered from inadequate long-range information capturing. Hence, recent work presented computer vision Transformer variants for medical image segmentation tasks and obtained promising performances. Such Transformers model long-range dependency by computing pair-wise patch relations. However, they incur prohibitive computational costs, especially on 3D medical


“…For 3D medical image segmentation, Xie et al [28] proposed a model comprising a CNN backbone to extract features, a Transformer to model long-range dependencies, and a CNN decoder to construct the segmentation map. More recently, Hatamizadeh et al [29] proposed UNETR, which utilizes ViT as the main encoder but directly connects it to the convolutional decoder via skip connections, as opposed to using a Transformer only in the bridge. Since self-attention is prohibitively expensive on long sequences, all these models apply Transformer on a low-resolution level after either patch embedding or a CNN backbone, making them fail to fully exploit the global context at the higher resolutions.…”

Section: Related Work (mentioning)

confidence: 99%
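The statement above notes that self-attention is prohibitively expensive on long sequences, which is why these models apply it only at reduced resolution. A rough sketch of the arithmetic follows, counting only the number of pairwise token interactions in one global self-attention layer. The helper name `attention_pairs`, the 64×128×128 volume, and the reduction factors are assumptions chosen for illustration; real cost also depends on embedding width, number of heads, and implementation.

```python
def attention_pairs(shape, downsample):
    """Return (token count, pairwise interactions) for global
    self-attention after reducing each axis by `downsample`."""
    tokens = 1
    for size in shape:
        tokens *= size // downsample
    # Every token attends to every token, hence the square.
    return tokens, tokens * tokens


shape = (64, 128, 128)  # illustrative volume size (assumption)
for factor in (16, 8, 4):
    n, pairs = attention_pairs(shape, factor)
    print(f"reduction {factor:>2}x per axis: {n:>6} tokens, {pairs:>12} pairs")
```

Halving the reduction factor multiplies the token count by 8 in 3D and the number of pairs by 64, which illustrates why attention at higher resolutions quickly becomes impractical.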

Factorizer: A Scalable Interpretable Approach to Context Modeling for Medical Image Segmentation

Ashtari, Sima, Lathauwer et al. 2022. Preprint.

Convolutional Neural Networks (CNNs) with U-shaped architectures have dominated medical image segmentation, which is crucial for various clinical purposes. However, the inherent locality of convolution makes CNNs fail to fully exploit global context, essential for better recognition of some structures, e.g., brain lesions. Transformers have recently proved promising performance on vision tasks, including semantic segmentation, mainly due to their capability of modeling long-range dependencies. Nevertheless, the quadratic complexity of attention makes existing Transformer-based models use self-attention layers only after somehow reducing the image resolution, which limits the ability to capture global contexts present at higher resolutions. Therefore, this work introduces a family of models, dubbed Factorizer, which leverages the power of low-rank matrix factorization for constructing an end-to-end segmentation model. Specifically, we propose a linearly scalable approach to context modeling, formulating Nonnegative Matrix Factorization (NMF) as a differentiable layer integrated into a U-shaped architecture. The shifted window technique is also utilized in combination with NMF to effectively aggregate local information. Factorizers compete favorably with CNNs and Transformers in terms of accuracy, scalability, and interpretability, achieving state-of-the-art results on the BraTS dataset for brain tumor segmentation, with Dice scores of 79.33%, 83.14%, and 90.16% for enhancing tumor, tumor core, and whole tumor, respectively. Highly meaningful NMF components give an additional interpretability advantage to Factorizers over CNNs and Transformers. Moreover, our ablation studies reveal a distinctive feature of Factorizers that enables a significant speed-up in inference for a trained Factorizer without any extra steps and without sacrificing much accuracy.
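The abstract above reports results as Dice scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name `dice_score`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; this is not the evaluation code used for the reported BraTS numbers.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2 * intersection / total


# Toy masks (assumption): 1 marks a voxel labelled as the structure.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
print(f"Dice = {dice_score(pred, target):.4f}")
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no voxels.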


“…Image segmentation is an important part of medical image analysis. In particular, accurate and robust medical image segmentation can play a cornerstone role in computer-aided diagnosis and image-guided clinical surgery [1,2].…”

Section: Introduction (mentioning)

confidence: 99%

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Cao, Wang, Chen et al. 2021. Preprint.

In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of convolution operation. In this paper, we propose Swin-Unet, which is a Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct downsampling and up-sampling of the inputs and outputs by 4×, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.
