2022
DOI: 10.1109/tmi.2022.3167808
ResViT: Residual Vision Transformers for Multimodal Medical Image Synthesis

Abstract: Generative adversarial models with convolutional neural network (CNN) backbones have recently been established as state-of-the-art in numerous medical image synthesis tasks. However, CNNs are designed to perform local processing with compact filters, and this inductive bias compromises learning of contextual features. Here, we propose a novel generative adversarial approach for medical image synthesis, ResViT, that leverages the contextual sensitivity of vision transformers along with the precision of convolut…
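
The contrast the abstract draws between local convolutional filtering and the contextual sensitivity of self-attention can be made concrete with a short sketch. The tensor sizes and layer choices below are illustrative assumptions, not the ResViT implementation.

```python
# A minimal PyTorch sketch of the locality contrast described in the abstract:
# a compact convolutional filter mixes only a small neighborhood per output pixel,
# whereas a single self-attention layer over flattened positions lets every position
# attend to every other one. Shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)             # (batch, channels, height, width) feature map

# Local processing: a 3x3 convolution, receptive field limited to a 3x3 window
local = nn.Conv2d(64, 64, kernel_size=3, padding=1)(x)

# Contextual processing: flatten spatial positions into a token sequence and apply
# multi-head self-attention, so each of the 256 positions can weight all the others
tokens = x.flatten(2).transpose(1, 2)       # (1, 256, 64) tokens
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
context, _ = attn(tokens, tokens, tokens)   # global pairwise interactions in one layer

print(local.shape, context.shape)           # (1, 64, 16, 16) and (1, 256, 64)
```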

Cited by 208 publications (109 citation statements)
References 104 publications
“…DeepViT suggests establishing cross-head communication to regenerate attention maps and improve their diversity across levels. KVT introduces k-NN attention to exploit the locality of image patches and to disregard noisy tokens by computing attention only over the top-k most similar tokens [37]. Refiner investigates attention expansion in a higher-dimensional space and uses convolution to enrich the local patterns of the attention maps.…”
Section: Transformer Model
confidence: 99%
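
As a rough illustration of the top-k attention idea attributed to KVT in the statement above, the sketch below masks all but the k largest attention scores per query before the softmax. The shapes and the top_k value are assumptions for illustration, not KVT's published implementation.

```python
# A hedged sketch of k-NN (top-k) attention: for each query token, keep only the
# k most similar keys and mask out the rest before softmax, so noisy tokens are
# ignored. Dimensions and top_k are illustrative assumptions.
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k=8):
    # q, k, v: (batch, tokens, dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, tokens, tokens)
    kth = scores.topk(top_k, dim=-1).values[..., -1:]     # k-th largest score per query
    mask = scores < kth                                   # True where a key is outside the top-k
    scores = scores.masked_fill(mask, float("-inf"))      # drop non-neighbors from the softmax
    return F.softmax(scores, dim=-1) @ v                  # attend only over the k nearest tokens

q = k = v = torch.randn(1, 64, 32)
out = knn_attention(q, k, v)                              # (1, 64, 32)
```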
“…When multi-modality protocols are available, many-to-one translation can be performed to improve the reliability of image translation [25]. To do this, SynDiff can be modified to include multiple source modalities as conditioning inputs to its generators [26], [28], [29], [65]. The generators in SynDiff that perform the reverse diffusion steps for denoising were based on convolutional backbones.…”
Section: Discussion
confidence: 99%
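
A minimal sketch of the many-to-one conditioning described in the statement above, assuming the common practice of concatenating the available source modalities along the channel dimension before feeding them to a generator. The generator here is a placeholder module, not SynDiff's actual architecture.

```python
# A hedged sketch of many-to-one conditioning: the available source modalities
# (e.g., T1 and T2 slices) are concatenated channel-wise and passed to a generator
# that predicts the missing target modality. The generator below is a placeholder
# CNN, not SynDiff's adversarial diffusion generator.
import torch
import torch.nn as nn

class ManyToOneGenerator(nn.Module):
    def __init__(self, num_sources=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_sources, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),            # one target modality
        )

    def forward(self, sources):
        # sources: list of (batch, 1, H, W) tensors, one per available source modality
        return self.net(torch.cat(sources, dim=1))

t1 = torch.randn(2, 1, 128, 128)
t2 = torch.randn(2, 1, 128, 128)
pred_target = ManyToOneGenerator(num_sources=2)([t1, t2])   # (2, 1, 128, 128)
```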
“…The generators in SynDiff that perform the reverse diffusion steps for denoising were based on convolutional backbones. Recent imaging studies have reported that transformer architectures with attention mechanisms offer improved sensitivity to long-range context in medical images during synthesis and beyond [65]-[67]. The strength and importance of contextual representations for progressive denoising remain to be demonstrated.…”
Section: Discussion
confidence: 99%
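
The quoted passage refers to generators that carry out reverse diffusion steps for denoising. The sketch below shows a generic DDPM-style reverse step with a placeholder noise-prediction network, conditioned on a source image by channel concatenation; this is an illustrative assumption, not SynDiff's adversarial diffusion formulation.

```python
# A hedged sketch of one reverse diffusion (denoising) step in DDPM style,
# conditioned on a source-modality image by channel concatenation. The noise
# predictor eps_model is a placeholder; SynDiff's actual scheme and schedules differ.
import torch

def reverse_step(x_t, source, t, eps_model, betas):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps = eps_model(torch.cat([x_t, source], dim=1), t)             # predicted noise
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                                 # final step: no added noise
    noise = torch.randn_like(x_t)
    return mean + betas[t].sqrt() * noise                           # sample x_{t-1}
```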
“…However, this architecture contains both the encoder and decoder parts of the original Transformer, while ViT uses only its encoder. Dalmaz et al. [21] develop a generative adversarial model for medical image synthesis, named ResViT, that combines the localization power of convolutional operators with the contextual sensitivity of ViT.…”
Section: Introduction
confidence: 99%
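
To make the combination described in the statement above concrete, the following is a rough sketch of a residual block that pairs a convolutional path with a transformer-encoder path over the same feature map. It illustrates the general idea only and is not the published ResViT architecture.

```python
# A hedged sketch of a hybrid residual block: a convolutional branch captures local
# structure while a transformer-encoder branch over flattened spatial tokens captures
# long-range context, and both are added back to the input. Illustrative only; the
# aggregated residual transformer blocks in ResViT differ in detail.
import torch
import torch.nn as nn

class HybridResidualBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels, batch_first=True
        )

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.conv(x)                                   # local detail from compact filters
        tokens = x.flatten(2).transpose(1, 2)                  # (b, h*w, c) spatial tokens
        context = self.transformer(tokens)                     # global self-attention context
        context = context.transpose(1, 2).reshape(b, c, h, w)
        return x + local + context                             # residual fusion of both paths

x = torch.randn(1, 64, 16, 16)
y = HybridResidualBlock()(x)                                   # (1, 64, 16, 16)
```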