2022
DOI: 10.1109/tmi.2022.3167808
ResViT: Residual Vision Transformers for Multimodal Medical Image Synthesis

Abstract: Generative adversarial models with convolutional neural network (CNN) backbones have recently been established as state-of-the-art in numerous medical image synthesis tasks. However, CNNs are designed to perform local processing with compact filters, and this inductive bias compromises learning of contextual features. Here, we propose a novel generative adversarial approach for medical image synthesis, ResViT, that leverages the contextual sensitivity of vision transformers along with the precision of convolut…
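
The contrast the abstract draws between local convolutional filtering and the contextual sensitivity of self-attention can be made concrete with a short sketch. The tensor sizes and layer choices below are illustrative assumptions, not the ResViT implementation.

```python
# A minimal PyTorch sketch of the locality contrast described in the abstract:
# a compact convolutional filter mixes only a small neighborhood per output pixel,
# whereas a single self-attention layer over flattened positions lets every position
# attend to every other one. Shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)             # (batch, channels, height, width) feature map

# Local processing: a 3x3 convolution, receptive field limited to a 3x3 window
local = nn.Conv2d(64, 64, kernel_size=3, padding=1)(x)

# Contextual processing: flatten spatial positions into a token sequence and apply
# multi-head self-attention, so each of the 256 positions can weight all the others
tokens = x.flatten(2).transpose(1, 2)       # (1, 256, 64) tokens
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
context, _ = attn(tokens, tokens, tokens)   # global pairwise interactions in one layer

print(local.shape, context.shape)           # (1, 64, 16, 16) and (1, 256, 64)
```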

Cited by 208 publications (109 citation statements)
References 104 publications
“…DeepViT suggests establishing cross-head communication to regenerate attention maps and improve their diversity across levels. KVT introduces k-NN attention to exploit the locality of image patches and to disregard noisy tokens by computing attention only over the top-k most similar tokens [37]. Refiner investigates attention expansion in a higher-dimensional space and uses convolution to enrich the local patterns of the attention maps.…”
Section: Transformer Model
confidence: 99%
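
As a rough illustration of the top-k attention idea attributed to KVT in the statement above, the sketch below masks all but the k largest attention scores per query before the softmax. The shapes and the top_k value are assumptions for illustration, not KVT's published implementation.

```python
# A hedged sketch of k-NN (top-k) attention: for each query token, keep only the
# k most similar keys and mask out the rest before softmax, so noisy tokens are
# ignored. Dimensions and top_k are illustrative assumptions.
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k=8):
    # q, k, v: (batch, tokens, dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, tokens, tokens)
    kth = scores.topk(top_k, dim=-1).values[..., -1:]     # k-th largest score per query
    mask = scores < kth                                   # True where a key is outside the top-k
    scores = scores.masked_fill(mask, float("-inf"))      # drop non-neighbors from the softmax
    return F.softmax(scores, dim=-1) @ v                  # attend only over the k nearest tokens

q = k = v = torch.randn(1, 64, 32)
out = knn_attention(q, k, v)                              # (1, 64, 32)
```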
“…When multi-modality protocols are available, many-to-one translation can be performed to improve the reliability of image translation [25]. To do this, SynDiff can be modified to include multiple source modalities as conditioning inputs to its generators [26], [28], [29], [65]. The generators in SynDiff that perform the reverse diffusion steps for denoising were based on convolutional backbones.…”
Section: Discussion
confidence: 99%
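
A minimal sketch of the many-to-one conditioning described in the statement above, assuming the common practice of concatenating the available source modalities along the channel dimension before feeding them to a generator. The generator here is a placeholder module, not SynDiff's actual architecture.

```python
# A hedged sketch of many-to-one conditioning: the available source modalities
# (e.g., T1 and T2 slices) are concatenated channel-wise and passed to a generator
# that predicts the missing target modality. The generator below is a placeholder
# CNN, not SynDiff's adversarial diffusion generator.
import torch
import torch.nn as nn

class ManyToOneGenerator(nn.Module):
    def __init__(self, num_sources=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_sources, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),            # one target modality
        )

    def forward(self, sources):
        # sources: list of (batch, 1, H, W) tensors, one per available source modality
        return self.net(torch.cat(sources, dim=1))

t1 = torch.randn(2, 1, 128, 128)
t2 = torch.randn(2, 1, 128, 128)
pred_target = ManyToOneGenerator(num_sources=2)([t1, t2])   # (2, 1, 128, 128)
```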
“…The generators in SynDiff that perform the reverse diffusion steps for denoising were based on convolutional backbones. Recent imaging studies have reported that transformer architectures with attention mechanisms offer improved sensitivity to long-range context in medical images during synthesis and beyond [65]-[67]. The strength and importance of contextual representations for progressive denoising remain to be demonstrated.…”
Section: Discussion
confidence: 99%
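
The quoted passage refers to generators that carry out reverse diffusion steps for denoising. The sketch below shows a generic DDPM-style reverse step with a placeholder noise-prediction network, conditioned on a source image by channel concatenation; this is an illustrative assumption, not SynDiff's adversarial diffusion formulation.

```python
# A hedged sketch of one reverse diffusion (denoising) step in DDPM style,
# conditioned on a source-modality image by channel concatenation. The noise
# predictor eps_model is a placeholder; SynDiff's actual scheme and schedules differ.
import torch

def reverse_step(x_t, source, t, eps_model, betas):
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps = eps_model(torch.cat([x_t, source], dim=1), t)             # predicted noise
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                                 # final step: no added noise
    noise = torch.randn_like(x_t)
    return mean + betas[t].sqrt() * noise                           # sample x_{t-1}
```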
“…However, this architecture contains both the encoder and decoder parts of the original Transformer, while ViT uses only its encoder. Dalmaz et al. [21] develop a generative adversarial model for medical image synthesis, named ResViT, that combines the localization power of convolutional operators with the contextual sensitivity of ViT.…”
Section: Introduction
confidence: 99%
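
To make the combination described in the statement above concrete, the following is a rough sketch of a residual block that pairs a convolutional path with a transformer-encoder path over the same feature map. It illustrates the general idea only and is not the published ResViT architecture.

```python
# A hedged sketch of a hybrid residual block: a convolutional branch captures local
# structure while a transformer-encoder branch over flattened spatial tokens captures
# long-range context, and both are added back to the input. Illustrative only; the
# aggregated residual transformer blocks in ResViT differ in detail.
import torch
import torch.nn as nn

class HybridResidualBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=2 * channels, batch_first=True
        )

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.conv(x)                                   # local detail from compact filters
        tokens = x.flatten(2).transpose(1, 2)                  # (b, h*w, c) spatial tokens
        context = self.transformer(tokens)                     # global self-attention context
        context = context.transpose(1, 2).reshape(b, c, h, w)
        return x + local + context                             # residual fusion of both paths

x = torch.randn(1, 64, 16, 16)
y = HybridResidualBlock()(x)                                   # (1, 64, 16, 16)
```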