TT-ViT: Vision Transformer Compression Using Tensor-Train Decomposition

Minh, Hoang Pham; Xuan, Nguyen Nguyen; Thai, Son Tran

doi:10.1007/978-3-031-16014-1_59

Cited by 3 publications

(1 citation statement)

References 15 publications


“…Image Pre-processing: The original input images are 600x400 and have red, green, and blue (RGB) channels. These images are first pre-processed in three stages: reshape [31], rescale [32], and conversion into tensors [33]. The images are reshaped to 224x224 dimensions (shown in Figure 3). In the next step, all the images were rescaled: each pixel value is mapped from the initial range of 0-255 to 0-1 by dividing it by 255, as given in equation.…”

Section: Dataset and Environment Setup (mentioning)

confidence: 99%
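
The three pre-processing stages named in the citation statement (reshape to 224x224, rescale from 0-255 to 0-1, convert to a tensor) can be sketched as follows. This is a dependency-free illustration, not the citing authors' code: the function names, the nearest-neighbor resizing method, the channel-first output layout, and the reading of 600x400 as width x height are all assumptions, and a real pipeline would normally use an image library instead of nested lists.

```python
def resize_nearest(image, out_h, out_w):
    """Resize an H x W x C nested list with nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def rescale(image):
    """Map every pixel value from the range 0-255 to 0-1 by dividing by 255."""
    return [[[value / 255 for value in pixel] for pixel in row] for row in image]


def to_chw(image):
    """Rearrange H x W x C data into a channel-first C x H x W layout (assumed layout)."""
    channels = len(image[0][0])
    return [[[pixel[ch] for pixel in row] for row in image] for ch in range(channels)]


def preprocess(image, size=224):
    """Apply the three stages in order: reshape, rescale, convert to tensor layout."""
    return to_chw(rescale(resize_nearest(image, size, size)))


# Synthetic stand-in for one input image: 400 rows x 600 columns, RGB (hypothetical data).
image = [[(255, 128, 0) for _ in range(600)] for _ in range(400)]
tensor = preprocess(image)
```

Here `tensor` has three channels of 224 rows by 224 columns, with every value in the range 0 to 1.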

Vision Transformer for Skin Cancer Identification Based on Contrastive Learning and Adaptive-Scale Fragmentation

Naeem, Yang, Sharif et al. 2024

Preprint


Image processing combined with deep learning has proven to be a breakthrough in medical image diagnosis, including dermoscopic image analysis for skin cancer recognition and classification. Skin cancer cases are increasing every year and pose a significant threat to health. In recent studies, convolutional neural networks (CNNs) have achieved remarkable success in classifying skin cancer images. However, a CNN is limited to extracting features from minor objects in the input dermoscopic image and fails to pinpoint significant regions. Consequently, the researchers of this study have used vision transformers (ViT), known for their robust performance in conventional classification tasks. The self-attention mechanism (SAM) aims to enhance the significance of pivotal characteristics while reducing the influence of noise-inducing features. Specifically, an enhanced transformer network architecture is introduced, and several enhancements are applied to the model to assess its effectiveness. First, a ViT network is implemented to evaluate its efficacy in identifying skin cancer. Next, adaptive-scale image fragmentation is used to process the image sequentially, emphasizing adaptive-scale features through patch embedding. Finally, contrastive learning is employed to ensure that similar skin cancer data is encoded similarly, while different data receive distinct encodings. The skin cancer dataset used in this study, ISIC 2019, is accessible on Kaggle's official website. It consists of dermoscopic images of several types of skin lesion: dermatofibroma, melanoma, actinic keratosis, basal cell carcinoma, nevus, vascular lesion, and pigmented benign keratosis. The ViT model achieved 99.66% accuracy, 94.85% precision, 93.74% recall, and a 94.52% F1-score. For comparison with the proposed ViT model, three deep learning models (Inception V3, MobileNet, and ResNet-50) were also applied with a transfer learning approach, resulting in accuracies of 72%, 94.3%, and 89%, respectively. The transformer network has shown remarkable success in natural language processing and in the domain of image analysis. These achievements establish a solid groundwork for classifying skin cancer using multimodal data. This paper should interest medical researchers, computer engineers, dermatologists, and scholars across related disciplines, and its insights promise to offer enhanced convenience for patients.
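
Contrastive learning, as mentioned in the abstract, is commonly trained with a loss that pulls the embeddings of similar samples together and pushes the embeddings of dissimilar samples apart. The abstract does not say which loss the authors used, so the sketch below assumes the classic pairwise margin-based form; the function names, the Euclidean distance, and the `margin` value are illustrative assumptions only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(a, b, same_class, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form, not the paper's).

    Same-class pairs are penalized by their squared distance, so they are
    pulled together. Different-class pairs are penalized only while they
    are closer than `margin`, so they are pushed apart.
    """
    distance = euclidean(a, b)
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Hypothetical 2-D embeddings at distance 0.5 from each other.
anchor, near = [0.0, 0.0], [0.3, 0.4]
loss_same = pairwise_contrastive_loss(anchor, near, same_class=True)
loss_diff_near = pairwise_contrastive_loss(anchor, near, same_class=False)

# A different-class pair already farther apart than the margin incurs no loss.
loss_diff_far = pairwise_contrastive_loss(anchor, [2.0, 0.0], same_class=False)
```

With this form, a same-class pair only reaches zero loss when its embeddings coincide, whereas a different-class pair reaches zero loss as soon as it is at least `margin` apart.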
