2021
DOI: 10.48550/arxiv.2111.06091
Preprint
A Survey of Visual Transformers

Abstract: Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to computer vision (CV), demonstrating their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO and ADE20k as com…

Cited by 30 publications (29 citation statements)
References 108 publications
“…In this section, we summarize the previous research on breast cancer diagnosis in ultrasound images [10], [11] and the transformer-based medical image classification models [12].…”
Section: Related Work
confidence: 99%
“…Due to the excellent performance of ViT, many Transformer-based image classification models have been proposed, improving ViT from perspectives in five categories [15]…”
Section: ViT-based Image Classification
confidence: 99%
“…These models typically rely on region-based image features extracted by pre-trained object detectors built on commonly used two-stage detectors (typically the Faster R-CNN model [28] or its extension Mask-RCNN [29]), single-stage detectors (typically SSD and YOLO V3 [30]), or anchor-free detectors (e.g., [31]). Another direction is patch embedding [32,33,34,35,36]. This line of work operates directly on patches (as a sequence of tokens with fixed length).…”
Section: Related Work
confidence: 99%
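The patch-embedding direction quoted above splits an image into fixed-size patches and linearly projects each flattened patch into a token, yielding a fixed-length sequence for the Transformer. A minimal NumPy sketch of that idea follows; the function name, patch size, and embedding dimension are illustrative (in a real ViT the projection matrix is learned, not random):

```python
import numpy as np

def patch_embed(image, patch_size=16, embed_dim=64, rng=None):
    """ViT-style patch embedding sketch: split an (H, W, C) image into
    non-overlapping patches and project each flattened patch to embed_dim.
    Returns an array of shape (num_patches, embed_dim)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Rearrange into (num_patches, patch_size * patch_size * C)
    patches = (image.reshape(ph, patch_size, pw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, patch_size * patch_size * C))
    # Shared linear projection (learned in practice; random here)
    W_proj = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ W_proj

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64): a fixed-length token sequence
```

For a 224x224 input with 16x16 patches this produces 14 x 14 = 196 tokens, matching the "sequence of tokens with fixed length" the citation statement describes; detector-based pipelines instead produce a variable number of region features.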