2020
DOI: 10.48550/arxiv.2006.03677
Preprint
Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Abstract: Computer vision has achieved great success using standardized image representations (pixel arrays) and the corresponding deep learning operators (convolutions). In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation for high-level semantics. We then use visual tr…

Cited by 155 publications (171 citation statements)
References 39 publications
“…Transformer was first proposed by [63] for machine translation. Recently, Transformer has achieved great success in high-level vision, such as image classification [1,16,17,44,69], semantic segmentation [7,44,69,80], human pose estimation [5,6,39,41,46,70], and object detection [9,14,30,44,82]. Owing to its ability to capture long-range dependencies and its excellent performance in many high-level vision tasks, Transformer has also been introduced into low-level vision [8,10,42,67].…”
Section: Vision Transformer
confidence: 99%
“…As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply […] mechanism, which grows quadratically with the feature resolution. Similar to previous attention-based methods [14], [22], [95], some methods attempt to insert Transformer into CNN backbones or replace part of the convolution blocks with Transformer layers [43], [44].…”
Section: Vision Transformer (ViT)
confidence: 99%
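The quadratic growth mentioned in the statement above can be made concrete: flattening an H×W feature map into N = HW tokens means self-attention materializes an N×N weight matrix, so doubling the spatial resolution quadruples N and grows that matrix 16×. A minimal sketch, using random stand-in projection matrices (Wq, Wk, Wv are placeholders, not trained weights from any cited model):

```python
import numpy as np

def self_attention(x):
    """Plain single-head self-attention over a flattened feature map.

    x: (N, d) array of N = H*W token vectors. Illustrative sketch only;
    the projection matrices are random stand-ins, not learned parameters.
    """
    n, d = x.shape
    rng = np.random.default_rng(0)
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                 # (N, N): quadratic in H*W
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

h, w, d = 16, 16, 32                              # 16x16 map -> N = 256 tokens
x = np.random.default_rng(1).standard_normal((h * w, d))
out, attn = self_attention(x)
print(attn.shape)  # (256, 256)
```

At 32×32 the attention matrix would already be 1024×1024, which is why the cited works either shrink the spatial resolution first or replace only part of the CNN backbone with Transformer layers.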
“…VTs: Considering that convolution treats all pixels equally regardless of their importance, the Visual Transformer (VT) [43] decouples semantic concepts of the input image into different channels and relates them densely through Transformer encoder blocks. In detail, a VT-block consists of three parts.…”
Section: Vision Transformer (ViT)
confidence: 99%
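The tokenization step that the statement above describes — grouping pixels into a small set of semantic tokens — can be sketched as the filter-based tokenizer from the VT paper: a spatial softmax assigns each pixel to tokens, and each token is the attention-weighted average of pixel features. This is a minimal sketch; the weight matrix here is a random stand-in for the learned assignment weights:

```python
import numpy as np

def filter_tokenizer(x, w_a):
    """Filter-based tokenizer sketch: compress HW pixels into L visual tokens.

    x:   (HW, C) flattened feature map
    w_a: (C, L) token-assignment weights (learned in the paper; random here)
    Softmax is taken over the spatial dimension, so each token is a
    normalized, attention-weighted pooling of pixel features.
    """
    logits = x @ w_a                                    # (HW, L)
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)                   # softmax over HW
    return a.T @ x                                      # (L, C) visual tokens

rng = np.random.default_rng(0)
hw, c, num_tokens = 14 * 14, 64, 8                      # e.g. a 14x14 map
x = rng.standard_normal((hw, c))
tokens = filter_tokenizer(x, rng.standard_normal((c, num_tokens)))
print(tokens.shape)  # (8, 64)
```

The payoff is the compact representation: downstream Transformer blocks attend over 8 tokens instead of 196 pixels, which is what makes the dense token-to-token relations cheap.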
“…Transformer [90] has achieved great success in Natural Language Processing, and it has been applied to multiple computer vision tasks such as image recognition [20,94] and […] [20] to aggregate the sequential feature maps. Note that there is only an encoder and no decoder in this module, since we only care about the high-level procedure features rather than the specific feature of each step after decoding.…”
Section: Vision Transformer
confidence: 99%