2022
DOI: 10.48550/arxiv.2204.01697
Preprint
MaxViT: Multi-Axis Vision Transformer

Abstract: Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions …
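A minimal sketch of the two token partitions behind the multi-axis attention described in the abstract: blocked local attention groups tokens into non-overlapping windows, while dilated global (grid) attention groups tokens sampled at a fixed stride across the whole map. The function names and NumPy layout below are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def block_partition(x, b):
    """Blocked local attention: split an (H, W, C) map into non-overlapping
    b x b windows; attention is then computed inside each window."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, C)

def grid_partition(x, g):
    """Dilated global attention: form a fixed g x g grid whose tokens are
    H//g (and W//g) apart, so each attention group spans the full image."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

# Toy check: both partitions tile a 16x16 map into groups of 64 tokens,
# local contiguous windows for blocks, strided global groups for the grid.
x = np.random.rand(16, 16, 8)
assert block_partition(x, 8).shape == (4, 64, 8)
assert grid_partition(x, 8).shape == (4, 64, 8)
```

Because both partitions reshape to the same (groups, tokens, channels) layout, the same attention routine can serve both axes; only the grouping differs.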

Cited by 26 publications (27 citation statements)
References 61 publications
“…In computer vision, non-local neural networks [33] also show that adding a self-attention layer after convolution layers enables the model to capture more global information and improves performance on various vision tasks. Recently, a series of vision Transformer variants that apply convolution and self-attention sequentially have also been proposed, including CvT [34], CoAtNet [35], ViTAEv2 [36], and MaxViT [37]. In speech, Gulati et al. [28] introduce Conformer models for ASR and show that adding a convolution block after the self-attention block achieves the best performance compared to applying it before or in parallel with the self-attention.…”
Section: Sequentially
confidence: 99%
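A sketch of the sequential composition this excerpt describes: a self-attention block followed by a convolution block, the ordering the Conformer result favors. The module sizes, kernel width, and normalization placement here are assumptions for illustration, not any cited model's exact recipe.

```python
import torch
import torch.nn as nn

class AttnThenConv(nn.Module):
    """Self-attention first (global mixing), then a convolution block
    (local mixing), each wrapped in a residual connection."""
    def __init__(self, dim=64, heads=4, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)

    def forward(self, x):            # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)    # global mixing first
        x = self.norm(x + a)         # residual + norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv after attention
        return x + c                 # residual around the conv block

# Toy usage on a random sequence.
y = AttnThenConv()(torch.randn(2, 100, 64))
print(y.shape)  # torch.Size([2, 100, 64])
```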
“…Transformer [76] is the de-facto standard architecture in natural language processing. Recently, it has been applied to vision problems by viewing pixels or image patches as tokens [6,16], achieving remarkable performance gains in various computer vision tasks, including image classification [16,36,50,73], object detection [75,49,83], semantic segmentation [82,15,65], etc.…”
Section: Vision Transformer
confidence: 99%
“…CNNs [43][44][45][46][47] are the de-facto model for vision tasks due to their outstanding ability to model local dependency [47][48][49] as well as to extract high-frequency components [19]. With these advantages, CNNs have been rapidly introduced into Transformers in a serial or parallel manner [23][24][25][26][50][51][52]. For serial methods, convolutions are applied at different positions of the Transformer.…”
Section: Related Work
confidence: 99%
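For contrast with the serial block above, a sketch of the "parallel manner" this excerpt mentions: a convolution branch and a self-attention branch run on the same input and their outputs are merged. The layer choices and the additive merge are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParallelConvAttn(nn.Module):
    """Parallel composition: local (conv) and global (attention) branches
    computed side by side on the same input, summed residually."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, 3, padding=1)

    def forward(self, x):                     # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)             # global branch
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return x + a + c                      # merge both branches residually
```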