UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Li, Kunchang; Wang, Yali; Zhang, Junhao; Gao, Peng; Song, Guanglu; Liu, Yu; Li, Hongsheng; Qiao, Yu

doi:10.48550/arxiv.2201.09450

Cited by 32 publications

(54 citation statements)

References 0 publications


“…Recently, Transformer [50] has attracted the attention of computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25].…”

Section: Vision Transformer (mentioning)

confidence: 99%

“…A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25]. Due to the impressive performance, Transformer has also been introduced for low-level vision tasks [5,54,37,29,3,62,28,26].…”

Section: Vision Transformer (mentioning)

confidence: 99%

“…The shallow feature extraction can simply map the input from low-dimensional space to high-dimensional space, while achieving the high-dimensional embedding for each pixel token. Moreover, the early convolutional layer can help learn better visual representation [25] and lead to stable optimization [58]. We then perform deep feature extraction $H_{DF}(\cdot)$ to further obtain the deep feature $F_{DF} \in \mathbb{R}^{H \times W \times C}$ as…”

Section: Motivation (mentioning)

confidence: 99%

“…2(a), more pixels are activated when channel attention is adopted, since global information is involved to calculate the channel attention weights. Besides, many works illustrate that convolution can help Transformer get better visual representation or achieve easier optimization [56,58,60,25,68]. Therefore, we incorporate a channel attention-based convolution block into the standard Transformer block to further enhance the representation ability of the network.…”

Section: Hybrid Attention Block (HAB) (mentioning)

confidence: 99%


Activating More Pixels in Image Super-Resolution Transformer

Chen, Wang, Zhou, et al. 2022

Preprint

Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines channel attention and self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally propose a same-task pre-training strategy to bring further improvement. Extensive experiments show the effectiveness of the proposed modules, and the overall method significantly outperforms the state-of-the-art methods by more than 1dB. Codes and models will be available at https://github.com/chxy95/HAT.


“…Recently, SETR [27] and Segmenter [58] directly adopt vision transformers [22], [23] as the backbone, which capture global context from very early layers. SegFormer [59], PVT [60], Swin [24], and UniFormer [61] create hierarchical structures to make use of multi-resolution features. Leveraging the advance of DETR [62], MaX-DeepLab [63] and MaskFormer [64] view image segmentation from the perspective of mask classification.…”

Section: Transformer-driven Semantic Segmentation (mentioning)

confidence: 99%

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Liu, Zhang, Yang, et al. 2022

Preprint

The performance of semantic segmentation of RGB images can be advanced by exploiting informative features from supplementary modalities. In this work, we propose CMX, a vision-transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize to different sensing modalities encompassing various uncertainties, we consider that comprehensive cross-modal interactions should be provided. CMX is built with two streams to extract features from RGB images and the complementary modality (X-modality). In each feature extraction stage, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate the feature of the current modality by combining the feature from the other modality, in spatial- and channel-wise dimensions. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to mix them for the final semantic prediction. FFM is constructed with a cross-attention mechanism, which enables exchange of long-range contexts, enhancing both modalities' features at a global level. Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performances on four RGB-Depth benchmarks, as well as RGB-Thermal and RGB-Polarization datasets. Besides, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state-of-the-art. Code is available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation


Multi-attention fusion transformer for single-image super-resolution

Li, Cui, et al. 2024

Sci Rep

Recently, Transformer-based methods have gained prominence in image super-resolution (SR) tasks, addressing the challenge of long-range dependence through the incorporation of cross-layer connectivity and local attention mechanisms. However, the analysis of these networks using local attribution maps has revealed significant limitations in leveraging the spatial extent of input information. To unlock the inherent potential of Transformer in image SR, we propose the Multi-Attention Fusion Transformer (MAFT), a novel model designed to integrate multiple attention mechanisms with the objective of expanding the number and range of pixels activated during image reconstruction. This integration enhances the effective utilization of input information space. At the core of our model lies the Multi-attention Adaptive Integration Groups, which facilitate the transition from dense local attention to sparse global attention through the introduction of Local Attention Aggregation and Global Attention Aggregation blocks with alternating connections, effectively broadening the network's receptive field. The effectiveness of our proposed algorithm has been validated through comprehensive quantitative and qualitative evaluation experiments conducted on benchmark datasets. Compared to state-of-the-art methods (e.g. HAT), the proposed MAFT achieves 0.09 dB gains on Urban100 dataset for × 4 SR task while containing 32.55% and 38.01% fewer parameters and FLOPs, respectively.
