2021
DOI: 10.48550/arxiv.2106.05786
Preprint

CAT: Cross Attention in Vision Transformer

Abstract: Since the Transformer has found widespread use in NLP, its potential in CV has been realized and has inspired many new approaches. However, the computation required when word tokens are replaced with image patches after tokenizing the image is vast (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer, termed Cross Attention, which alternates attention within each image patch instead of over the whole image to capt…
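The core idea the abstract describes, computing self-attention only inside each image patch rather than across the full token grid, can be sketched as follows. This is a minimal illustration with assumed sizes and module names, not the paper's released implementation.

```python
# A minimal sketch (not the authors' code) of attention restricted to
# non-overlapping local patches, the idea the abstract describes for
# cutting the quadratic cost of global attention. Sizes are illustrative.
import torch
import torch.nn as nn

class LocalPatchAttention(nn.Module):
    def __init__(self, dim, patch_size, num_heads=4):
        super().__init__()
        self.patch_size = patch_size
        # batch_first=True makes the attention expect (B, L, C) inputs
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map of embedded tokens
        B, H, W, C = x.shape
        p = self.patch_size
        # Group tokens into non-overlapping p x p patches:
        # (B, H/p, p, W/p, p, C) -> (B * num_patches, p*p, C)
        x = x.reshape(B, H // p, p, W // p, p, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)
        # Self-attention only among the p*p tokens of each patch, so the
        # cost grows linearly with image size instead of quadratically.
        x, _ = self.attn(x, x, x)
        # Restore the (B, H, W, C) layout
        x = x.reshape(B, H // p, W // p, p, p, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: a 56x56 map of 96-dim tokens, attention inside 7x7 patches
tokens = torch.randn(2, 56, 56, 96)
out = LocalPatchAttention(dim=96, patch_size=7)(tokens)
print(out.shape)  # torch.Size([2, 56, 56, 96])
```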

Cited by 6 publications (11 citation statements) | References 66 publications
“…Due to the time limit, we did not finish the experiments testing the performance of the model with more encoder and decoder layers connected by a skip connection going through a ViT block, or testing ViT with different embedding dimensions. Also, inspired by recent work on innovative Transformer models for image classification or segmentation that reduce computational complexity through special algorithms [19,9,10], we would also try to introduce them into our model in the future to improve performance and decrease computational complexity.…”
Section: Discussion
confidence: 99%
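The layout this statement alludes to, an encoder skip connection routed through a ViT-style block before reaching the decoder, might look roughly like the following. It is a minimal sketch assuming a U-Net-style encoder/decoder; the class and parameter names are hypothetical and not taken from the cited work.

```python
# Hypothetical sketch: refine a U-Net skip connection with a small
# Transformer encoder before concatenating it into the decoder path.
import torch
import torch.nn as nn

class ViTSkipBlock(nn.Module):
    def __init__(self, channels, num_heads=4, depth=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feat):
        # feat: (B, C, H, W) encoder feature map
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.encoder(tokens)              # attention over all positions
        return tokens.transpose(1, 2).reshape(B, C, H, W)

# The decoder would concatenate the refined skip with its upsampled input:
skip = torch.randn(1, 64, 32, 32)
refined = ViTSkipBlock(channels=64)(skip)
decoder_in = torch.cat([refined, torch.randn(1, 64, 32, 32)], dim=1)
print(decoder_in.shape)  # torch.Size([1, 128, 32, 32])
```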
“…Swin Transformer [34] designs shifted window-based multi-head attention to reduce the computation cost. CAT [70] alternately applies attention within patches and between patches to maintain performance at a lower computational cost and builds a cross-attention hierarchical network. Due to the strong performance of Swin Transformer, it is used as the backbone network.…”
Section: Transformer
confidence: 99%
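The alternation this statement describes, a local attention stage inside each patch followed by a global stage across patches, can be sketched as below. This is an illustrative simplification that summarizes each patch by mean pooling for the cross-patch step; it is not necessarily the paper's exact inner-patch/cross-patch formulation.

```python
# Sketch of one block that alternates attention within patches (local)
# and attention across patch-level tokens (global). Names are illustrative.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim, patch_size, num_heads=4):
        super().__init__()
        self.p = patch_size
        self.inner = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        p = self.p
        nh, nw = H // p, W // p

        # 1) Inner-patch attention: tokens attend only within their patch.
        x = x.reshape(B, nh, p, nw, p, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nh * nw, p * p, C)
        x = self.inner(x, x, x)[0] + x

        # 2) Cross-patch attention: each patch is summarized by mean pooling,
        #    the patch tokens attend to one another, and the update is
        #    broadcast back to every token of that patch.
        patches = x.reshape(B, nh * nw, p * p, C)
        summary = patches.mean(dim=2)                       # (B, nh*nw, C)
        summary = self.cross(summary, summary, summary)[0]  # global mixing
        x = (patches + summary.unsqueeze(2)).reshape(B, nh, nw, p, p, C)

        # Restore (B, H, W, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

out = AlternatingAttentionBlock(dim=96, patch_size=7)(torch.randn(2, 56, 56, 96))
print(out.shape)  # torch.Size([2, 56, 56, 96])
```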
“…To enhance local feature extraction while retaining a convolution-free structure, many works [27][28][29] adapt to the patch structure through local self-attention mechanisms. For example, Swin Transformer limits attention to a single window, which introduces the locality of the convolution operation and reduces the amount of computation.…”
Section: Transformer with Local Attention Enhancement
confidence: 99%
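A quick back-of-the-envelope calculation shows why limiting attention to a window "saves the amount of calculation": global self-attention over N tokens scales with N², while window attention over windows of M tokens scales with N·M. The figures below assume a 56×56 token map and 7×7 windows purely for illustration.

```python
# Compare query-key interaction counts for global vs. window attention.
def attention_pairs(num_tokens, window_tokens=None):
    """Number of query-key interactions in one attention layer."""
    if window_tokens is None:              # global: every token attends to all tokens
        return num_tokens * num_tokens
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens   # = num_tokens * window_tokens

N = 56 * 56   # tokens in the feature map
M = 7 * 7     # tokens per window
print(attention_pairs(N))      # 9834496 pairs for global attention
print(attention_pairs(N, M))   # 153664 pairs for window attention (N/M = 64x fewer)
```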