2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00359
Rethinking Transformer-based Set Prediction for Object Detection

Cited by 236 publications (87 citation statements) | References 20 publications
“…In contrast, to learn global information, Wu et al. [40] proposed a self-attention network for MR imaging and Feng et al. [7] designed a Transformer network for MR imaging. However, self-attention [3,14,36] and Transformers [21,33] have high computational complexity and occupy a large amount of GPU memory. Moreover, Transformers [33,35] are difficult to optimize and require large-scale training datasets.…”
Section: MRI Reconstruction
confidence: 99%
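To make the quoted complexity concern concrete: dense self-attention materializes an attention map that grows quadratically with the token count. The back-of-the-envelope Python sketch below (illustrative numbers, not taken from the cited papers) estimates the memory a single full attention map would need at pixel-level resolution.

```python
# Hypothetical estimate of why pixel-level full self-attention is memory-hungry:
# each head stores a (sequence_length x sequence_length) attention matrix.
seq_len = 256 * 256          # e.g. a 256x256 MR image flattened into pixel tokens
heads = 8                    # illustrative head count
bytes_per_float = 4          # float32

attn_matrix_bytes = heads * seq_len ** 2 * bytes_per_float
print(f"{attn_matrix_bytes / 1e9:.1f} GB for the attention maps alone")  # ~137.4 GB
```

Even before activations and gradients, this single intermediate dwarfs typical GPU memory, which is why the cited works restrict attention or replace it entirely.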
“…However, self-attention [3,14,36] and Transformers [21,33] have high computational complexity and occupy a large amount of GPU memory. Moreover, Transformers [33,35] are difficult to optimize and require large-scale training datasets. Unlike these methods, in this work the proposed Spatial and Fourier Layer (SFL) can simultaneously learn local and global information of the feature, while the memory footprint and computational complexity of the FFT-based SFL are similar to those of a convolution layer.…”
Section: MRI Reconstruction
confidence: 99%
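The quoted statement contrasts attention's cost with an FFT-based layer. As a rough illustration of the general idea (this is not the cited paper's actual SFL; `FourierMixingLayer` and `spectral_weight` are hypothetical names), here is a minimal PyTorch sketch of a Fourier-domain mixing block: a 2-D FFT lets every output position depend on every input position at O(HW log HW) cost, with convolution-like memory.

```python
import torch
import torch.nn as nn

class FourierMixingLayer(nn.Module):
    """Hypothetical sketch of an FFT-based global mixing layer.

    Illustrates the general principle only: filtering in the frequency
    domain mixes information globally without an O((HW)^2) attention map.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Per-channel spectral scaling; a real implementation might learn
        # richer (e.g. complex-valued, per-frequency) spectral weights.
        self.spectral_weight = nn.Parameter(torch.randn(channels, 1, 1) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        freq = torch.fft.rfft2(x, norm="ortho")                       # to frequency domain
        freq = freq * (1.0 + self.spectral_weight)                    # global filtering
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")  # back to spatial domain

# Usage: behaves as a drop-in global-mixing block with conv-like cost.
x = torch.randn(2, 64, 32, 32)
assert FourierMixingLayer(64)(x).shape == x.shape
```

The residual-style `1.0 +` scaling keeps the layer close to identity at initialization, a common stabilizing choice; it is a design guess here, not something stated in the quote.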
“…Recent works such as Swin Transformer [37], PVT [54] and CrossViT [5] improve the architecture from different aspects. Furthermore, other researchers apply vision Transformers to downstream tasks such as semantic segmentation [60,62], object detection [3,51] and multimodal tasks [23]. Meanwhile, more detailed comparisons between vision Transformers and CNNs have been investigated to show their relative strengths and weaknesses [10,14,45,64].…”
Section: Vision Transformers
confidence: 99%
“…Then, the Vision Transformer (ViT) [35] applied a pure Transformer framework to vision tasks, treating an image as a collection of spatial patches. Recently, Transformers have achieved excellent results on a variety of vision tasks [13-15, 42, 45, 46], including image recognition [15, 45-48], semantic segmentation [42], and object detection [13, 14]. For semantic medical image segmentation, Transformer-based architectures fall into two categories: the main one adopts self-attention-like operations to complement CNNs [1, 49-51]; the other uses pure Transformers to build encoder-decoder architectures that capture deep representations and predict the class of each image pixel [42-44, 53].…”
Section: Introduction
confidence: 99%
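Since the quoted passage describes ViT as treating an image as a collection of spatial patches, a short sketch of the standard patch-embedding step may help. This follows the widely used strided-convolution formulation rather than any specific cited implementation; `PatchEmbedding` is an illustrative name.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Hypothetical sketch of ViT-style patch embedding.

    Splits an image into non-overlapping P x P patches and linearly projects
    each one, turning the image into a token sequence that a standard
    Transformer encoder can consume.
    """

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A convolution with kernel == stride == patch size is equivalent
        # to a per-patch linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) -> (batch, num_patches, embed_dim)
        x = self.proj(x)                    # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
assert tokens.shape == (1, 14 * 14, 768)    # 224 / 16 = 14 patches per side
```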