2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00863
End-to-End Video Instance Segmentation with Transformers

Cited by 469 publications (265 citation statements). References 17 publications.
“…In Transformer-based models, a large number of works use CNNs as either the encoder [29], [30], [31], [32], [33] or decoder [34], [35], [36], [37], [38] to capture fine details.…”
Section: Spatial Domain Learning
Mentioning confidence: 99%
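To make the hybrid design in this statement concrete, here is a minimal sketch (PyTorch assumed; the model name, layer sizes, and hyperparameters are illustrative and not taken from any of the cited works) of the "CNN as encoder" pattern: a small convolutional stem captures local detail, and a standard Transformer encoder then models global relations over the flattened feature tokens.

```python
# A minimal CNN-encoder + Transformer sketch (illustrative only).
import torch
import torch.nn as nn

class ConvTransformerHybrid(nn.Module):
    def __init__(self, dim=256, nhead=8, num_layers=4):
        super().__init__()
        # CNN encoder: captures local fine detail before global attention.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, kernel_size=3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.stem(x)                        # (B, dim, H/8, W/8)
        tokens = f.flatten(2).transpose(1, 2)   # (B, HW/64, dim)
        return self.encoder(tokens)             # globally refined tokens

out = ConvTransformerHybrid()(torch.randn(2, 3, 224, 224))  # -> (2, 784, 256)
```

The "CNN as decoder" variants cited in the same statement invert this ordering, using convolutions on the Transformer's output to recover fine spatial detail.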
“…Recently, pioneering works such as ViT [50] and DETR [76] proposed to use transformers to solve vision problems by representing images as sequences of patches. Transformers have been shown to be effective in tasks such as image classification [51,77], object detection [76], semantic/instance segmentation [45], and video segmentation [78]. Specifically, ViT [50] proposed to cut the image into patches, which are then converted to sequences of features and used as inputs to standard transformers.…”
Section: Vision Transformer
Mentioning confidence: 99%
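The patch pipeline this statement describes can be sketched as follows (PyTorch assumed; TinyViT and all hyperparameters are hypothetical stand-ins, not ViT's actual configuration): a stride-p convolution cuts the image into non-overlapping p×p patches and linearly embeds each one, a class token and position embeddings are added, and the sequence passes through a standard Transformer encoder.

```python
# A minimal ViT-style patch-embedding sketch (illustrative only).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, nhead=3, depth=4,
                 num_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A stride-`patch` conv is equivalent to cutting non-overlapping
        # patches and applying a shared linear projection to each.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        t = torch.cat([cls, t], dim=1) + self.pos_embed
        t = self.encoder(t)
        return self.head(t[:, 0])                # classify from [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```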
“…The Transformer [47] is an effective sequence-to-sequence modeling network, and it has achieved state-of-the-art results in NLP tasks following the success of BERT [15]. Owing to this success, it has also been exploited in the computer vision community, where 'CNN + Transformer' has become a popular paradigm [3,49,7,62,31,32,21]. ViT [16] leads another trend of using pure transformers for vision tasks [23,30,54] by dividing images into patch-embedding sequences and feeding them into standard transformers.…”
Section: Transformer for Vision
Mentioning confidence: 99%
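As a rough illustration of the 'CNN + Transformer' paradigm mentioned here, the following DETR-flavored sketch (PyTorch assumed; TinyDETR and its dimensions are hypothetical and heavily simplified relative to the actual DETR of [76]) runs CNN features through a Transformer encoder-decoder and decodes a fixed set of learned queries into per-object class and box predictions.

```python
# A minimal 'CNN + Transformer' detection sketch (DETR-like, simplified).
import torch
import torch.nn as nn

class TinyDETR(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in CNN backbone
            nn.Conv2d(3, dim, kernel_size=8, stride=8), nn.ReLU(inplace=True))
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1: "no object"
        self.box_head = nn.Linear(dim, 4)                  # (cx, cy, w, h)

    def forward(self, x):                         # x: (B, 3, H, W)
        f = self.backbone(x).flatten(2).transpose(1, 2)    # (B, HW/64, dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.transformer(src=f, tgt=q)        # (B, num_queries, dim)
        return self.class_head(h), self.box_head(h).sigmoid()

cls, box = TinyDETR()(torch.randn(2, 3, 224, 224))  # (2,100,92), (2,100,4)
```

The paper indexed on this page, VisTR, extends this query-based design from single images to video clips, decoding instance-level mask sequences end to end.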