2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00290
Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by 41 publications (30 citation statements) | References 37 publications
“…Specifically, they use Mask R-CNN [37] to obtain frame-level instance categories and masks, then propagate them across the entire video clip. Compared to propagation-based methods, which require a complicated processing pipeline to generate sequence results for multiple video instances, transformer-based methods have recently dominated state-of-the-art performance [61,62,63,64,65]. Thanks to their strong ability to capture global context, these models directly learn to segment mask sequences during training and produce sequence-level predictions in a single inference pass.…”
Section: Video Instance Segmentation (mentioning)
confidence: 99%
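The propagation-based pipeline this excerpt contrasts against can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the cited implementation: per-frame instance masks (e.g., from Mask R-CNN) are greedily linked across frames by mask IoU to form video-level instances. The helper name `propagate_masks` and the 0.5 IoU threshold are assumptions for illustration.

```python
import torch

def propagate_masks(frame_masks):
    """Hypothetical sketch of frame-level propagation: `frame_masks` is a
    list of (N_t, H, W) boolean mask tensors, one per frame. Masks in
    frame t are greedily linked to frame t-1 by mask IoU, stitching
    per-frame detections into video-level instance tracks."""
    tracks = [list(range(len(frame_masks[0])))]  # frame 0: each mask starts a track
    next_id = len(frame_masks[0])
    for prev, cur in zip(frame_masks[:-1], frame_masks[1:]):
        # pairwise mask IoU between current and previous frame: (N_cur, N_prev)
        inter = (cur.unsqueeze(1) & prev.unsqueeze(0)).flatten(2).sum(-1)
        union = (cur.unsqueeze(1) | prev.unsqueeze(0)).flatten(2).sum(-1)
        iou = inter / union.clamp(min=1)
        ids, taken = [], set()
        for i in range(len(cur)):
            j = int(iou[i].argmax())
            if iou[i, j] > 0.5 and j not in taken:  # link to existing track
                ids.append(tracks[-1][j]); taken.add(j)
            else:                                   # otherwise start a new track
                ids.append(next_id); next_id += 1
        tracks.append(ids)
    return tracks  # tracks[t][i] = video-level id of mask i in frame t
```

Even in this toy form, the quote's point is visible: the result quality hinges on a hand-tuned matching heuristic applied frame by frame, whereas the transformer-based methods below predict the whole mask sequence directly.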
“…of approaches [13,25,47,48,55] divide the whole video into multiple overlapping clips and process the video clip by clip.…”
Section: Introduction (mentioning)
confidence: 99%
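As a concrete illustration of that clip-by-clip scheme, the sketch below splits a frame index range into fixed-length clips with a configurable overlap. `split_into_clips` and its parameters are hypothetical names; real systems additionally stitch per-clip instance predictions together in the overlapping frames.

```python
def split_into_clips(num_frames, clip_len, overlap):
    """Split a video of `num_frames` frames into clips of `clip_len`
    frames, where consecutive clips share `overlap` frames.
    Returns a list of (start, end) index pairs (end exclusive)."""
    assert 0 <= overlap < clip_len
    stride = clip_len - overlap
    clips, start = [], 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips

# e.g. a 100-frame video, 36-frame clips, 6-frame overlap:
# [(0, 36), (30, 66), (60, 96), (90, 100)]
print(split_into_clips(100, 36, 6))
```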
“…Thanks to emerging advances in vision transformer architectures [9,33,40,58], recent transformer-based VIS works [13,25,47,48,55] follow the second, clip-level paradigm and represent each instance as a learned query embedding. Specifically, VisTR [47] is the first approach to apply transformers to VIS.…”
Section: Introduction (mentioning)
confidence: 99%
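The query-embedding paradigm this excerpt describes can be sketched with standard PyTorch modules. This is a minimal illustration of the general clip-level idea, not VisTR itself: each learned query attends over flattened spatio-temporal features and yields one clip-level instance prediction. All module and dimension choices here are assumptions.

```python
import torch
import torch.nn as nn

class ClipLevelQueryDecoder(nn.Module):
    """Minimal sketch of the clip-level query paradigm: each of the
    `num_queries` learned embeddings represents one instance across the
    whole clip; the decoder attends over flattened spatio-temporal
    features to produce per-instance, clip-level predictions."""
    def __init__(self, dim=256, num_queries=10, num_classes=40):
        super().__init__()
        self.instance_queries = nn.Embedding(num_queries, dim)  # one per instance
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"

    def forward(self, feats):
        # feats: (B, T, H, W, C) spatio-temporal backbone features
        B, T, H, W, C = feats.shape
        memory = feats.reshape(B, T * H * W, C)  # flatten the clip into tokens
        queries = self.instance_queries.weight.unsqueeze(0).expand(B, -1, -1)
        inst = self.decoder(queries, memory)     # (B, num_queries, C)
        # class logits per query; `inst` would also feed mask heads
        return self.class_head(inst), inst

feats = torch.randn(2, 4, 16, 16, 256)  # 2 clips, 4 frames each
logits, inst_emb = ClipLevelQueryDecoder()(feats)
```

Because each query is tied to one instance for the entire clip, tracking falls out of the representation itself; no explicit frame-to-frame matching step, as in the propagation sketch above, is needed.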
“…Although many papers have proposed various solutions, the most notable performance improvements have been achieved by recent online methods using image-based backbones [14,33]. These results contradict the common intuition that end-to-end semi-online or offline approaches (i.e., [5,13,15,30,32,37]) trained on longer video clips would be better at modeling long-range object relationships.…”
Section: Introduction (mentioning)
confidence: 99%
“…Underline and bold denote the highest accuracy using ResNet-50 and Swin-L, respectively. † denotes using the MsgShifT [37] backbone.…”
(mentioning)
confidence: 99%