2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00290
Temporally Efficient Vision Transformer for Video Instance Segmentation

Cited by 41 publications (30 citation statements) | References 37 publications
“…Specifically, they use Mask R-CNN [37] to obtain frame-level instance categories and masks, then propagate them across the entire video clip. Compared to propagation-based methods, which require a complicated processing pipeline to generate sequence results for multiple video instances, transformer-based methods have recently dominated state-of-the-art performance [61,62,63,64,65]. Thanks to their strong ability to capture global context, these models directly learn to segment mask sequences during training and produce sequence-level predictions in a single inference pass.…”
Section: Video Instance Segmentation (mentioning)
confidence: 99%
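The propagation-based pipeline this excerpt contrasts against can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the cited implementation: per-frame instance masks (e.g., from Mask R-CNN) are greedily linked across frames by mask IoU to form video-level instances. The helper name `propagate_masks` and the 0.5 IoU threshold are assumptions for illustration.

```python
import torch

def propagate_masks(frame_masks):
    """Hypothetical sketch of frame-level propagation: `frame_masks` is a
    list of (N_t, H, W) boolean mask tensors, one per frame. Masks in
    frame t are greedily linked to frame t-1 by mask IoU, stitching
    per-frame detections into video-level instance tracks."""
    tracks = [list(range(len(frame_masks[0])))]  # frame 0: each mask starts a track
    next_id = len(frame_masks[0])
    for prev, cur in zip(frame_masks[:-1], frame_masks[1:]):
        # pairwise mask IoU between current and previous frame: (N_cur, N_prev)
        inter = (cur.unsqueeze(1) & prev.unsqueeze(0)).flatten(2).sum(-1)
        union = (cur.unsqueeze(1) | prev.unsqueeze(0)).flatten(2).sum(-1)
        iou = inter / union.clamp(min=1)
        ids, taken = [], set()
        for i in range(len(cur)):
            j = int(iou[i].argmax())
            if iou[i, j] > 0.5 and j not in taken:  # link to existing track
                ids.append(tracks[-1][j]); taken.add(j)
            else:                                   # otherwise start a new track
                ids.append(next_id); next_id += 1
        tracks.append(ids)
    return tracks  # tracks[t][i] = video-level id of mask i in frame t
```

Even in this toy form, the quote's point is visible: the result quality hinges on a hand-tuned matching heuristic applied frame by frame, whereas the transformer-based methods below predict the whole mask sequence directly.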
“…of approaches [13,25,47,48,55] divide the whole video into multiple overlapping clips and process the video clip by clip.…”
Section: Introduction (mentioning)
confidence: 99%
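As a concrete illustration of that clip-by-clip scheme, the sketch below splits a frame index range into fixed-length clips with a configurable overlap. `split_into_clips` and its parameters are hypothetical names; real systems additionally stitch per-clip instance predictions together in the overlapping frames.

```python
def split_into_clips(num_frames, clip_len, overlap):
    """Split a video of `num_frames` frames into clips of `clip_len`
    frames, where consecutive clips share `overlap` frames.
    Returns a list of (start, end) index pairs (end exclusive)."""
    assert 0 <= overlap < clip_len
    stride = clip_len - overlap
    clips, start = [], 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips

# e.g. a 100-frame video, 36-frame clips, 6-frame overlap:
# [(0, 36), (30, 66), (60, 96), (90, 100)]
print(split_into_clips(100, 36, 6))
```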
“…Thanks to emerging advances in vision transformer architectures [9,33,40,58], recent transformer-based VIS works [13,25,47,48,55] follow the second, clip-level paradigm and represent each instance as a learned query embedding. Specifically, VisTR [47] is the first approach to apply transformers to VIS.…”
Section: Introduction (mentioning)
confidence: 99%
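The query-embedding paradigm this excerpt describes can be sketched with standard PyTorch modules. This is a minimal illustration of the general clip-level idea, not VisTR itself: each learned query attends over flattened spatio-temporal features and yields one clip-level instance prediction. All module and dimension choices here are assumptions.

```python
import torch
import torch.nn as nn

class ClipLevelQueryDecoder(nn.Module):
    """Minimal sketch of the clip-level query paradigm: each of the
    `num_queries` learned embeddings represents one instance across the
    whole clip; the decoder attends over flattened spatio-temporal
    features to produce per-instance, clip-level predictions."""
    def __init__(self, dim=256, num_queries=10, num_classes=40):
        super().__init__()
        self.instance_queries = nn.Embedding(num_queries, dim)  # one per instance
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"

    def forward(self, feats):
        # feats: (B, T, H, W, C) spatio-temporal backbone features
        B, T, H, W, C = feats.shape
        memory = feats.reshape(B, T * H * W, C)  # flatten the clip into tokens
        queries = self.instance_queries.weight.unsqueeze(0).expand(B, -1, -1)
        inst = self.decoder(queries, memory)     # (B, num_queries, C)
        # class logits per query; `inst` would also feed mask heads
        return self.class_head(inst), inst

feats = torch.randn(2, 4, 16, 16, 256)  # 2 clips, 4 frames each
logits, inst_emb = ClipLevelQueryDecoder()(feats)
```

Because each query is tied to one instance for the entire clip, tracking falls out of the representation itself; no explicit frame-to-frame matching step, as in the propagation sketch above, is needed.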
“…Although many papers have proposed various solutions, the most notable performance improvements have been achieved by recent online methods using image-based backbones [14,33]. These results contradict the common intuition that end-to-end semi-online or offline approaches (i.e., [5,13,15,30,32,37]) trained on longer video clips would be better at modeling long-range object relationships.…”
Section: Introduction (mentioning)
confidence: 99%
“…Underline and bold denote the highest accuracy using ResNet-50 and Swin-L, respectively. † denotes using the MsgShifT [37] backbone.…”
(mentioning)
confidence: 99%