Occluded Video Instance Segmentation: A Benchmark

Qi, Jiyang; Gao, Yan; Hu, Yao; Wang, Xinggang; Liu, Xiaoyu; Bai, Xiang; Belongie, Serge; Yuille, Alan; Torr, Philip H. S.; Bai, Song

doi:10.48550/arxiv.2102.01558

Cited by 19 publications (42 citation statements)

References 40 publications (79 reference statements)

“…Recently, more challenging benchmarks such as OVIS [51] and YouTube-VIS-2021 [71] are proposed to further promote the advancement of this field. CrossVIS is evaluated on three VIS benchmarks and shows competitive performances.…”

Section: Related Work (mentioning)

confidence: 99%

Crossover Learning for Fast Online Video Instance Segmentation

Yang, Fang, Wang, et al. 2021

Preprint

Self Cite

Modeling temporal visual context across frames is critical for video instance segmentation (VIS) and other video understanding tasks. In this paper, we propose a fast online VIS model named CrossVIS. For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames. Different from previous schemes, crossover learning does not require any additional network parameters for feature enhancement. By integrating with the instance segmentation loss, crossover learning enables efficient cross-frame instance-to-pixel relation learning and brings cost-free improvement during inference. Besides, a global balanced instance embedding branch is proposed for more accurate and more stable online instance association. We conduct extensive experiments on three challenging VIS benchmarks, i.e., YouTube-VIS-2019, OVIS, and YouTube-VIS-2021 to evaluate our methods. To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy. Code will be available to facilitate future research.
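The crossover idea in this abstract, taking an instance feature from one frame and using it to localize the same instance pixel by pixel in another frame, can be illustrated with a toy sketch. This is not the CrossVIS implementation: the function name `localize_instance`, the plain dot product standing in for the learned mask head, and the tiny feature values are all assumptions made for illustration.

```python
import math


def localize_instance(instance_feature, frame_features):
    """Score every pixel of another frame against one instance feature.

    instance_feature: list of C floats taken from the current frame (hypothetical).
    frame_features:   H x W x C nested lists from a different frame (hypothetical).
    Returns an H x W grid of probabilities in (0, 1).
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Assumption: a single dot product replaces the real mask head.
    return [
        [sigmoid(sum(w * f for w, f in zip(instance_feature, pixel))) for pixel in row]
        for row in frame_features
    ]


# Instance feature from frame t, applied to the per-pixel features of frame t+1.
instance_feature = [2.0, -1.0]
other_frame = [
    [[3.0, 0.0], [0.0, 3.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
probs = localize_instance(instance_feature, other_frame)
mask = [[p > 0.5 for p in row] for row in probs]
```

The sketch adds no parameters of its own: the only "weights" are the instance feature itself, which loosely mirrors the abstract's claim that crossover learning needs no additional network parameters.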

“…Instance Segmentation in Videos: Multi-instance Segmentation in Videos has recently emerged as a popular field due to its applicability in autonomous driving and robotics. Some of the popular tasks in this domain are Video Object Segmentation (VOS) [6,29], Video Instance Segmentation (VIS) [45], and the more recent Occluded Video Instance Segmentation (OVIS) [32]. Here the primary goal is to segment all object instances in a video and associate them over time.…”

Section: Related Work (mentioning)

confidence: 99%

“…OVIS. Occluded Video Instance Segmentation [32] comprises 5,233 videos with labeled masks for 25 known object classes. The dataset is similar to YouTube-VIS in that it also uses mean Average Precision (mAP) as the evaluation measure, but is more challenging since it comprises longer videos where objects undergo significant occlusion.…”

Section: Benchmarks (mentioning)

confidence: 99%
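The statement above names mean Average Precision (mAP) as the shared evaluation measure but does not say how a predicted mask sequence is matched to a ground-truth one. A minimal sketch follows, assuming the YouTube-VIS-style convention in which IoU is accumulated over all frames of the video rather than computed per frame. The function name `video_mask_iou` and the set-of-pixels mask representation are hypothetical choices for illustration.

```python
def video_mask_iou(pred_masks, gt_masks):
    """IoU of two mask sequences, accumulated over all frames.

    Each argument is a list of frames; each frame is a set of (row, col)
    pixel coordinates covered by the instance. An empty set means the
    instance is absent (for example, fully occluded) in that frame.
    """
    if len(pred_masks) != len(gt_masks):
        raise ValueError("sequences must cover the same frames")
    intersection = sum(len(p & g) for p, g in zip(pred_masks, gt_masks))
    union = sum(len(p | g) for p, g in zip(pred_masks, gt_masks))
    return intersection / union if union else 0.0


# Three-frame example: the prediction misses the object in frame 2.
pred = [{(0, 0), (0, 1)}, set(), {(1, 1)}]
gt = [{(0, 0)}, {(2, 2)}, {(1, 1)}]
iou = video_mask_iou(pred, gt)
```

Under this convention a frame where the prediction disappears still adds to the union, so losing an object during occlusion lowers the sequence-level IoU even if every other frame is segmented well.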

“…• We propose a novel D^2Conv3D operator which can be used as drop-in replacements for standard convolutions in 3D CNNs to improve their performance on video segmentation tasks. • We experimentally justify the efficacy of D^2Conv3D by applying it to two different 3D CNN based architectures [1,28] and evaluating them on five different benchmarks [29,6,45,40,32]. • We set a new state-of-the-art on the DAVIS 2016 Unsupervised challenge [29] by achieving a J&F score of 86.0%.…”

Section: Introduction (mentioning)

confidence: 99%

D^2Conv3D: Dynamic Dilated Convolutions for Object Segmentation in Videos

Schmidt, Athar, Mahadevan, et al. 2021

Preprint

Despite receiving significant attention from the research community, the task of segmenting and tracking objects in monocular videos still has much room for improvement. Existing works have simultaneously justified the efficacy of dilated and deformable convolutions for various image-level segmentation tasks. This gives reason to believe that 3D extensions of such convolutions should also yield performance improvements for video-level segmentation tasks. However, this aspect has not yet been explored thoroughly in existing literature. In this paper, we propose Dynamic Dilated Convolutions (D^2Conv3D): a novel type of convolution which draws inspiration from dilated and deformable convolutions and extends them to the 3D (spatio-temporal) domain. We experimentally show that D^2Conv3D can be used to improve the performance of multiple 3D CNN architectures across multiple video segmentation related benchmarks by simply employing D^2Conv3D as a drop-in replacement for standard convolutions. We further show that D^2Conv3D out-performs trivial extensions of existing dilated and deformable convolutions to 3D. Lastly, we set a new state-of-the-art on the DAVIS 2016 Unsupervised Video Object Segmentation benchmark. Code is made publicly available at https://github.com/Schmiddo/d2conv3d.
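To make the "dilated convolution in the spatio-temporal domain" idea concrete, the sketch below lists where a 3D kernel samples its input for a given per-axis dilation. It covers only ordinary, fixed dilation: the abstract does not spell out how D^2Conv3D chooses its dilation, so here it is simply passed in as an argument. The function names `dilated_offsets_3d` and `kernel_extent` are hypothetical and assume an odd kernel size.

```python
from itertools import product


def dilated_offsets_3d(dilation, kernel_size=3):
    """Return the (t, y, x) sampling offsets of a dilated 3D kernel.

    dilation: (dt, dy, dx) spacing between kernel taps along time,
    height and width. Assumes an odd kernel_size centred on the output.
    """
    half = kernel_size // 2
    taps = range(-half, half + 1)
    dt, dy, dx = dilation
    return [(t * dt, y * dy, x * dx) for t, y, x in product(taps, repeat=3)]


def kernel_extent(dilation, kernel_size=3):
    """Input span covered along each axis: d * (k - 1) + 1."""
    return tuple(d * (kernel_size - 1) + 1 for d in dilation)


# A 3x3x3 kernel with a different dilation on each axis.
offsets = dilated_offsets_3d((1, 2, 4))
extent = kernel_extent((1, 2, 4))
```

The number of taps stays at 27 whatever the dilation, while the covered span grows with it, which is why dilation enlarges the receptive field without adding weights.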

“…For video object identification, we require video object sequences where objects are associated across multiple frames. Hence, to train and evaluate our proposed approach, we used four video instance segmentation datasets: YouTube Video Instance Segmentation (YT-VIS) [51], Unidentified Video Objects (UVO) [47], Occluded Video Instance Segmentation (OVIS) [34], and Tracking Any Object with Video Object Segmentation (TAO-VOS) [8,43]. All these datasets contain a large object vocabulary and various challenging scenarios, including perceptually-aliased occluded objects, as described below:…”

Section: Datasets (mentioning)

confidence: 99%

AirObject: A Temporally Evolving Graph Embedding for Object Identification

Keetha, Wang, Qiu, et al. 2021

Preprint

Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and re-localization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a "fixed" partial object representation from a single viewpoint. In a robot exploration setup, there is a requirement for a temporally "evolving" global object representation built as the robot observes the object from multiple viewpoints. Furthermore, given the vast distribution of unknown novel objects in the real world, the object identification process must be class-agnostic. In this context, we propose a novel temporal 3D object encoding approach, dubbed AirObject, to obtain global keypoint graph-based embeddings of objects. Specifically, the global 3D object embeddings are generated using a temporal convolutional network across structural information of multiple frames obtained from a graph attention-based encoding method. We demonstrate that AirObject achieves the state-of-the-art performance for video object identification and is robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors. To the best of our knowledge, AirObject is one of the first temporal object encoding methods.
