Joint Inductive and Transductive Learning for Video Object Segmentation

Mao, Yunyao; Wang, Ning; Zhou, Wengang; Li, Houqiang

doi:10.1109/iccv48922.2021.00953

Cited by 77 publications

(30 citation statements)

References 33 publications

“…(i) Automatic video object segmentation (VOS) segments objects that are visually and/or motion salient in an image sequence [28], [29]. (ii) Semi-automatic VOS relies on an initial labelled frame and subsequently tracks and segments the initialized objects throughout the sequence [30], [31], [32]. (iii) Semantic VS is concerned with segmenting a finite set of semantic categories that are learned during training [20], [33].…”

Section: Related Work (mentioning)

confidence: 99%

“…Most closely related to the present work are the following examples. Semi-automatic VOS approaches have used unlabeled frames transductively to enforce temporal continuity [31], [32]. Earlier work on video semantic segmentation applied representation warping to fuse features from consecutive frames to ensure temporal consistency of the predictions in an inductive setting [33].…”

Section: Related Work (mentioning)

confidence: 99%

Temporal Transductive Inference for Few-Shot Video Object Segmentation

Siam, Derpanis, Wildes

2022

Preprint

Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training. In this paper, we present a simple but effective temporal transductive inference (TTI) approach that leverages temporal consistency in the unlabelled video frames during few-shot inference. Key to our approach is the use of both global and local temporal constraints. The objective of the global constraint is to learn consistent linear classifiers for novel classes across the image sequence, whereas the local constraint enforces the proportion of foreground/background regions in each frame to be coherent across a local temporal window. These constraints act as spatiotemporal regularizers during the transductive inference to increase temporal coherence and reduce overfitting on the few-shot support set. Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%. In addition, we introduce improved benchmarks that are exhaustively labelled (i.e. all object occurrences are labelled, unlike the currently available), and present a more realistic evaluation paradigm that targets data distribution shift between training and testing sets. Our empirical results and in-depth analysis confirm the added benefits of the proposed spatiotemporal regularizers to improve temporal coherence and overcome certain overfitting scenarios.
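The local constraint described in the abstract above can be illustrated with a small sketch. This is not the authors' loss: the function names, the window size, and the squared-deviation penalty are all assumptions chosen for illustration. It only shows the idea of keeping each frame's foreground proportion close to the average over a local temporal window.

```python
from typing import List

Mask = List[List[float]]  # per-pixel foreground probabilities in [0, 1]


def foreground_proportions(masks: List[Mask]) -> List[float]:
    """Return the fraction of foreground in each frame's soft mask."""
    props = []
    for mask in masks:
        num_pixels = sum(len(row) for row in mask)
        props.append(sum(sum(row) for row in mask) / num_pixels)
    return props


def local_proportion_penalty(props: List[float], window: int = 3) -> float:
    """Hypothetical regularizer: mean squared deviation of each frame's
    foreground proportion from the mean over its local temporal window."""
    half = window // 2
    penalty = 0.0
    for t in range(len(props)):
        lo, hi = max(0, t - half), min(len(props), t + half + 1)
        local_mean = sum(props[lo:hi]) / (hi - lo)
        penalty += (props[t] - local_mean) ** 2
    return penalty / len(props)


if __name__ == "__main__":
    steady = [[[1.0, 0.0], [0.0, 0.0]]] * 4   # same mask in every frame
    print(local_proportion_penalty(foreground_proportions(steady)))
    print(local_proportion_penalty([0.25, 0.25, 0.9, 0.25]))  # one frame jumps
```

A sequence whose foreground proportion stays constant incurs no penalty, while a frame whose proportion jumps relative to its neighbours is penalized.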

“…To address the context limitation, recent state-of-the-art methods use more past frames as feature memory [36,13,64,21,28,58,16]. Particularly, Space-Time Memory (STM) [36] is popular and has been extended by many follow-up works [43,8,18,54,50,31,9,44,33]. Among these extensions, we use STCN [9] as our working memory backbone as it is simple and effective.…”

Section: Related Work (mentioning)

confidence: 99%

“…AOT [60] is a recent work that extends the attention mechanism to transformers but does not solve the GPU memory explosion problem. Some methods [33,14]…”

Section: Related Work (mentioning)

confidence: 99%

“…Table 1 tabulates the quantitative results, and Figure 1 (right) plots the short-term performance against the long-term performance. Methods that use a temporally local feature window (CFBI(+) [59,61], JOINT [33]) have a constant memory cost but fail when they lose track of the context. Methods with a fast-growing memory bank (e.g., STM [36], AOT [60], STCN [9]) are forced to use a low feature memory insertion frequency and do not scale well to long videos.…”

Section: Long-time Video Dataset (mentioning)

confidence: 99%

XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

Cheng, Schwing

2022

Preprint

We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically only uses one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at hkchengrex.github.io/XMem
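The trade-off mentioned in the citation statements above (a constant-cost local window versus a memory bank that grows with video length) can be made concrete with a toy count of stored frames. This is only a sketch under assumed parameters: the function names, the window size of 2, and the insertion interval of 5 are hypothetical and are not taken from any of the cited methods.

```python
from typing import List


def local_window_memory(num_frames: int, window: int = 2) -> List[int]:
    """Frames held in memory at each time step when only the most
    recent `window` frames are kept (constant cost once filled)."""
    return [min(t, window) for t in range(1, num_frames + 1)]


def growing_bank_memory(num_frames: int, insert_every: int = 5) -> List[int]:
    """Frames held in memory at each time step when the first frame and
    every `insert_every`-th later frame are stored and never evicted."""
    return [1 + (t - 1) // insert_every for t in range(1, num_frames + 1)]


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, local_window_memory(n)[-1], growing_bank_memory(n)[-1])
```

Under these assumptions the window stays at 2 stored frames regardless of length, while the bank grows linearly with the number of frames; lowering the insertion frequency slows that growth but does not bound it.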