SwiftNet: Real-time Video Object Segmentation

Wang, Haochen; Jiang, Xiaolong; Ren, Hongliang; Hu, Yao; Bai, Song

doi:10.1109/cvpr46437.2021.00135

Cited by 400 publications

(472 citation statements)

References 40 publications

(7 reference statements)

Supporting

Mentioning

472

Contrasting


“…SwiftNet (Wang et al 2021a) improves STM in memory management and encoder architecture. The method utilises a similar strategy to AFB-URR to build the memory bank, but different update triggers are used.…”

Section: Pixel-level Matching (mentioning)

confidence: 99%

Deep learning for video object segmentation: a review

Gao, Zheng, et al. 2022

Artif Intell Rev


As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancements of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being one of the most advocated and intensively investigated. In this paper, we present a systematic review of the deep learning-based video segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, where their architectures, contributions and the differences among each other are elaborated. At last, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussions on future research directions. This article is expected to serve as a tutorial and source of reference for learners intended to quickly grasp the current progress in this research area and practitioners interested in applying the video object segmentation methods to their problems. A public website is built to collect and track the related works in this field: https://github.com/gaomingqi/VOS-Review.


“…FEELVOS (Voigtlaender et al 2019) proposes a global and a local pixel-level matching mechanism to gather information from the first and previous frames, respectively. Recently, the STM network (Oh et al 2019) is proposed to propagate the non-local object information, which has been a solid baseline in VOS task for its simple architecture and competitive performance (Seong, Hyun, and Kim 2020; Wang et al 2021). GC (Li, Shen, and Shan 2020) improves the STM architecture by only using a fixed-size feature representation and updates a global context to guide the segmentation of current frame.…”

Section: Related Work (mentioning)

confidence: 99%

Siamese Network with Interactive Transformer for Video Object Segmentation

Zhang, He, et al. 2022

AAAI


Semi-supervised video object segmentation (VOS) refers to segmenting the target object in remaining frames given its annotation in the first frame, which has been actively studied in recent years. The key challenge lies in finding effective ways to exploit the spatio-temporal context of past frames to help learn discriminative target representation of current frame. In this paper, we propose a novel Siamese network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames. Technically, we use the transformer encoder and decoder to handle the past frames and current frame separately, i.e., the encoder encodes robust spatio-temporal context of target object from the past frames, while the decoder takes the feature embedding of current frame as the query to retrieve the target from the encoder output. To further enhance the target representation, a feature interaction module (FIM) is devised to promote the information flow between the encoder and decoder. Moreover, we employ the Siamese architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods. Experimental results on three challenging benchmarks validate the superiority of SITVOS over state-of-the-art methods. Code is available at https://github.com/LANMNG/SITVOS.


“…Newly inferred frames can be added to the memory, and then the algorithm propagates forward in time. Derivatives either apply STM at other tasks [20,54], improve the training data or augmentation policy [20,21], augment the memory readout process [15,20,21,23,24], use optical flow [25], or reduce the size of the memory bank by limiting its growth [22,26].…”

Section: Related Work (mentioning)

confidence: 99%

“…Most current methods either fit a model using the initial segmentation [4,5,6,7,8] or leverage temporal propagation [9,10,11,12,13,14,15], particularly with spatio-temporal matching [16,17,18,19,20,21,22,23]. Space-Time Memory networks [17] are especially popular recently due to its high performance and simplicity - many variants [21,15,22,20,23,24,25,26], including competitions' winners [27,28], have been developed to improve the speed, reduce memory usage, or to regularize the memory readout process of STM.…”

Section: Introduction (mentioning)

confidence: 99%

Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

Cheng, Tai, Tang 2021

Preprint


This paper presents a simple yet effective approach to modeling space-time correspondences in the context of video object segmentation. Unlike most existing approaches, we establish correspondences directly between frames without reencoding the mask features for every object, leading to a highly efficient and robust framework. With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion. We cast the aggregation process as a voting problem and find that the existing inner-product affinity leads to poor use of memory with a small (fixed) subset of memory nodes dominating the votes, regardless of the query. In light of this phenomenon, we propose using the negative squared Euclidean distance instead to compute the affinities. We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy. The synergy of correspondence networks and diversified voting works exceedingly well, achieves new state-of-the-art results on both DAVIS and YouTubeVOS datasets while running significantly faster at 20+ FPS for multiple objects without bells and whistles.
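The voting behaviour described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the two-dimensional toy keys, and the single-query setup are all assumptions chosen for illustration. It compares softmax weights computed from an inner-product affinity with those computed from a negative squared Euclidean distance, for a memory in which one key has a much larger norm than the other.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_affinity(query, memory_keys):
    """Inner-product affinity between one query and each memory key."""
    return [sum(q * k for q, k in zip(query, key)) for key in memory_keys]

def l2_affinity(query, memory_keys):
    """Negative squared Euclidean distance between the query and each key."""
    return [-sum((q - k) ** 2 for q, k in zip(query, key)) for key in memory_keys]

def readout(query, memory_keys, memory_values, affinity):
    """Aggregate memory values using softmax-normalised affinities as votes."""
    weights = softmax(affinity(query, memory_keys))
    dim = len(memory_values[0])
    value = [sum(w * v[d] for w, v in zip(weights, memory_values))
             for d in range(dim)]
    return weights, value

# Hypothetical toy memory: key 0 has a large norm, key 1 lies close to the query.
memory_keys = [[10.0, 10.0], [1.0, 0.0]]
memory_values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.1]

dot_weights, dot_value = readout(query, memory_keys, memory_values, dot_affinity)
l2_weights, l2_value = readout(query, memory_keys, memory_values, l2_affinity)

print("inner product votes:", dot_weights)
print("negative squared L2 votes:", l2_weights)
```

In this toy case the large-norm key takes nearly all of the inner-product votes even though it is far from the query, while the distance-based affinity gives nearly all of the weight to the nearby key. This mirrors, in miniature, the dominance effect the abstract attributes to inner-product affinity.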

