Video Object Segmentation Using Space-Time Memory Networks

Oh, Seoung Wug; Lee, Joon-Young; Xu, Ning; Kim, Seon Joo

doi:10.1109/iccv.2019.00932

Cited by 531 publications

(453 citation statements)

References 38 publications

(136 reference statements)

Supporting

Mentioning

417

Contrasting

“…However, this strategy easily leads to overfitting to the initial target appearance and impractically long run-times. More recent methods [34,13,32,23,36,24,17] therefore integrate target-specific appearance models into the segmentation architecture. In addition to improved run-times, many of these methods can also benefit from full end-to-end learning, which has been shown to have a crucial impact on performance [32,14,24].…”

Section: Related Work (mentioning)

confidence: 99%

“…While most state-of-the-art VOS approaches employ similar image feature extractors and segmentation heads, the advances in how to capture and utilize target information has led to much improved performance [14,32,24,28]. A Fig.…”

Section: Introduction (mentioning)

confidence: 99%

“…promising direction is to employ feature matching techniques [13,14,32,24] in order to compare the reference frame with new images to segment. Such feature feature-matching layers greatly benefit from their efficiency and differentiability.…”

Section: Introduction (mentioning)

confidence: 99%

“…Such feature feature-matching layers greatly benefit from their efficiency and differentiability. This allows the design of fully end-to-end trainable architectures, which has been shown to be important for segmentation performance [14,32,24]. On the other hand, feature matching relies on a powerful and generic feature embedding, which may limit its performance in challenging scenarios.…”

Section: Introduction (mentioning)

confidence: 99%

Learning What to Learn for Video Object Segmentation

Bhat, Lawin, Danelljan, et al. 2020

Lecture Notes in Computer Science

108

102

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.

“…Inspired by recent advances in related video tasks, e.g., dense connections for spatio-temporal interaction in action recognition [39] and space-time memory block in video object segmentation [40], we possibly consider these techniques to avoid the latent dependency issues in video SOD. Learning spatial-temporal features in an end-to-end manner is important for further accuracy improvement.…”

Section: Promising Future Work (mentioning)

confidence: 99%

Overview of deep-learning based methods for salient object detection in videos

Wang, Zhang, et al. 2020

Pattern Recognition

Video salient object detection is a challenging and important problem in computer vision domain. In recent years, deep-learning based methods have contributed to significant improvements in this domain. This paper provides an overview of recent developments in this domain and compares the corresponding methods up to date, including 1) classification of the state-of-the-art methods and their frameworks; 2) summary of the benchmark datasets and commonly used evaluation metrics; 3) experimental comparison of the performances of the state-of-the-art methods; 4) suggestions of some promising future works for unsolved challenges.

Local Context Embedding Neural Network for Scene Semantic Segmentation

Dai, Ding, et al. 2019

Pattern Recognition and Computer Vision

Surgical context inference has recently garnered significant attention in robot-assisted surgery as it can facilitate workflow analysis, skill assessment, and error detection. However, runtime context inference is challenging since it requires timely and accurate detection of the interactions among the tools and objects in the surgical scene based on the segmentation of video data. On the other hand, existing state-of-the-art video segmentation methods are often biased against infrequent classes and fail to provide temporal consistency for segmented masks. This can negatively impact the context inference and accurate detection of critical states. In this study, we propose a solution to these challenges using a Space-Time Correspondence Network (STCN). STCN is a memory network that performs binary segmentation and minimizes the effects of class imbalance. The use of a memory bank in STCN allows for the utilization of past image and segmentation information, thereby ensuring consistency of the masks. Our experiments using the publicly-available JIGSAWS dataset demonstrate that STCN achieves superior segmentation performance for objects that are difficult to segment, such as needle and thread, and improves context inference compared to the state-of-the-art. We also demonstrate that segmentation and context inference can be performed at runtime without compromising performance.
