XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model (2022)
DOI: 10.1007/978-3-031-19815-1_37

Cited by 127 publications (84 citation statements) · References 52 publications
“…The experimental results show that videos of complex scenes make the current state-of-the-art VOS methods much less effective, especially in tracking objects that disappear for a while due to occlusions. For example, the J&F performance of XMem [10] on DAVIS 2016 is 92.0% but drops to 57.6% on MOSE, and the J&F performance of DeAOT [11] on DAVIS 2016 is 92.9% but drops to 59.4% on MOSE, which consistently reveals the difficulties brought by complex scenes.…”
Section: Introduction
confidence: 87%
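
The J&F score quoted above is the standard VOS metric: the mean of region similarity J (mask IoU) and contour accuracy F (a boundary F-measure). Below is a minimal sketch of the J side and the final average, assuming binary NumPy masks; the boundary matching behind F is omitted, and the function names are illustrative, not from any paper's codebase.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match by convention
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j_scores, f_scores) -> float:
    """J&F: average of mean region similarity and mean contour accuracy,
    each averaged over all objects and frames in the benchmark."""
    return (np.mean(j_scores) + np.mean(f_scores)) / 2.0
```

Under this definition, XMem's 92.0% on DAVIS 2016 versus 57.6% on MOSE means both mask overlap and boundary quality degrade sharply in the more crowded, occlusion-heavy videos.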
“…Annotators use the tool to load and preview videos and first-frame masks, annotate and visualize the segmentation masks in the subsequent frames, and save them. The annotation tool also has a built-in interactive object segmentation network, XMem [10], to assist annotators in producing high-quality masks. To ensure annotation quality under complex scenes, the annotators are required to keep tracking objects that disappear and reappear due to heavy occlusions and crowding.…”
Section: Video Collection and Annotation
confidence: 99%
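
The workflow this and the following statements describe — annotate the first frame, then let a VOS model carry the mask forward — can be sketched as below. The `segmenter` interface is hypothetical, standing in for an off-the-shelf model such as XMem; it is not the actual XMem API.

```python
import numpy as np

def propagate_masks(frames, first_mask, segmenter):
    """Semi-supervised mask propagation: memorize the annotated first
    frame, then predict a mask for every later frame.

    `segmenter` stands in for a VOS model such as XMem [10];
    `initialize` and `segment` are hypothetical method names."""
    segmenter.initialize(frames[0], first_mask)  # store reference frame + mask
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(segmenter.segment(frame))   # propagate to the next frame
    return masks
```

In an interactive annotation tool, an annotator would inspect the propagated masks and correct frames where the model loses an occluded object, then re-run propagation from the corrected frame.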
“…of the out-of-distribution (OOD) content for each frame, we aim to reconstruct the video with EG3D inversion and perform face editing. For each video, we label the first frame to obtain M₁, and use an off-the-shelf tracking algorithm [12] to propagate it and obtain the other masks M.…”
Section: Methods
confidence: 99%
“…After that, we convert them to EG3D's 5-point landmarks and crop the face out of the input frame. For the segmentation masks, we manually label the first frame and then use an off-the-shelf tracking algorithm [12] to get the masks for the rest of the frames.…”
Section: Methods
confidence: 99%
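
The 5-point landmark cropping step mentioned above typically means estimating a similarity transform that maps the detected landmarks onto a canonical template and warping the frame with it. Here is a minimal OpenCV sketch under that assumption; the template coordinates are illustrative values for a 112×112 crop, and EG3D's actual alignment convention may differ.

```python
import cv2
import numpy as np

# Canonical 5-point template: left eye, right eye, nose tip, left and
# right mouth corners. Values are illustrative, not EG3D's own template.
TEMPLATE = np.array([
    [38.3, 51.7],
    [73.5, 51.5],
    [56.0, 71.7],
    [41.5, 92.4],
    [70.7, 92.2],
], dtype=np.float32)

def crop_face(frame: np.ndarray, landmarks: np.ndarray, size: int = 112) -> np.ndarray:
    """Align a face crop from `frame` given 5-point landmarks, via a
    similarity transform estimated against the canonical template."""
    M, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32), TEMPLATE)
    if M is None:
        raise ValueError("could not estimate alignment transform")
    return cv2.warpAffine(frame, M, (size, size))
```

Estimating only a similarity transform (rotation, uniform scale, translation) keeps the face undistorted, which matters when the crop is fed to a generator such as EG3D for inversion.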