Kevin Duarte scite author profile

CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing

In this work, we propose a capsule-based approach for semi-supervised video object segmentation. Current video object segmentation methods are frame-based and often require optical flow to capture temporal consistency across frames, which can be difficult to compute. To this end, we propose a video-based capsule network, CapsuleVOS, which can segment several frames at once conditioned on a reference frame and segmentation mask. This conditioning is performed through a novel routing algorithm for attention-based efficient capsule selection. We address two challenging issues in video object segmentation: 1) segmentation of small objects and 2) occlusion of objects across time. Small objects are handled by a zooming module, which allows the network to process small spatial regions of the video. In addition, the framework uses a novel memory module based on recurrent networks, which helps track objects when they move out of frame or are occluded. The network is trained end-to-end, and we demonstrate its effectiveness on two benchmark video object segmentation datasets; it outperforms current offline approaches on the YouTube-VOS dataset while running almost twice as fast as competing methods. The code is publicly available at https://github.com/KevinDuarte/CapsuleVOS.

Visual-Textual Capsule Routing for Text-Based Video Segmentation

McIntosh, Duarte, Rawat, et al. 2020

Modeling Multi-Label Action Dependencies for Temporal Action Localization

Tirupattur, Duarte, Rawat, et al. 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Chen, Rouditchenko, Duarte, et al. 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Chen, Rouditchenko, Duarte, et al. 2021

Preprint

Multimodal self-supervised learning is attracting growing attention because it makes it possible not only to train large networks without human supervision but also to search and retrieve data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.

Kevin Duarte

CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing

Visual-Textual Capsule Routing for Text-Based Video Segmentation

Modeling Multi-Label Action Dependencies for Temporal Action Localization

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
