2020
DOI: 10.1007/978-3-030-58555-6_13
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

Cited by 80 publications (70 citation statements)
References 34 publications
“…On the A2D-Sentences and JHMDB-Sentences [11] datasets, MTTR significantly outperforms all existing methods across all metrics. Moreover, we report strong results on the public validation set of Refer-YouTube-VOS [37], a more challenging dataset that has yet to receive attention in the literature.…”
Section: Multimodal Transformer
confidence: 76%
“…As mentioned earlier, this subset contains only the more challenging full-video expressions from the original release of Refer-YouTube-VOS. Compared with existing methods [24,37] which trained and evaluated on the full version of the dataset, our model demonstrates superior performance across all metrics despite being trained on less data and evaluated exclusively on a more challenging subset. Additionally, our method shows competitive performance compared with the methods that led in the 2021 RVOS competition [8,20].…”
Section: Comparison With State-of-the-art Methods
confidence: 94%