Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Ding, Zihan; Hui, Tianrui; Huang, Junshi; Wei, Xiaoming; Han, Jizhong; Liu, Si

doi:10.1109/cvpr52688.2022.00491

Cited by 29 publications

(10 citation statements)

References 28 publications


“…• Referring VOS. Referring video object segmentation [44,45,46,47,48,49,50] is an emerging setting that involves multi-modal information. It gives a natural language expression to indicate the target object and aims at segmenting the target object throughout the video clips.…”

Section: Video Object Segmentation (VOS), mentioning

confidence: 99%

MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Ding, Liu, He, et al. 2023

Preprint

https://henghuiding.github.io/MOSE

Figure 1. Examples of video clips from the coMplex video Object SEgmentation (MOSE) dataset. The selected target objects are masked in orange ◼. The most notable feature of MOSE is its complex scenes, which include the disappearance and reappearance of objects, small or inconspicuous objects, heavy occlusions, and crowded environments. For example, the target player in the 2nd row disappears in the 3rd column and has turned around when he reappears in the 4th and 5th columns, which makes re-identifying him challenging. Most videos in MOSE contain crowded and occluded objects, and the target object is seldom the salient one. The goal of the MOSE dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.


“…They fuse visual and linguistic modalities on early features instead of proposals, whereas the fusion strategies concentrate on employing a cross-modal attention mechanism. Additionally, some works provide better semantic alignment interpretability via graph modeling [49,50], progressive reasoning [11,16,51], or multi-temporal-range learning [7,12,46]. More recently, the Transformer-based models [2,14,18,47,48] are becoming popular due to their powerful representation ability in cross-modal understanding.…”

Section: Automatical Labeling, mentioning

confidence: 99%

Referring Multi-Object Tracking

Wu, Han, Wang, et al. 2023

Preprint

Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts.


“…Ye et al [13] proposed three novel modules: cross-modal self-attention, gated multilevel fusion, and cross-frame self-attention. Ding et al [14] proposed language-bridged duplex transfer to utilize language as an intermediary bridge to solve spatial misalignments or false distractors. Li et al [15] proposed a meta-transfer module for transferring target information from the language domain to the image domain.…”

Section: Related Work, 2.1 Language-guided Video Object Segmentation, mentioning

confidence: 99%

Video Object Segmentation Using Multi-Scale Attention-Based Siamese Network

et al. 2023


Video target segmentation is a fundamental problem in computer vision that aims to segment targets from a background by learning their appearance information and movement information. In this study, a video target segmentation network based on the Siamese structure was proposed. This network has two inputs: the current video frame, used as the main input, and the adjacent frame, used as the auxiliary input. The processing modules for the inputs use the same structure, optimization strategy, and encoder weights. The input is encoded to obtain features with different resolutions, from which good target appearance features can be obtained. After processing using the encoding layer, the motion features of the target are learned using a multi-scale feature fusion decoder based on an attention mechanism. The final predicted segmentation results were calculated from a layer of decoded features. The video object segmentation framework proposed in this study achieved optimal results on CDNet2014 and FBMS-3D, with scores of 78.36 and 86.71, respectively. It outperformed the second-ranked method by 4.3 on the CDNet2014 dataset and by 0.77 on the FBMS-3D dataset. Suboptimal results were achieved on the video primary target segmentation datasets SegTrackV2 and DAVIS2016, with scores of 60.57 and 81.08, respectively.
