2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.00491

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Cited by 29 publications (10 citation statements)
References 28 publications
“…• Referring VOS. Referring video object segmentation [44,45,46,47,48,49,50] is an emerging setting that involves multi-modal information. It gives a natural language expression to indicate the target object and aims at segmenting the target object throughout the video clips.…”
Section: Video Object Segmentation (VOS)
Citation type: mentioning
confidence: 99%
“…They fuse visual and linguistic modalities on early features instead of proposals, whereas the fusion strategies concentrate on employing a cross-modal attention mechanism. Additionally, some works provide better semantic alignment interpretability via graph modeling [49,50], progressive reasoning [11,16,51], or multi-temporal-range learning [7,12,46]. More recently, the Transformer-based models [2,14,18,47,48] are becoming popular due to their powerful representation ability in cross-modal understanding.…”
Section: Automatical Labeling
Citation type: mentioning
confidence: 99%
“…Ye et al [13] proposed three novel modules: cross-modal self-attention, gated multilevel fusion, and cross-frame self-attention. Ding et al [14] proposed language-bridged duplex transfer to utilize language as an intermediary bridge to solve spatial misalignments or false distractors. Li et al [15] proposed a meta-transfer module for transferring target information from the language domain to the image domain.…”
Section: Related Work 21 Language-guided Video Object Segmentation
Citation type: mentioning
confidence: 99%