2021
DOI: 10.1109/tpami.2021.3079993
Cross-Modal Progressive Comprehension for Referring Segmentation

Abstract: Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem in a progressive manner based on informative words in the expression, i.e., first roughly locating candidate entities and then dis…
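The abstract describes a two-stage scheme: first roughly locate candidate entities from entity/attribute words, then single out the target using relationship cues. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation, and every module name, shape, and the sigmoid-gating choice is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ProgressiveComprehension(nn.Module):
    """Hypothetical two-stage sketch: entity perception, then relational inference."""

    def __init__(self, vis_dim=512, lang_dim=512):
        super().__init__()
        self.entity_proj = nn.Linear(lang_dim, vis_dim)
        self.relation_proj = nn.Linear(lang_dim, vis_dim)
        self.refine = nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, visual_feat, entity_word, relation_word):
        # visual_feat: (B, C, H, W); entity_word, relation_word: (B, lang_dim)
        # Stage 1: entity perception. Score every spatial location against
        # the entity/attribute embedding to roughly locate candidates.
        e = self.entity_proj(entity_word)[:, :, None, None]   # (B, C, 1, 1)
        entity_heat = (visual_feat * e).sum(1, keepdim=True)  # (B, 1, H, W)
        candidates = visual_feat * entity_heat.sigmoid()      # gate candidate regions

        # Stage 2: relational inference. Modulate the candidates with the
        # relationship embedding and refine to isolate the target entity.
        r = self.relation_proj(relation_word)[:, :, None, None]
        refined = self.refine(candidates * r.sigmoid())
        return self.mask_head(refined)                        # mask logits (B, 1, H, W)
```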

Cited by 59 publications (29 citation statements)
References 58 publications
“…Yu et al. [34] proposed a modular network that decomposes the input natural language description into subject, location, and relationship attributes to improve localization performance. Liu et al. [35] adopted graph models with an attention mechanism to capture the relationships between object regions in the given image. In association with visual affordance, Mi et al. [36], [37] investigated the use of natural language to guide visual affordance detection.…”
Section: B. Referring Expression Grounding
Citation type: mentioning (confidence: 99%)
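The modular decomposition attributed to Yu et al. [34] can be pictured as learning one soft attention distribution over the words of the expression per module, then pooling the word features into one embedding per module. The sketch below is a hedged illustration, not the authors' published code; the bidirectional LSTM encoder, the dimensions, and the single three-way attention head are all assumptions.

```python
import torch
import torch.nn as nn

class ModularLanguageEncoder(nn.Module):
    """Pool word features into subject / location / relationship embeddings."""

    def __init__(self, word_dim=300, hidden_dim=512, num_modules=3):
        super().__init__()
        self.rnn = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        # One attention scorer per module: subject, location, relationship.
        self.attn = nn.Linear(2 * hidden_dim, num_modules)

    def forward(self, word_embeddings):            # (B, T, word_dim)
        ctx, _ = self.rnn(word_embeddings)         # (B, T, 2*hidden_dim)
        weights = self.attn(ctx).softmax(dim=1)    # per-module attention over words
        # Attention-weighted pooling: one phrase embedding per module.
        module_embs = torch.einsum('btm,btd->bmd', weights, ctx)  # (B, 3, 2*hidden_dim)
        subject, location, relation = module_embs.unbind(dim=1)
        return subject, location, relation
```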
“…Hu et al. [22] designed a bi-directional relationship inferring network to model the relationship between linguistic and visual features. Liu et al. [35] proposed a model that first perceives all the entities in the image according to the entity and attribute words in the expression, then infers the location of the target object from the words that express relationships. Jing et al. [23] first obtain a position prior for the referred object based on the language and image, then generate the segmentation mask based on that position prior.…”
Section: (Inherited Hypernym)
Citation type: mentioning (confidence: 99%)
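The bi-directional relationship inferring of Hu et al. [22] can be approximated with two cross-attention passes, one in each direction, so that each modality is refined by the other before segmentation. The sketch below uses standard multi-head attention as a stand-in; the residual update and all dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical sketch: vision attends to language and language attends to vision."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, H*W, dim) flattened pixel features; lang: (B, T, dim) word features.
        # Direction 1: each pixel gathers the word cues relevant to it.
        vis2, _ = self.vis_from_lang(query=vis, key=lang, value=lang)
        # Direction 2: each word gathers the pixel evidence relevant to it.
        lang2, _ = self.lang_from_vis(query=lang, key=vis, value=vis)
        # Residual refinement of both modalities.
        return vis + vis2, lang + lang2
```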